Tuesday, November 27, 2018

Google+ Migration - Part V: Image Attachments

Google+ has always been rather good at dealing with photos - the photo functions were built on the foundation of Picasa and later spun out as Google Photos. It is not surprising that the platform was popular with photographers and that many posts contain photos.
In the takeout archive, image and other media attachments to posts are rather challenging to handle. In addition to the .json files containing each of the posts, the Takeout/Google+ Stream/Posts directory also includes two files for each image attached to a post. The basename is the originally uploaded filename, with a .jpg extension for the image file itself and a .jpg.metadata.csv extension for a file with some additional information about the image.

If we originally attached an image cat.jpg to a post, there should now be a cat.jpg and a cat.jpg.metadata.csv file in the post directory. However, if over the years we have been unimaginative in naming files and uploaded several cat.jpg images, there is now a name clash, which the takeout archive resolves by arbitrarily naming the files cat.jpg, cat(1).jpg, cat(2).jpg and so on.
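
To get a feeling for how many images are affected by such name clashes, the files in the post directory can be grouped by their presumed original name. The following is only a minimal sketch: the helper names original_name and find_name_clashes are made up for this illustration, and it assumes that the archive always appends the counter as "(N)" directly before the extension, which is what the example above suggests but is not documented anywhere. A file that was genuinely uploaded as cat(1).jpg would be misclassified.

```python
import re
from collections import defaultdict

# Assumed pattern of a takeout de-duplication suffix, e.g. "cat(2).jpg".
DEDUP_SUFFIX = re.compile(r'^(?P<stem>.*)\((?P<n>\d+)\)(?P<ext>\.[^.]+)$')


def original_name(filename):
  """Return filename with a trailing "(N)" counter removed, if present."""
  match = DEDUP_SUFFIX.match(filename)
  if match:
    return match.group('stem') + match.group('ext')
  return filename


def find_name_clashes(filenames):
  """Group filenames by presumed original name, keeping only real clashes."""
  groups = defaultdict(list)
  for name in filenames:
    groups[original_name(name)].append(name)
  return {orig: names for orig, names in groups.items() if len(names) > 1}
```

Calling find_name_clashes on the image filenames of the post directory (for example those returned by os.listdir and ending in .jpg) returns only the original names that map to more than one file, which are exactly the ones where a filename-based match would be ambiguous.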

The main challenge for reconstituting posts is to identify which image file is being referenced from which post. The section of the JSON object which describes an image attachment looks like the example below. There is no explicit reference to the image filename in the archive, nor does the metadata file contain the resourceName indicated here. There is a URL in the metadata file as well, but unfortunately it does not seem to match the one in the post. The only heuristic left to try would be to take the last part of the URL path as an indication of the original filename and try to find a file with the same name. However, this runs into the filename de-duplication issue described above, where the wrong photo could end up linked to a post. For users with a combination of public and private posts, such mix-ups could have very unintended consequences.


"media": {
  "url": "https://lh3.googleusercontent.com/-_liTfYo1Wys/W9SR4loPEyI/AAAAAAACBxA/wD82E3TKRdYBfEXwkExPkUOj0MY5lKCKQCJoC/w900-h1075/cat.jpg",
  "contentType": "image/*",
  "width": 900,
  "height": 1075,
  "resourceName": "media/CixBRjFRaXBPQ21aY2tlQ3h1OFVpamZJMDNpa0lqa1BsSmZ3b1ZNOWRvZlp2Qg\u003d\u003d"
}
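
For illustration, the filename heuristic mentioned above could be sketched as follows. The function name guess_filename_from_url is hypothetical, and the result is only a guess at the originally uploaded name: it cannot tell cat.jpg apart from cat(1).jpg in the archive, which is exactly why the heuristic is not reliable.

```python
import posixpath
from urllib.parse import unquote, urlparse


def guess_filename_from_url(url):
  """Return the last path component of an image URL as a filename guess."""
  path = urlparse(url).path
  # URL paths always use forward slashes, hence posixpath rather than os.path.
  return unquote(posixpath.basename(path))
```

For the URL in the example above, this yields cat.jpg.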

It appears that, at this time, we are unable to reliably reconstruct the post-to-image-file references from the contents of the archive. The alternative is to download each of the URLs referenced in the post data from the Google static content server for as long as these resources are still available.

Fortunately with the given URLs this is rather simple to do in Python. We can process the JSON files once again, find all the image references and download the images to a local cache where they are stored with filenames derived from the (presumably) unique resource names. For further re-formatting of the posts, we can then refer to the downloaded images by their new unique names.

We can use the filter command from the previous blog post to select which posts we are interested in (again all public posts in this case) and pipe the output into this script to build up the image cache:

ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py --public --id communities/113390432655174294208 --id communities/103604153020461235235 --id communities/112164273001338979772 | ./image_cache.py --image-dir=./images


#!/usr/bin/env python3
"""Download the images referenced by the post JSON files listed on stdin."""

import argparse
import json
import os
import sys
import urllib.request


def get_image_name(resource_name):
  # Derive a local filename from the (presumably) unique resource name.
  return resource_name.replace('media/', '', 1) + '.jpg'


def process_image(media, image_dir):
  # Only image attachments are cached; other media types are skipped.
  if media['contentType'] != 'image/*':
    return
  url = media['url']
  if not url.startswith('http'):  # patch for broken URLs...
    url = 'https:' + url
  target_name = os.path.join(image_dir, get_image_name(media['resourceName']))

  if os.path.exists(target_name):
    # Already in the cache: print a progress dot instead of downloading again.
    sys.stdout.write('.')
    sys.stdout.flush()
  else:
    print('retrieving %s as %s' % (url, target_name))
    urllib.request.urlretrieve(url, target_name)


# --------------------
parser = argparse.ArgumentParser(description='Collect post images referenced from a set of posts')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
args = parser.parse_args()

# Create the cache directory if it does not exist yet.
os.makedirs(args.image_dir, exist_ok=True)

# Each line on stdin is the path of one post JSON file.
for filename in sys.stdin:
  filename = filename.strip()
  if not filename:
    continue
  with open(filename, encoding='utf-8') as post_file:
    post = json.load(post_file)
  if 'media' in post:
    # Post with a single attachment.
    process_image(post['media'], args.image_dir)
  elif 'album' in post:
    # Post with an album of several attachments.
    for image in post['album']['media']:
      process_image(image, args.image_dir)
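
Since the images are only available for as long as the static content server keeps them, a single failed download should probably not abort the whole run. The following is a minimal sketch of a more tolerant download step, written for Python 3. The function fetch_image and its retrieve parameter are assumptions made for this illustration and are not part of the script above; retrieve merely allows swapping out the actual download call, for example for testing without network access.

```python
import os
import urllib.error
import urllib.request


def fetch_image(url, target_name, retrieve=urllib.request.urlretrieve):
  """Download url to target_name; return True on success, False on failure."""
  if os.path.exists(target_name):
    return True  # already cached
  try:
    retrieve(url, target_name)
    return True
  except OSError as err:
    # urllib.error.URLError and HTTPError are subclasses of OSError.
    print('failed to retrieve %s: %s' % (url, err))
    # Remove a partially written file so that a later run retries it.
    if os.path.exists(target_name):
      os.remove(target_name)
    return False
```

Because files that are already present are skipped, the script can simply be re-run later to retry the images that failed the first time.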