Sunday, October 28, 2018

Google+ Migration - Part II: Understanding the Takeout Archive


Once the takeout archive has been successfully generated, we can download it and extract it to a local disk. At that point we should find a new directory called Takeout, with the Google+ posts located in Takeout/Google+ Stream/Posts.

This Posts directory contains three types of files:
  • Files containing the data for each post, in JSON format
  • Media files of images or videos uploaded and attached to posts, for example in JPG format
  • Metadata files for each media file, in CSV format, with an additional extension of .metadata.csv
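
The three file types listed above can be told apart by their extensions alone. The following is a minimal sketch of that idea, not part of the takeout tooling: the function names classify and count_file_types are made up for illustration, and it assumes that anything which is neither a .json file nor a .metadata.csv file is a media file.

```python
from collections import Counter
from pathlib import Path

def classify(filename):
    """Return 'metadata', 'post' or 'media' for a file name in the Posts directory."""
    name = filename.lower()
    # Check the metadata suffix first, since it is the most specific one.
    if name.endswith('.metadata.csv'):
        return 'metadata'
    if name.endswith('.json'):
        return 'post'
    # Assumption: every remaining file is an uploaded image or video.
    return 'media'

def count_file_types(posts_dir):
    """Count how many files of each kind a directory contains."""
    return Counter(classify(p.name) for p in Path(posts_dir).iterdir() if p.is_file())
```

Calling count_file_types on the Posts directory gives a quick overview of how many posts and media files the archive holds.
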
The filenames are generated as part of the takeout archive generation process with the following conventions: the post filenames consist of a date in YYYYMMDD format followed by a snippet of the post text, or the word "Post" if there is no text. The media filenames seem to be close to the original names of the files when they were uploaded.

Whenever a filename is not unique, a counter is appended, as in these examples:

20180808 - Post(1).json
20180808 - Post(2).json
20180808 - Post(3).json
20180808 - Post.json
cake.jpg
cake(1).jpg
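
The naming convention for post files can be taken apart again with a regular expression. This is only a sketch based on the examples above: the pattern and the function name parse_post_filename are my own assumptions, and a post whose text snippet itself happens to end in a number in parentheses would be misread as having a counter.

```python
import re
from datetime import datetime

# Assumed pattern: "YYYYMMDD - <snippet>[(<count>)].json"
POST_NAME = re.compile(r'^(\d{8}) - (.*?)(?:\((\d+)\))?\.json$')

def parse_post_filename(filename):
    """Split a post filename into (date, snippet, count), or return None if it does not match."""
    match = POST_NAME.match(filename)
    if match is None:
        return None
    day = datetime.strptime(match.group(1), '%Y%m%d').date()
    # A missing "(n)" suffix is treated as count 0, i.e. the first file with that name.
    count = int(match.group(3)) if match.group(3) else 0
    return day, match.group(2), count
```
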


Filenames which contain Unicode characters outside the basic ASCII range may not be represented correctly on all platforms, and in particular they appear corrupted in the .tgz archive. In the cases I have been able to test, the default .zip archive seems to handle Unicode filenames correctly.
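
To find the filenames that are worth checking by eye, one option is to list the archive members whose names contain non-ASCII characters. The sketch below uses the zipfile module from the Python standard library; the function name non_ascii_members is hypothetical, and the code only lists candidate names, it cannot tell whether a name was corrupted.

```python
import zipfile

def non_ascii_members(archive_path):
    """Return the member names of a .zip archive that contain non-ASCII characters."""
    with zipfile.ZipFile(archive_path) as archive:
        return [name for name in archive.namelist()
                if any(ord(char) > 127 for char in name)]
```
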

Each of the .json post files contains a JSON object with different named sub-objects which can themselves again be objects, lists of objects or elementary types like strings or numbers.

Based on the data which I have been able to analyze from my post archive, the post JSON object contains the following relevant sub-objects:
  • author - information about the user who created the post. In a personal takeout archive, this is always the same user.
  • creationTime and updateTime - timestamps of when the post was originally created and last updated, respectively
  • content - text of the post in HTML-like formatting
  • link, album, media or resharedPost etc. - post attachments of a certain type
  • location - location tag associated with post
  • plusOnes - record of users who have "plussed" the post
  • reshares - records of users who have shared the post
  • comments - record of comments, including comment author info as well as comment content
  • resourceName - unique ID of the post (also available for users, media and other objects)
  • postAcl - visibility of the post, e.g. public, part of a community or visible only to some circles or users.
In particular, this list is missing the representation of collections and of other post attachments like polls or events, as there are no examples of these among my posts.
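
A list like the one above can be put together by counting which top-level keys occur across all the posts of an archive. The following is a minimal sketch of that approach; the function name count_top_level_keys is made up for illustration, and it expects post objects that have already been parsed, for example with json.load.

```python
from collections import Counter

def count_top_level_keys(posts):
    """Count how often each top-level key occurs across parsed post objects."""
    counts = Counter()
    for post in posts:
        counts.update(post.keys())
    return counts
```

Keys which show up in every post, such as the author, stand out from those which only appear for certain kinds of attachments.
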

An example JSON for a very simple post consisting of a single unformatted text line, an attached photo and a location tag - without any recorded post interactions:

{
  "url": "https://plus.google.com/+BernhardSuter/posts/hWpzTm3uYe3",
  "creationTime": "2018-07-15 20:43:51+0000",
  "updateTime": "2018-07-15 20:43:51+0000",
  "author": {
   ... 
  },
  "content": "1 WTC",
  "media": {
    "url": "https://lh3.googleusercontent.com/-IkMqxKEbkxs/W0uyBzAdMII/AAAAAAACAUs/uA8EmZCOdZkKGH5PN5Ct_Xj4oaY2ZNX3ACJoC/w2838-h3785/gplus4828396160699980513.jpg",
    "contentType": "image/*",
    "width": 2838,
    "height": 3785,
    "description": "1 WTC",
    "resourceName": "media/CixBRjFRaXBNRUpjaXh6QTRyckdjNE5Nbmx5blVwTTBjd2lIblh3VWFCek0zMA\u003d\u003d"
  },
  "location": {
    "latitude": 40.7331168,
    "longitude": -74.0108977,
    "displayName": "Hudson River Park Trust",
    "physicalAddress": "353 West St, New York, NY 10011, USA"
  },
  "resourceName": "users/117832126248716550930/posts/UgjyOA5tNBvbgHgCoAEC",
  "postAcl": {
    "visibleToStandardAcl": {
      "circles": [{
        "type": "CIRCLE_TYPE_PUBLIC"
      }]
    }
  }
}
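
Once such a post has been parsed, its fields can be read like any other Python dictionary. The sketch below pulls a few values out of a post shaped like the example above; the function name summarize_post and the ATTACHMENT_KEYS tuple are my own, the timestamp format string is inferred from this one example, and the list of attachment keys only covers the types named earlier.

```python
from datetime import datetime

# Attachment types seen in my archive; other types may exist.
ATTACHMENT_KEYS = ('link', 'album', 'media', 'resharedPost')

def summarize_post(post):
    """Extract creation time, text, attachment types and place name from a parsed post."""
    # Assumed format, e.g. "2018-07-15 20:43:51+0000".
    created = datetime.strptime(post['creationTime'], '%Y-%m-%d %H:%M:%S%z')
    return {
        'created': created,
        'text': post.get('content', ''),
        'attachments': [key for key in ATTACHMENT_KEYS if key in post],
        # Posts without a location tag yield None here.
        'place': post.get('location', {}).get('displayName'),
    }
```
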

JSON is a simple standard data format that can easily be processed programmatically and many supporting libraries already exist. The Python standard library contains a module to parse JSON files and expose the data as native Python data objects to the code for further inspection and processing.

For example, the simple Python program below can be used to determine whether a post has public visibility or not:

#!/usr/bin/env python3
"""Print the names of the post files which have public visibility."""

import json
import sys

def is_public(acl):
  """Return True if the access control object contains the PUBLIC pseudo-circle."""
  circles = acl.get('visibleToStandardAcl', {}).get('circles', [])
  return any(circle.get('type') == 'CIRCLE_TYPE_PUBLIC' for circle in circles)

# Print only the names of the posts which have public visibility.
for filename in sys.argv[1:]:
  # Close each file as soon as it has been parsed.
  with open(filename, encoding='utf-8') as post_file:
    post = json.load(post_file)
  # A post without a postAcl entry is treated as not public.
  if is_public(post.get('postAcl', {})):
    print(filename)


Running this as ./public_posts.py ~/Download/Takeout/Google+\ Stream/Posts/*.json prints only the filenames of the publicly visible posts. By successfully parsing all the .json files in the archive (i.e. without throwing any errors), we can also convince ourselves that the archive contains syntactically valid JSON.
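
If some file did fail to parse, the program would stop at the first error. A variation which collects all failures instead could look like the sketch below; the function name find_invalid_json is hypothetical, and the files are assumed to be UTF-8 encoded.

```python
import json

def find_invalid_json(filenames):
    """Return (filename, error message) pairs for files that fail to parse as JSON."""
    failures = []
    for filename in filenames:
        try:
            with open(filename, encoding='utf-8') as post_file:
                json.load(post_file)
        # JSON syntax errors and decoding errors are both subclasses of ValueError.
        except (OSError, ValueError) as err:
            failures.append((filename, str(err)))
    return failures
```

An empty result for all the .json files in the Posts directory would mean that every one of them is syntactically valid.
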