Sunday, December 9, 2018

Google+ Migration - Part VII: Conversion & Staging

<- Part VI: Location, Location, Location 

We are now ready to put all the pieces together for exporting to Diaspora*, the new target platform.

If we had some sort of "Minitrue" permissions to rewrite history on the target system, the imported posts could appear to always have been there since their original G+ posting date.

However, since we only have regular user permissions, the only choice is to post them as new posts at some future point in time. The most straightforward way to upload the archive would be to re-post in chronological order as quickly as possible without causing overload.

If the new account is not only used for archive purposes, we may want to maximize the relevance of the archive posts in the new stream. In this case, a better way would be to post each archive post on the anniversary of its original post-date, creating some sort of "this day in history" series. This requires the upload activity to be scheduled over at least a year, which causes some operational challenges.
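The date arithmetic behind this schedule is simple: move the original post date into the year of the chosen start date, and push it out by one more year if that anniversary has already passed. A minimal sketch (the helper name is made up for illustration; the actual script below folds the same logic into its get_post_directory function):

import datetime

def next_anniversary(post_date, start_date):
  """Return the first anniversary of post_date that falls on or after start_date."""
  scheduled = post_date.replace(year=start_date.year)
  if scheduled < start_date:
    scheduled = scheduled.replace(year=scheduled.year + 1)
  return scheduled

# A post from Jul 14, 2018 with publishing starting on Mar 1, 2019 would be
# scheduled for Jul 14, 2019; with a later start date it would slip to 2020.
print(next_anniversary(datetime.date(2018, 7, 14), datetime.date(2019, 3, 1)))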

In order to minimize the risk of things going wrong while generating the new posts during this drawn-out, hopefully unattended and automated posting process, we try to do as much of the conversion as possible in a single batch and stage the converted output to be uploaded/posted to the destination system at some planned future time. This also allows for easier inspection of the generated output, or for adapting the process to a different destination system, e.g. a blog.

The following Python script reads a list of post filenames from the takeout archive, extracts the relevant information from the JSON object in each file and generates the new post content in Markdown format. Besides being the input format for Diaspora*, Markdown is widely used and can easily be converted into other formats, including HTML. The list of posts we want to export can be generated using the post_filter.py script from part IV of this series. We also assume that the images referenced in any of these posts have already been downloaded using the image_cache.py script from part V and stored in a location like /tmp/images.

Most of my posts are either photo or link shares with just a line or two of commentary - closer to a Twitter use-case than the long-form posts that G+ supported equally well. The script contains several assumptions that are optimized for this use-case. For example, HTML links are stripped from the text content, assuming that each post shares only one prominent link. Many of my photo sharing posts contain location information, which is extracted here into additional hashtags as well as a location link on OpenStreetMap.

Hashtags are a more central concept on Diaspora* than they were on G+. In addition to some static pre-defined hashtags that identify the posts as automated reposts from G+, further hashtags are added based on the type of post - e.g. photo sharing, stripped-down re-sharing of another post, sharing of a YouTube video, or high-level geo-location info.

Before running the conversion & staging script, we need to decide on which day in the future we want to start posting the archive. Given a staging directory, e.g. /tmp/stage_for_diaspora, the script will create a sub-directory for each day that contains scheduled post activity. In each daily schedule directory, the script creates a unique sub-directory containing a content.md file with the new post text in Markdown as well as any images to be attached. The unique name for each post consists of the original post date plus what appears to be a unique ID in the post URL, in the absence of a real unique post ID in the JSON file. For example, a post originally posted on Jul 14, 2018 would be stored in /tmp/stage_for_diaspora/20190714/20180714_C3RUWSDE7X7/content.md, formatted as:

Port Authority Inland Terminal - from freight hub to Internet switching center.

#repost #bot #gplusarchive #googleplus #throwback #photo #photography #US #UnitedStates #NYC

[111 8th Ave](https://www.openstreetmap.org/?lat=40.7414688&lon=-74.0033873&zoom=17)
Originally posted Sat Jul 14, 2018 on Google+ (Alte Städte / Old Towns)

Or the post which shared the link to the first part of this series would be re-formatted as:

Starting to document the process of migrating my public post stream to diaspora*.  
  
The plan is to process the takeout archive in Python and generate (somewhat) equivalent diaspora* posts using diaspy.  

#repost #bot #gplusarchive #googleplus #throwback

Originally posted Sun Oct 21, 2018 on Google+  (Google+ Mass Migration)

https://blog.kugelfish.com/2018/10/google-migration-part-i-takeout.html

The script also checks the current status of link URLs to avoid sharing a broken link. While we tell our children to be careful since "the Internet never forgets", in reality many links are gone after just a few years - the whole G+ site soon being an example of that.

Since Diaspora* is not particularly well optimized for photo processing, and to help save storage cost on the pod server, the script can also downscale images to a fixed maximum size that is suitable for on-screen display.

For example, by running the script as
./post_transformer.py --image-dir=/tmp/images --staging-dir=/tmp/stage_for_diaspora --start-date=20191001 --image-size=1024 < /tmp/public_posts.txt
we assume that we want to start publishing on Oct 1, 2019, that images are located in /tmp/images and should be limited to a maximum size of 1024 pixels for publishing, and that the whole output will be staged in /tmp/stage_for_diaspora.

Since this script does not do any posting itself, we can run it as many times as we need to, inspect the output and make some adjustments as necessary. Link URL checking and geo-coding (see part VI) require network access from the machine where the script is being executed. In principle, we could manually post the generated output to some target system, but in a future episode, we will demonstrate a fully automated way of posting to Diaspora*.
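For example, to spot-check what was generated for the example post shown above (paths as in the earlier example):

ls /tmp/stage_for_diaspora/20190714/20180714_C3RUWSDE7X7/
cat /tmp/stage_for_diaspora/20190714/20180714_C3RUWSDE7X7/content.md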

In addition to what is already included in the Python (2.7) standard library, we need the following additional packages: python-dateutil, geopy, html2text, Pillow, pycountry and requests.
They can be installed for example using pip: pip install python-dateutil geopy html2text Pillow pycountry requests


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import argparse
import codecs
import datetime
import io
import json
import os
import sys

import dateutil.parser
import geopy.geocoders
import html2text
import PIL.Image 
import pycountry
import requests

ISO_DATE = '%Y%m%d'

HASHTAGS = ['repost', 'bot', 'gplusarchive', 'googleplus', 'throwback']

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(loc):
  """Return hashtags related to the location of the post: ISO country code, country name, city/town."""
  hashtags = []
  if 'latitude' in loc and 'longitude' in loc:
    addr = geocoder.reverse((loc['latitude'], loc['longitude'])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags


def get_location_link(loc):
  """Return a link to OpenStreetMap for the post location."""
  if 'latitude' in loc and 'longitude' in loc and 'displayName' in loc:
    map_url = ('https://www.openstreetmap.org/?lat=%s&lon=%s&zoom=17' % (loc['latitude'], loc['longitude']))
    return '[%s](%s)' % (loc['displayName'], map_url)
  else:
    return None


def validate_url(url):
  """Veify whether a URL still exists, including a potential redirect."""
  user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                 + ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
  try:
    r = requests.get(url, headers=user_agent)
    if r.status_code != 200:
      return None
    return r.url
  except requests.ConnectionError:
    return None
  

def get_image_name(resource_name):
  """Generate image cache name for media resource-name."""
  return resource_name.replace('media/', '', 1) + '.jpg'


def copy_downscale_image(source, destination, max_size):
  """Copy a downscaled version of the image to the staging location."""
  img = PIL.Image.open(source)
  source_size = max(img.size[0], img.size[1])
  if not max_size or source_size <= max_size:
    img.save(destination, 'JPEG')
  else:
    scale = float(max_size) / float(source_size)
    img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)), PIL.Image.LANCZOS)
    img.save(destination, 'JPEG')


def parse_post(post_json):
  """Extract relevant information from a JSON formatted post."""
  post_date = dateutil.parser.parse(post_json['creationTime'])
  content = post_json['content'] if 'content' in post_json else ''
  link = post_json['link']['url'] if 'link' in post_json else ''

  hashtags = HASHTAGS[:] # make a copy
  images = []

  if 'media' in post_json:
    media = post_json['media']
    # If the media is a YouTube URL, convert it into a link-sharing post.
    if media['contentType'] == 'video/*' and 'youtube' in media['url']:
      link = media['url']
      hashtags = hashtags + ['video', 'YouTube']
    elif media['contentType'] == 'image/*':
      hashtags.extend(['photo', 'photography'])
      images.append(get_image_name(media['resourceName']))
    else:
      return None # unsupported media format

  if 'album' in post_json:
    hashtags = hashtags + ['photo', 'photography']
    for image in post_json['album']['media']:
      if image['contentType'] == 'image/*':
        images.append(get_image_name(image['resourceName']))
    if len(images) == 0:
      return None # no supported image attachment in album

  # If a shared post contains a link, extract that link
  # and give credit to original poster.
  if 'resharedPost' in post_json:
    if 'link' in post_json['resharedPost']:
      link = post_json['resharedPost']['link']['url']
      content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
      hashtags.append('reshared')
    else:
      return None # reshare without a link attachment

  acl = post_json['postAcl']
  post_context = {}
  if 'communityAcl' in acl:
    post_context['community'] = acl['communityAcl']['community']['displayName']

  if 'location' in post_json:
    hashtags.extend(get_location_hashtags(post_json['location']))
    location_link = get_location_link(post_json['location'])
    if location_link:
      post_context['location'] = location_link

  return (content, link, hashtags, post_date, post_context, images)


def format_content(content, link, hashtags, post_date, post_context):
  """Generated a Markdown formatted string from the pieces of a post."""
  output = []
  if content:
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    converter.body_width = 0
    output.append(converter.handle(content))
  if hashtags:
    output.append(' '.join(('#' + tag for tag in hashtags)))
    output.append('')
  if 'location' in post_context:
    output.append(post_context['location'])
  if post_date:
    output.append('Originally posted %s on Google+ %s' 
                    % (post_date.strftime('%a %b %d, %Y'),
                       '  (' + post_context['community'] + ')' if 'community' in post_context else ''))
    output.append('')
  if link:
    output.append(link)
    output.append('')
  return u'\n'.join(output)


def get_post_directory(outdir, post_date, start_date, url):
  """Generate staging output directory based on schedule date & post unique ID."""
  post_id = post_date.strftime(ISO_DATE) + '_' + url.split('/')[-1]
  schedule_date = post_date.replace(year=start_date.year, tzinfo=None)
  if schedule_date < start_date:
    schedule_date = schedule_date.replace(year=schedule_date.year + 1)
  return os.path.join(outdir, schedule_date.strftime(ISO_DATE), post_id)
  

# --------------------
parser = argparse.ArgumentParser(description='Convert G+ posts to Markdown and stage them for scheduled re-posting')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--image-size', dest='image_size', action='store', type=int)
parser.add_argument('--start-date', dest='start_date', action='store', type=dateutil.parser.parse, required=True)
parser.add_argument('--refresh', dest='refresh', action='store_true')
args = parser.parse_args()

if not os.path.exists(args.image_dir):
  sys.stderr.write('image-dir not found: ' + args.image_dir + '\n')
  sys.exit(-1)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print ('staging directory: %s' % args.staging_dir)
print ('publish start date: %s' % args.start_date.strftime(ISO_DATE))

count = 0
for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))
  post_data = parse_post(post)

  if post_data:
    content, link, hashtags, post_date, post_context, images = post_data
    post_dir = get_post_directory(args.staging_dir, post_date, args.start_date, post['url'])

    if not args.refresh and os.path.exists(post_dir):
      continue

    # Avoid exporting posts with stale links.
    if link:
      link = validate_url(link)
      if not link:
        print ('\nURL %s not found, skipping export for %s' % (post_data[1], post_dir))
        continue

    # Output content in Markdown format to staging location.
    if not os.path.exists(post_dir):
      os.makedirs(post_dir)
     
    content_file = io.open(os.path.join(post_dir, 'content.md'), 'w', encoding='utf-8')
    content_file.write(format_content(content, link, hashtags, post_date, post_context))
    content_file.close()

    for i, image in enumerate(images):
      source = os.path.join(args.image_dir, image)
      destination = os.path.join(post_dir, 'img_%d.jpg' % i)
      copy_downscale_image(source, destination, args.image_size)
      
    count += 1
    sys.stdout.write('.')
    sys.stdout.flush()
    
print ('%d posts exported to %s' % (count, args.staging_dir))    


Thursday, November 29, 2018

Google+ Migration - Part VI: Location, Location, Location!

<- Part V: Image Attachments

Before we focus on putting all the pieces together, here is a small, optional excursion into how to make use of the location information contained in G+ posts.

We should consider carefully if and how we want to include geo-location information, as there might be privacy and safety implications. For sensitive locations, it can make sense to choose the point of a nearby landmark instead or to add some random noise to the location coordinates.

Many of my public photo sharing posts contain the location of where (or near where) the photos were taken. Diaspora* posts can contain a location tag as well, but it does not seem to be very informative, and the diaspy API currently does not support adding a location to a post.

Instead, we can process the location information contained in the takeout JSON files and transform it into some additional information which we can use when formatting the new posts.

In particular, we want to include a link to the corresponding location on OpenStreetMap as well as generate some additional hashtags from the location information, e.g. for the country or city the post relates to.

Using the longitude & latitude coordinates from the location info, we can link directly to the corresponding location, for example on OpenStreetMap or other online mapping services.

"location": {
    "latitude": 40.7414688,
    "longitude": -74.0033873,
    "displayName": "111 8th Ave",
    "physicalAddress": "111 8th Ave, New York, NY 10011, USA"
  }

In order to extract hierarchical location information like the country or the city of the location, we call the reverse-geocoding API of OpenStreetMap (Nominatim) with the coordinates to find the nearest recorded address for that point. To simplify calling the web API, we can use the geopy library (installed for example with pip install geopy).

From various components of the address, we can generate location hashtags that help define the context of the post. The use of the additional pycountry module, which provides canonical country names keyed by ISO 3166 country codes, is entirely optional but helps to create more consistent labels.
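As a quick illustration of what these two libraries return for the coordinates in the record above (assuming geopy and pycountry are installed; the exact address fields depend on what Nominatim has on record for that point):

import geopy.geocoders
import pycountry

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)
# Reverse-geocode the coordinates from the location record above.
address = geocoder.reverse((40.7414688, -74.0033873)).raw['address']
print(address.get('country_code'))                  # 'us' - upper-cased into the #US hashtag
print(pycountry.countries.get(alpha_2='US').name)   # 'United States' -> #UnitedStates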

For the location record above, we can generate the following additional content snippets:

#US #UnitedStates #NYC


#!/usr/bin/env python

import codecs
import geopy.geocoders
import json
import pycountry
import sys

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(loc):
  hashtags = []
  if 'latitude' in loc and 'longitude' in loc:
    addr = geocoder.reverse((loc['latitude'], loc['longitude'])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags

def get_location_link(loc):
  if 'latitude' in loc and 'longitude' in loc and 'displayName' in loc:
    map_url = ('https://www.openstreetmap.org/?lat=%s&lon=%s&zoom=17' % (loc['latitude'], loc['longitude']))
    return '[%s](%s)' % (loc['displayName'], map_url)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))  
  if 'location' in post:
    print(' '.join(('#' + tag for tag in get_location_hashtags(post['location']))))
    print(get_location_link(post['location']))

Tuesday, November 27, 2018

Google+ Migration - Part V: Image Attachments

<- Part IV: Visibility Scope & Filtering

Google+ has always been rather good at dealing with photos - the photo functions were built on the foundation of Picasa and later spun out as Google Photos. Not surprising that the platform was popular with photographers and many posts contain photos.
In the takeout archive, photos or image/media file attachments to posts are rather challenging to handle. In addition to the .json files containing each of the posts, the Takeout/Google+ Stream/Posts directory also includes two files for each image attached to a post. The basename is the originally uploaded filename, with a .jpg extension for the image file itself and a .jpg.metadata.csv extension for some additional information about the image.

If we originally attached an image cat.jpg to a post, there should now be a cat.jpg and a cat.jpg.metadata.csv file in the post directory. However, if over the years we have been unimaginative in naming files and uploaded several cat.jpg images, there is now a name-clash that the takeout archive resolves by arbitrarily naming the files cat.jpg, cat(1).jpg, cat(2).jpg and so on.

The main challenge for reconstituting posts is to identify which image file is being referenced from which post. The section of the JSON object which describes an image attachment looks like the example below. There is no explicit reference to the image filename in the archive, nor does the metadata file contain the resourceName indicated here. There is a URL in the metadata file as well, but unfortunately it does not seem to match. The only heuristic left to try would be to take the last part of the URL path as an indication of the original filename and look for a file with the same name. However, this runs into the de-duplication issue above, where possibly the wrong photo would be linked to a post. For users with a combination of public and private posts, such mixups could lead to very unintended consequences.


"media": {
      "url": "https://lh3.googleusercontent.com/-_liTfYo1Wys/W9SR4loPEyI/AAAAAAACBxA/wD82E3TKRdYBfEXwkExPkUOj0MY5lKCKQCJoC/w900-h1075/cat.jpg",
      "contentType": "image/*",
      "width": 900,
      "height": 1075,
      "resourceName": "media/CixBRjFRaXBPQ21aY2tlQ3h1OFVpamZJMDNpa0lqa1BsSmZ3b1ZNOWRvZlp2Qg\u003d\u003d"
    }

It appears that at this time, we are unable to reliably reconstruct the post-to-image-file references from the contents of the archive. The alternative is to download each of the URLs referenced in the post data from the Google static content server, for as long as these resources are still available.

Fortunately with the given URLs this is rather simple to do in Python. We can process the JSON files once again, find all the image references and download the images to a local cache where they are stored with filenames derived from the (presumably) unique resource names. For further re-formatting of the posts, we can then refer to the downloaded images by their new unique names.

We can use the filter command from the previous blog-post to select which posts we are interested in (again all public posts in this case) and pipe the output into this script to build up the image cache:

ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py --public --id communities/113390432655174294208 --id communities/103604153020461235235 --id communities/112164273001338979772 | ./image_cache.py --image-dir=./images


#!/usr/bin/env python

import argparse
import codecs
import json
import os
import sys
import urllib
import urlparse

def get_image_name(resource_name):
  return resource_name.replace('media/', '', 1) + '.jpg'

def process_image(media, image_dir):
  url = media['url']
  id = media['resourceName']
  if media['contentType'] != 'image/*':
    return
  if not url.startswith('http'): # patch for broken URLs...
    url = 'https:' + url
  target_name = os.path.join(image_dir, get_image_name(id))

  if os.path.exists(target_name):
    sys.stdout.write('.')
    sys.stdout.flush()
  else:
    print('retrieving %s as %s' % (url, target_name))
    urllib.urlretrieve(url, target_name)

# --------------------
parser = argparse.ArgumentParser(description='Collect post images referenced from a set of posts')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
args = parser.parse_args()

if not os.path.exists(args.image_dir):
  os.makedirs(args.image_dir)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))  
  if 'media' in post:
    process_image(post['media'], args.image_dir)
  elif 'album' in post:
    for image in post['album']['media']:
      process_image(image, args.image_dir)


Sunday, November 18, 2018

Google+ Migration - Part IV: Visibility Scope & Filtering

<- Part III: Content Transformation

Circles, and with them the ability to share different content with different sets of people, were one of the big differentiators of Google+ over other platforms at the time, which typically had a fixed sharing model and visibility scope.

Circles were based on the observation that most people in real life interact with several "social circles" and often would not want these circles to mix. The idea of Google+ was that it should be possible to manage all these different circles under a single online identity (which should also match the "real name" identity of our government's civil registry).

It turns out that while the observation of disjoint social circles was correct, most users prefer to use different platforms and online identities to make sure these circles don't inadvertently mix. Google+ tried hard to make sharing scopes obvious and unsurprising, but the model remained complex and hard to understand, and accidents were only ever one or two mouse-clicks away.

Nevertheless, many takeout archives may contain posts that were intended for very different audiences and whose visibility may still matter deeply to users. We present here a tool that helps to analyze the sharing scopes present in a takeout archive and to partition its content by selecting any subset of them.

The access control section (ACL) of each post has grown even more complex over time with the introduction of communities and collections. In particular, there seem to be the following distinct ways of defining the visibility of a post (some of which can be combined):
  • Public
  • Shared with all my circles
  • Shared with my extended circles (users in all my circles and their circles, presumably)
  • Shared with a particular circle
  • Shared with a particular user
  • Part of a collection (private or public)
  • Part of a community (closed or public)
Since my archive does not contain all these combinations, the code for processing the JSON definition of the post sharing and visibility scope is based on the following inferred schema definition. Please report if you encounter any exception from this structure.
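Roughly, in JSON-like notation (placeholders instead of real values; any of the top-level sections may be absent, and a post can combine several of them):

"postAcl": {
  "visibleToStandardAcl": {
    "circles": [{
      "type": "CIRCLE_TYPE_PUBLIC | CIRCLE_TYPE_YOUR_CIRCLES | CIRCLE_TYPE_EXTENDED_CIRCLES | CIRCLE_TYPE_USER_CIRCLE",
      "resourceName": "circles/...",
      "displayName": "..."
    }],
    "users": [{ "resourceName": "users/...", "displayName": "..." }]
  },
  "collectionAcl": {
    "collection": { "resourceName": "...", "displayName": "..." }
  },
  "communityAcl": {
    "community": { "resourceName": "communities/...", "displayName": "..." },
    "users": [{ "resourceName": "users/...", "displayName": "..." }]
  },
  "eventAcl": {
    "event": { "resourceName": "...", "displayName": "..." }
  }
}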

After saving the following Python code in a file, e.g. post_filter.py, and making it executable (chmod +x post_filter.py), we can start by analyzing the visibility scopes that exist in a list of post archive files:

$ ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py
1249 - PUBLIC 
227 - CIRCLE (Personal): circles/117832126248716550930-4eaf56378h22b473 
26 - ALL CIRCLES 
20 - COMMUNITY (Alte Städte / Old Towns): communities/103604153020461235235 
15 - EXTENDED CIRCLES 
9 - COMMUNITY (Raspberry Pi): communities/113390432655174294208 
1 - COMMUNITY (Google+ Mass Migration): communities/112164273001338979772 
1 - COMMUNITY (Free Open Source Software in Schools): communities/100565566447202592471

For my own purposes, I would consider all public posts as well as posts to public communities as essentially public, and any posts that were restricted to particular circles as essentially private. By carefully copying the community IDs from the output above, we can create the following filter condition to select only the filenames of these essentially public posts from the archive:

ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py --public --id communities/113390432655174294208 --id communities/103604153020461235235 --id communities/112164273001338979772

We can then use the resulting list of filenames to only process posts which are meant to be public. In a similar way, we could also extract posts that were shared with a particular circle or community, e.g. to assist in building a joint post archive for a particular community across its members.

#!/usr/bin/env python

import argparse
import codecs
import json
import sys

class Visibility:
  PUBLIC = 'PUBLIC'
  CIRCLES = 'ALL CIRCLES'
  EXTENDED = 'EXTENDED CIRCLES'
  CIRCLE = 'CIRCLE'
  COLLECTION = 'COLLECTION'
  COMMUNITY = 'COMMUNITY'
  USER = 'USER'
  EVENT = 'EVENT'

def parse_acl(acl):
  result = []

  # Post is public or has a visibility defined by circles and/or users.
  if 'visibleToStandardAcl' in acl:
    if 'circles' in acl['visibleToStandardAcl']:
      for circle in acl['visibleToStandardAcl']['circles']:
        if circle['type'] == 'CIRCLE_TYPE_PUBLIC':
          result.append((Visibility.PUBLIC, None, None))
        elif circle['type'] == 'CIRCLE_TYPE_YOUR_CIRCLES':
          result.append((Visibility.CIRCLES, None, None))
        elif circle['type'] == 'CIRCLE_TYPE_EXTENDED_CIRCLES':
          result.append((Visibility.EXTENDED, None, None))
        elif circle['type'] == 'CIRCLE_TYPE_USER_CIRCLE':
          result.append((Visibility.CIRCLE, circle['resourceName'], circle.get('displayName', '')))
    if 'users' in acl['visibleToStandardAcl']:
      for user in acl['visibleToStandardAcl']['users']:
        result.append((Visibility.USER, user['resourceName'], user.get('displayName', '-')))

  # Post is part of a collection (could be public or private).
  if 'collectionAcl' in acl:
    collection = acl['collectionAcl']['collection']
    result.append((Visibility.COLLECTION, collection['resourceName'], collection.get('displayName', '-')))

  # Post is part of a community (could be public or closed).
  if 'communityAcl' in acl:
    community = acl['communityAcl']['community']
    result.append((Visibility.COMMUNITY, community['resourceName'], community.get('displayName', '-')))
    if 'users' in acl['communityAcl']:
      for user in acl['communityAcl']['users']:
        result.append((Visibility.USER, user['resourceName'], user.get('displayName', '-')))

  # Post is part of an event.
  if 'eventAcl' in acl:
    event = acl['eventAcl']['event']
    result.append((Visibility.EVENT, event['resourceName'], event.get('displayName', '-')))

  return result


#---------------------------------------------------------
parser = argparse.ArgumentParser(description='Filter G+ post JSON file by visibility')
parser.add_argument('--public', dest='scopes', action='append_const', const=Visibility.PUBLIC)
parser.add_argument('--circles', dest='scopes', action='append_const', const=Visibility.CIRCLES)
parser.add_argument('--ext-circles', dest='scopes', action='append_const', const=Visibility.EXTENDED)
parser.add_argument('--id', dest='scopes', action='append')

args = parser.parse_args()
scopes = frozenset(args.scopes) if args.scopes != None else frozenset()

stats = {}
for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))  
  acls = parse_acl(post['postAcl'])
  for acl in acls:
    if len(scopes) == 0:
      stats[acl] = stats.get(acl, 0) + 1
    else:
      if acl[0] in (Visibility.PUBLIC, Visibility.CIRCLES, Visibility.EXTENDED) and acl[0] in scopes:
        print (filename)
      elif acl[1] in scopes:
        print (filename)
          
if len(scopes) == 0:
  sys.stdout = codecs.getwriter('utf8')(sys.stdout)
  for item in sorted(stats.items(), reverse=True, key=lambda x: x[1]):
    if item[0][0] in (Visibility.PUBLIC, Visibility.CIRCLES, Visibility.EXTENDED):
      print ('%d - %s' % (item[1], item[0][0]))
    else:
      print ('%d - %s (%s):\t %s' % (item[1], item[0][0], item[0][2], item[0][1]))




Sunday, November 11, 2018

Google+ Migration - Part III: Content Transformation

<- Part II: Understanding the takeout archive 

After we have had a look at the structure of the takeout archive, we can build some scripts to translate the content of the JSON post description into a format that is suitable for import into the target system, which in our case is Diaspora*.

The following script is a proof-of-concept conversion of a single post file from the takeout archive into a text string that is suitable for upload to a Diaspora* server using the diaspy API.

Images are more challenging and will be handled separately in a later episode. There is also no verification on whether the original post had public visibility and should be re-posted publicly.

The main focus is on the parse_post and format_post methods. The purpose of the parse_post method is to extract the desired information from the JSON representation of a post, while the format_post method uses this data to format the input text needed to create a more or less equivalent post.

While the post content text in the Google+ takeout archive is formatted in pseudo-HTML, Diaspora* posts are formatted in Markdown. In order to convert the HTML input to Markdown output, we can use the html2text Python library.
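For illustration, a minimal example with the same converter settings used in the script below - ignore_links strips link URLs but keeps the anchor text, and body_width=0 disables re-wrapping of long lines (the sample HTML string is made up):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop link URLs, keep only the anchor text
converter.body_width = 0        # do not re-wrap long lines

print(converter.handle('Some <b>bold</b> text and a <a href="https://example.com">link</a>'))
# -> Some **bold** text and a link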

Given the differences in formatting and conventions, there is really no right or wrong way to reformat each post; it is a matter of choice.

The choices made here are:

  • If the original post contained text, the text is included at the top of the post with minimal formatting and any URL links stripped out. Google+ posts may include +<username> references which may look odd. Hashtags should be automatically re-hashtagified on the new system, as long as it uses the hashtag convention.
  • The post includes a series of static hashtags which identify it as an archived post, re-posted from G+. Additional hashtags can be generated during the parsing process, e.g. to identify re-shares or photo posts.
  • The original post date and optional community or collection names are included with each post, as we intend to make it obvious that this is a re-posted archive and not a transparent migration.
  • Link attachments are added at the end and should be rendered as a proper link attachment with preview snippet and image if supported - presumably by using something like the OpenGraph markup annotations of the linked page.
  • Deliberately, no data is included which results from post activity by other users, including likes or re-shares. The only exception is that if a re-shared post includes an external link, this link is included in the post with a "hat tip" to the original poster, using their G+ display name at the time of export.

The functionality to post to Diaspora* is included at this time merely as a demonstration that this can indeed work and is not intended to be used without additional operational safeguards.

#!/usr/bin/env python

import datetime
import json
import sys

import dateutil.parser
import diaspy
import html2text

SERVER = '<your diaspora server URL>'
USERNAME = '<your diaspora username>'
PASSWORD = '<not really a good idea...>'

TOOL_NAME = 'G+ repost'
HASHTAGS = ['repost', 'gplusarchive', 'googleplus', 'gplusrefugees', 'plexodus']


def post_to_diaspora(content, filenames=[]):
  c = diaspy.connection.Connection(pod = SERVER,
                                   username = USERNAME,
                                   password = PASSWORD)
  c.login()
  stream = diaspy.streams.Stream(c)
  stream.post(content, provider_display_name = TOOL_NAME)


def format_post(content, link, hashtags, post_date, post_context):
    output = []

    if content:
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        converter.body_width = 0
        output.append(converter.handle(content))
    
    if hashtags:
        output.append(' '.join(('#' + tag for tag in hashtags)))
        output.append('')

    if post_date:
        output.append('Originally posted on Google+ on %s%s' 
                      % (post_date.strftime('%a %b %d, %Y'),
                         '  ( ' + post_context + ')' if post_context else ''))
        output.append('')

    if link:
        output.append(link)

    return '\n'.join(output)


def parse_post(post_json):
    post_date = dateutil.parser.parse(post_json['creationTime'])
    content = post_json['content'] if 'content' in post_json else ''
    link = post_json['link']['url'] if 'link' in post_json else ''

    hashtags = HASHTAGS[:] # make a copy

    # TODO: Dealing with images later...
    if 'album' in post_json or 'media' in post_json:
        hashtags = hashtags + ['photo', 'photography']

    # If a shared post contains a link, extract that link
    # and give credit to original poster.
    if 'resharedPost' in post_json and 'link' in post_json['resharedPost']:
        link = post_json['resharedPost']['link']['url']
        content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
        hashtags.append('reshared')

    acl = post_json['postAcl']
    post_context = ''
    if 'communityAcl' in acl:
        post_context = acl['communityAcl']['community']['displayName']

    return format_post(content, link, hashtags, post_date, post_context)


# ----------------------
filename = sys.argv[1]
post_json = json.load(open(filename))
print(parse_post(post_json))

if len(sys.argv) > 2 and sys.argv[2] == 'repost':
    print ('posting to %s as %s' % (SERVER, USERNAME))
    post_to_diaspora(parse_post(post_json))


Sunday, October 28, 2018

Google+ Migration - Part II: Understanding the Takeout Archive

<- Part I: Takeout

Once the takeout archive has been successfully generated, we can download and unarchive/extract it to our local disk. At that point, we should find a new directory called Takeout with the Google+ posts located at the following directory location: Takeout/Google+ Stream/Posts.

This posts directory contains 3 types of files:
  • A file containing the data for each post in JSON format
  • Media files of images or videos uploaded and attached to posts, for example in JPG format
  • Metadata files for each media file in CSV format, with an additional extension of .metadata.csv
The filenames are generated as part of the takeout archive generation process with the following conventions: the post filenames are structured as a date in YYYYMMDD format followed by a snippet of the post text, or the word "Post" if there is no text. The media filenames seem to be close to the original names of the files when they were uploaded.

Whenever a filename is not unique, an additional count is added like in these examples:

20180808 - Post(1).json
20180808 - Post(2).json
20180808 - Post(3).json
20180808 - Post.json
cake.jpg
cake(1).jpg


Filenames which contain unicode characters outside the basic ASCII set may not be correctly represented on all platforms and in particular may appear corrupted in the .tgz archive. For the cases which I have been able to test, the default .zip encoded archive seems to handle unicode filenames correctly.

Each of the .json post files contains a JSON object with different named sub-objects which can themselves again be objects, lists of objects or elementary types like strings or numbers.

Based on the data which I have been able to analyze from my post archive, the post JSON object contains the following relevant sub-objects:
  • author - information about the user who created the post. In a personal takeout archive, this is always the same user.
  • creationTime and updateTime - timestamps of when the post was originally created or last updated, respectively
  • content - text of the post in HTML like formatting
  • link, album, media or resharedPost etc. - post attachments of a certain type
  • location - location tag associated with post
  • plusOnes - record of users who have "plussed" the post
  • reshares - records of users who have shared the post
  • comments - record of comments, including comment author info as well as comment content
  • resourceName - unique ID of the post (also available for users, media and other objects)
  • postAcl - visibility of Post - e.g. public, part of a community or visible only to some circles or users.
In particular, this list is missing the representation for collections or other post attachments like polls or events, as there are no examples of these among my posts.

An example JSON for a very simple post consisting of a single unformatted text line, an attached photo and a location tag - without any recorded post interactions:

{
  "url": "https://plus.google.com/+BernhardSuter/posts/hWpzTm3uYe3",
  "creationTime": "2018-07-15 20:43:51+0000",
  "updateTime": "2018-07-15 20:43:51+0000",
  "author": {
   ... 
  },
  "content": "1 WTC",
  "media": {
    "url": "https://lh3.googleusercontent.com/-IkMqxKEbkxs/W0uyBzAdMII/AAAAAAACAUs/uA8EmZCOdZkKGH5PN5Ct_Xj4oaY2ZNX3ACJoC/
                  w2838-h3785/gplus4828396160699980513.jpg",
    "contentType": "image/*",
    "width": 2838,
    "height": 3785,
    "description": "1 WTC",
    "resourceName": "media/CixBRjFRaXBNRUpjaXh6QTRyckdjNE5Nbmx5blVwTTBjd2lIblh3VWFCek0zMA\u003d\u003d"
  },
  "location": {
    "latitude": 40.7331168,
    "longitude": -74.0108977,
    "displayName": "Hudson River Park Trust",
    "physicalAddress": "353 West St, New York, NY 10011, USA"
  },
  "resourceName": "users/117832126248716550930/posts/UgjyOA5tNBvbgHgCoAEC",
  "postAcl": {
    "visibleToStandardAcl": {
      "circles": [{
        "type": "CIRCLE_TYPE_PUBLIC"
      }]
    }
  }
}

JSON is a simple standard data format that can easily be processed programmatically and many supporting libraries already exist. The Python standard library contains a module to parse JSON files and expose the data as native Python data objects to the code for further inspection and processing.

For example this simple Python program below can be used to determine whether a post has public visibility or not:

#!/usr/bin/python

import json
import sys

def is_public(acl):
  """Return True, if access control object contains the PUBLIC pseudo-circle."""
  if ('visibleToStandardAcl' in acl
      and 'circles' in acl['visibleToStandardAcl']):
    for circle in acl['visibleToStandardAcl']['circles']:
      if circle['type'] == 'CIRCLE_TYPE_PUBLIC':
        return True
  return False

# filter out only the posts which have public visibility.
for filename in sys.argv[1:]:
  post = json.load(open(filename))
  if is_public(post['postAcl']):
    print (filename)


Running this as ./public_posts.py ~/Download/Takeout/Google+\ Stream/Posts/*.json would return the list of filenames which contain the publicly visible posts only. By successfully parsing all the .json files in the archive (i.e. without throwing any errors), we can also convince ourselves that the archive contains data in syntactically valid JSON format.

Sunday, October 21, 2018

Google+ Migration - Part I: Takeout



For the last 7 years, I have been using Google+ as my primary social sharing site - with
automated link-sharing to Twitter. With Google+ going away, I am looking to migrate my public postings to a new site, where they can be presented in a similar way. As the target for the migration, I have chosen a local community-operated pod of the diaspora* network.

Migrating social media data is particularly challenging. It is by definition an amalgamation of data from different sources: links, re-shares, likes, comments etc. - all potentially created by different users of the original social sharing platform. Also, contrary to other data-sets (e.g. contact lists, calendars or spreadsheets), there are no established, standardized data formats for exchanging social networking site activity in a platform-independent way.

Without being an expert in copyright and data protection law, I am taking a very conservative approach to ownership and consent. Users of the original Google+ site were explicitly ok with the following cross-user interactions from the perspective of my post-stream:

  • re-sharing post of other users (while respecting the original scope)
  • other users liking ("plussing") my posts
  • other users commenting on my posts

Since none of these other users has ever granted me explicit permission to replicate this content in a new form on another platform, I will only replicate my original content without any interactions, but in addition to public posts I will also include posts to communities, which I consider essentially public. The purpose of this tutorial is to present some tools and methods that could be used to process and select the data in a different way, in order to implement a different policy.

For the technicalities of migration, I am making the following assumptions:

  • as input, only rely on data that is contained in the takeout-archive. This way the migration could be repeated after the original Google+ site is no longer accessible.
  • use the Python programming language for parsing, processing and re-formatting of the data.
  • use a bot with a Python API-library for diaspora* to repost (very slowly!) to the new target system.
While Python is highly portable, any examples and instructions in this tutorial will assume a Unix-like operating system and are tested in particular on a current Debian GNU/Linux based system.

Ordering Takeout

For over 10 years, a Google team calling itself the Data Liberation Front has been working on a promise that users should be able to efficiently extract any of the data they create online with Google services and take it elsewhere. The resulting service is takeout.google.com.

In order to get an archive suitable for processing, we need to request a takeout archive of the Google+ stream data in JSON format. Here are some basic instructions on how to request a new takeout archive.

For the purpose of this migration, we only need to select "Google+ Stream" in the data selection. However, we need to open the extension panel and select JSON format instead of the default HTML. While the HTML export only contains the information necessary to display each post, the JSON export contains additional meta-data like access rights in an easily machine-readable format.

Given the high load on the service right now, archive creation for large streams can take a while or be incomplete. We should expect this process to become more reliable again in the next few weeks.


The next step will be to understand the structure of the data in the takeout archive.