Sunday, November 11, 2018

Google+ Migration - Part III: Content Transformation

<- Part II: Understanding the takeout archive 

After we have had a look at the structure of the takeout archive, we can build some scripts to translate the content of the JSON post description into a format that is suitable for import into the target system, which in our case is Diaspora*.

The following script is a proof of concept conversion of a single post file from the takeout archive to text string that is suitable for upload to a Diaspora* server using the diaspy API.

Images are more challenging and will be handled separately in a later episode. There is also no verification on whether the original post had public visibility and should be re-posted publicly.

The main focus is on the parse_post and format_post methods. The purpose of the parse_post method is to extract the desired information from the JSON representation of a post, while the format_post method uses this data to format the input text needed to create a more or less equivalent post.

While the the post content text in the google+ takeout archive is formatted in pseudo-HTML, Diaspora* post are formatted in Markdown. In order to convert the HTML input to Markdown output, we can use the html2text Python library.

Given the difference in formatting and conventions, there is really no right or wrong way to reformat each post, but a matter of choice.

The choices made here are:

  • If the original post contained text, the text is included at the top of the post with minimal formatting and any URL links stripped out. Google+ posts may include +<username> reference which may look odd. Hashtags should be automatically re-hashtagified on the new system, as long as it uses the hashtag convention.
  • The post includes a series of static hashtags which identify it as a archived, re-posted from G+. Additional hashtags can be generated during the parsing process, e.g. to identify re-shares of photos
  • The original post date and optional community or collection names are included with each post, as we intend to make it obvious that this is a re-posted archive and not a transparent migration.
  • Link attachments are added at the end and should be rendered as a proper link attachment with preview snipped and image if supported - presumably by using something like the OpenGraph markup annotations of the linked page.
  • Deliberately not include any data which results from post activity by other users including likes or re-shares. The only exception is that if a re-shared post includes and external link, this link is included in the post with a "hat tip" to the original poster using their G+ display name at the time of export.

The functionality to post to Diaspora* is included at this time merely as a demonstration that this can indeed work and is not intended to be used without additional operational safeguards.

#!/usr/bin/env python

import datetime
import json
import sys

import dateutil.parser
import diaspy
import html2text

SERVER = '<your diaspora server URL>'
USERNAME = '<your diaspora username>'
PASSWORD = '<not really a good idea...>'

TOOL_NAME = 'G+ repost'
HASHTAGS = ['repost', 'gplusarchive', 'googleplus', 'gplusrefugees', 'plexodus']


def post_to_diaspora(content, filenames=[]):
  c = diaspy.connection.Connection(pod = SERVER,
                                   username = USERNAME,
                                   password = PASSWORD)
  c.login()
  stream = diaspy.streams.Stream(c)
  stream.post(content, provider_display_name = TOOL_NAME)


def format_post(content, link, hashtags, post_date, post_context):
    output = []

    if content:
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        converter.body_width = 0
        output.append(converter.handle(content))
    
    if hashtags:
        output.append(' '.join(('#' + tag for tag in hashtags)))
        output.append('')

    if post_date:
        output.append('Originally posted on Google+ on %s%s' 
                      % (post_date.strftime('%a %b %d, %Y'),
                         '  ( ' + post_context + ')' if post_context else ''))
        output.append('')

    if link:
        output.append(link)

    return '\n'.join(output)


def parse_post(post_json):
    post_date = dateutil.parser.parse(post_json['creationTime'])
    content = post_json['content'] if 'content' in post_json else ''
    link = post_json['link']['url'] if 'link' in post_json else ''

    hashtags = HASHTAGS

    # TODO: Dealing with images later...
    if 'album' in post_json or 'media' in post_json:
        hashtags = hashtags + ['photo', 'photography']

    # If a shared post contains a link, extract that link
    # and give credit to original poster.
    if 'resharedPost' in post_json and 'link' in post_json['resharedPost']:
        link = post_json['resharedPost']['link']['url']
        content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
        hashtags.append('reshared')

    acl = post_json['postAcl']
    post_context = ''
    if 'communityAcl' in acl:
        post_context = acl['communityAcl']['community']['displayName']

    return format_post(content, link, hashtags, post_date, post_context)


# ----------------------
filename = sys.argv[1]
post_json = json.load(open(filename))
print(parse_post(post_json))

if len(sys.argv) > 2 and sys.argv[2] == 'repost':
    print ('posting to %s as %s' % (SERVER, USERNAME))
    post_to_diaspora(parse_post(post_json))