Sunday, December 9, 2018

Google+ Migration - Part VII: Conversion & Staging

<- Part VI: Location, Location, Location 

We are now ready to put all the pieces together for exporting to Diaspora*, the new target platform.

If we had some sort of "Minitrue" permissions to rewrite history on the target system, the imported posts could appear to always have been there since their original G+ posting date.

However, since we only have regular user permissions, the only choice is to post them as new posts at some future point in time. The most straightforward way to upload the archive would be to re-post in chronological order as quickly as possible without causing overload.

If the new account is not used only for archive purposes, we may want to maximize the relevance of the archive posts in the new stream. In this case, a better way would be to post each archive post on the anniversary of its original post date, creating a kind of "this day in history" series. This requires the upload activity to be scheduled over at least a year, which causes some operational challenges.
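The anniversary rule can be sketched in a few lines. This is only an illustration under stated assumptions: anniversary_on_or_after is a hypothetical helper name, the start date of Oct 1 2019 is just an example, and falling back to Feb 28 for posts made on a leap day is my own choice, since replacing the year of a Feb 29 date with a non-leap year raises a ValueError.

```python
import datetime


def anniversary_on_or_after(post_date, start_date):
    """Return the first anniversary of post_date on or after start_date."""
    def in_year(year):
        try:
            return post_date.replace(year=year)
        except ValueError:
            # Feb 29 in a non-leap year: fall back to Feb 28 (an assumption).
            return post_date.replace(year=year, day=28)

    candidate = in_year(start_date.year)
    if candidate < start_date:
        # This year's anniversary is already behind us, use next year's.
        candidate = in_year(start_date.year + 1)
    return candidate


start = datetime.date(2019, 10, 1)
print(anniversary_on_or_after(datetime.date(2018, 7, 14), start))   # 2020-07-14
print(anniversary_on_or_after(datetime.date(2018, 10, 21), start))  # 2019-10-21
print(anniversary_on_or_after(datetime.date(2016, 2, 29), start))   # 2020-02-29
```

With this rule every post lands within one year of the start date, so the whole archive is replayed exactly once.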

In order to minimize the risk of things going wrong while generating the new posts during this drawn-out, hopefully unattended and automated posting process, we try to do as much of the conversion as possible in a single batch and stage the converted output to be uploaded/posted to the destination system at some planned future time. This also allows for easier inspection of the generated output and makes it easier to adapt the process for a different destination system, e.g. a blog.

The following Python script reads a list of post filenames from the takeout archive, extracts the relevant information from the JSON object in each file and generates the new post content in Markdown format. Besides being the input format for Diaspora*, Markdown is widely used and can easily be converted into other formats, including HTML. The list of posts we want to export can be generated using the post_filter.py script from part IV in this series. We have also downloaded the images referenced in any of these posts using the image_cache.py script from part V and stored them in a location like /tmp/images.

Most of my posts are either photo or link sharing, with just a line or two of commentary, closer to a Twitter use-case than the long-form posts that G+ would support equally well. The script contains several assumptions that are optimized for this use-case. For example, HTML links are stripped from the text content, assuming that each post only has one prominent link that is being shared. Many of my photo sharing posts contain location information, which is extracted here into additional hashtags as well as a location link on OpenStreetMap.

Hashtags are a more central concept on Diaspora* than they were on G+. Other than some static pre-defined hashtags to identify the posts as an automated repost from G+, there are additional hashtags that are added based on the type of post - e.g. photo sharing, stripped down re-sharing of another post, sharing of a YouTube video or high level geo location info.
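How the hashtag line is assembled can be sketched as follows. The tag values are the ones used in the script further down, but the post type keys ('photo', 'video', 'reshare') and the helper name hashtag_line are hypothetical and only serve this illustration.

```python
# Static hashtags marking every post as an automated repost from G+.
BASE_HASHTAGS = ['repost', 'bot', 'gplusarchive', 'googleplus', 'throwback']

# Additional hashtags per type of post (keys are hypothetical labels).
TYPE_HASHTAGS = {
    'photo': ['photo', 'photography'],
    'video': ['video', 'YouTube'],
    'reshare': ['reshared'],
}


def hashtag_line(post_type=None, location_tags=()):
    """Return the '#tag #tag ...' line for a post."""
    tags = list(BASE_HASHTAGS)
    tags.extend(TYPE_HASHTAGS.get(post_type, []))
    # Remove spaces, since a hashtag ends at the first whitespace.
    tags.extend(tag.replace(' ', '') for tag in location_tags)
    return ' '.join('#' + tag for tag in tags)


print(hashtag_line('photo', ['US', 'United States', 'NYC']))
# #repost #bot #gplusarchive #googleplus #throwback #photo #photography #US #UnitedStates #NYC
```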

Before running the conversion & staging script, we need to decide on which day in the future we want to start posting the archive. Given a staging directory, e.g. /tmp/stage_for_diaspora, the script will create a sub-directory for each day that contains scheduled post activity. In each daily schedule directory, the script creates a unique sub-directory per post containing a content.md file with the new post text in Markdown as well as any images to be attached. The unique name for each post consists of the date of the original post plus what seems to be a unique ID in the post URL, in the absence of a real unique post ID in the JSON file. For example, a post originally posted on Jul 14 2018 would be stored in /tmp/stage_for_diaspora/20190714/20180714_C3RUWSDE7X7/content.md formatted as:

Port Authority Inland Terminal - from freight hub to Internet switching center.

#repost #bot #gplusarchive #googleplus #throwback #photo #photography #US #UnitedStates #NYC

[111 8th Ave](https://www.openstreetmap.org/?lat=40.7414688&lon=-74.0033873&zoom=17)
Originally posted Sat Jul 14, 2018 on Google+ (Alte Städte / Old Towns)

Or the post which shared the link to the first part of this series would be re-formatted as:

Starting to document the process of migrating my public post stream to diaspora*.  
  
The plan is to process the takeout archive in Python and generate (somewhat) equivalent diaspora* posts using diaspy.  

#repost #bot #gplusarchive #googleplus #throwback

Originally posted Sun Oct 21, 2018 on Google+  (Google+ Mass Migration)

https://blog.kugelfish.com/2018/10/google-migration-part-i-takeout.html
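Since the staging area is nothing but directories and files, the posts scheduled for a given day can be listed with a few lines of Python. This is a minimal sketch that assumes the layout described above (one directory per day, one sub-directory per post, each with a content.md); staged_posts is a hypothetical helper and not part of the conversion script.

```python
import os


def staged_posts(staging_dir, day):
    """Return the content.md paths staged for day, given as 'YYYYMMDD'."""
    day_dir = os.path.join(staging_dir, day)
    if not os.path.isdir(day_dir):
        return []  # nothing scheduled for that day
    paths = []
    for post_id in sorted(os.listdir(day_dir)):
        content = os.path.join(day_dir, post_id, 'content.md')
        if os.path.isfile(content):
            paths.append(content)
    return paths


for path in staged_posts('/tmp/stage_for_diaspora', '20190714'):
    print(path)
```

Sorting the post directories by name also sorts them by original post date, because each name starts with that date.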

The script also checks the current status of link URLs to avoid sharing a broken link. While we tell our children to be careful since "the Internet never forgets", in reality many links are gone after just a few years - the whole G+ site soon being an example of that.

Since Diaspora* is not particularly well optimized for photo processing, and to help save storage cost on the pod server, the script can also downscale images to a fixed maximum size that is suitable for on-screen display.
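The size calculation behind the downscaling can be sketched without any imaging library. This assumes that "maximum size" refers to the longer edge of the image and that the aspect ratio is preserved; scaled_size is a hypothetical helper, and it uses integer arithmetic instead of a floating point scale factor so that the longer edge comes out at exactly the requested size.

```python
def scaled_size(size, max_size):
    """Return (width, height) with the longer edge limited to max_size."""
    width, height = size
    longest = max(width, height)
    if not max_size or longest <= max_size:
        return (width, height)  # small enough already, keep as is
    # Integer arithmetic avoids rounding surprises from a float scale factor.
    return (max(1, width * max_size // longest),
            max(1, height * max_size // longest))


print(scaled_size((4000, 3000), 1024))  # (1024, 768)
print(scaled_size((800, 600), 1024))    # (800, 600)
```

The resulting tuple is in the form that an image library's resize call would typically expect.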

For example, by running the script as
./post_transformer.py --image-dir=/tmp/images --staging-dir=/tmp/stage_for_diaspora --start-date=20191001 --image-size=1024 < /tmp/public_posts.txt
we are assuming that we want to start publishing on Oct 1 2019, that the images are located in /tmp/images and should be limited to a maximum size of 1024 pixels for publishing, and that the whole output will be staged in /tmp/stage_for_diaspora.

Since this script does not do any posting itself, we can run it as many times as we need to, inspect the output and make adjustments as necessary. Link URL checking and geo-coding (see part VI) require network access from the machine where the script is being executed. In principle, we could manually post the generated output to some target system, but in a future episode, we will demonstrate a fully automated way of posting to Diaspora*, assuming that

In addition to what is already included in the Python standard library (2.7), we need the following additional packages: python-dateutil, geopy, html2text, Pillow, pycountry and requests.
They can be installed, for example, using pip: pip install python-dateutil geopy html2text Pillow pycountry requests


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import argparse
import codecs
import datetime
import io
import json
import os
import sys

import dateutil.parser
import geopy.geocoders
import html2text
import PIL.Image 
import pycountry
import requests

ISO_DATE = '%Y%m%d'

HASHTAGS = ['repost', 'bot', 'gplusarchive', 'googleplus', 'throwback']

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(loc):
  """Return hashtags related to the location of the post: ISO country code, country name, city/town."""
  hashtags = []
  if 'latitude' in loc and 'longitude' in loc:
    addr = geocoder.reverse((loc['latitude'], loc['longitude'])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags


def get_location_link(loc):
  """Return a link to OpenStreetMap for the post location."""
  if 'latitude' in loc and 'longitude' in loc and 'displayName' in loc:
    map_url = ('https://www.openstreetmap.org/?lat=%s&lon=%s&zoom=17' % (loc['latitude'], loc['longitude']))
    return '[%s](%s)' % (loc['displayName'], map_url)
  else:
    return None


def validate_url(url):
  """Verify whether a URL still exists, including a potential redirect."""
  user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                 + ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
  try:
    r = requests.get(url, headers=user_agent)
    if r.status_code != 200:
      return None
    return r.url
  except requests.ConnectionError:
    return None
  

def get_image_name(resource_name):
  """Generate image cache name for media resource-name."""
  return resource_name.replace('media/', '', 1) + '.jpg'


def copy_downscale_image(source, destination, max_size):
  """Copy a downscaled version of the image to the staging location."""
  img = PIL.Image.open(source)
  source_size = max(img.size[0], img.size[1])
  # The staged files are named *.jpg, so write plain JPEG rather than JPEG 2000.
  if not max_size or source_size <= max_size:
    img.save(destination, 'JPEG')
  else:
    scale = float(max_size) / float(source_size)
    img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)), PIL.Image.LANCZOS)
    img.save(destination, 'JPEG')


def parse_post(post_json):
  """Extract relevant information from a JSON formatted post."""
  post_date = dateutil.parser.parse(post_json['creationTime'])
  content = post_json['content'] if 'content' in post_json else ''
  link = post_json['link']['url'] if 'link' in post_json else ''

  hashtags = HASHTAGS[:] # make a copy
  images = []

  if 'media' in post_json:
    media = post_json['media']
    if media['contentType'] == 'video/*' and 'youtube' in media['url']:
      # If the media is a YouTube URL, convert into a link-sharing post.
      link = media['url']
      hashtags = hashtags + ['video', 'YouTube']
    elif media['contentType'] == 'image/*':
      hashtags.extend(['photo', 'photography'])
      images.append(get_image_name(media['resourceName']))
    else:
      return None # unsupported media format

  if 'album' in post_json:
    hashtags = hashtags + ['photo', 'photography']
    for image in post_json['album']['media']:
      if image['contentType'] == 'image/*':
        images.append(get_image_name(image['resourceName']))
    if len(images) == 0:
      return None # no supported image attachment in album

  # If a shared post contains a link, extract that link
  # and give credit to original poster.
  if 'resharedPost' in post_json:
    if 'link' in post_json['resharedPost']:
      link = post_json['resharedPost']['link']['url']
      content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
      hashtags.append('reshared')
    else:
      return None # reshare without a link attachment

  acl = post_json['postAcl']
  post_context = {}
  if 'communityAcl' in acl:
    post_context['community'] = acl['communityAcl']['community']['displayName']

  if 'location' in post_json:
    hashtags.extend(get_location_hashtags(post_json['location']))
    location_link = get_location_link(post_json['location'])
    if location_link:
      post_context['location'] = location_link

  return (content, link, hashtags, post_date, post_context, images)


def format_content(content, link, hashtags, post_date, post_context):
  """Generate a Markdown formatted string from the pieces of a post."""
  output = []
  if content:
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    converter.body_width = 0
    output.append(converter.handle(content))
  if hashtags:
    output.append(' '.join(('#' + tag for tag in hashtags)))
    output.append('')
  if 'location' in post_context:
    output.append(post_context['location'])
  if post_date:
    # Only add a separating space when a community name follows.
    output.append('Originally posted %s on Google+%s'
                    % (post_date.strftime('%a %b %d, %Y'),
                       ' (' + post_context['community'] + ')' if 'community' in post_context else ''))
    output.append('')
  if link:
    output.append(link)
    output.append('')
  return u'\n'.join(output)


def get_post_directory(outdir, post_date, start_date, url):
  """Generate staging output directory based on schedule date & post unique ID."""
  post_id = post_date.strftime(ISO_DATE) + '_' + url.split('/')[-1]
  schedule_date = post_date.replace(year=start_date.year, tzinfo=None)
  if schedule_date < start_date:
    schedule_date = schedule_date.replace(year=schedule_date.year + 1)
  return os.path.join(outdir, schedule_date.strftime(ISO_DATE), post_id)
  

# --------------------
parser = argparse.ArgumentParser(description='Convert a set of posts to Markdown and stage them for scheduled posting')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--image-size', dest='image_size', action='store', type=int)
parser.add_argument('--start-date', dest='start_date', action='store', type=dateutil.parser.parse, required=True)
parser.add_argument('--refresh', dest='refresh', action='store_true')
args = parser.parse_args()

if not os.path.exists(args.image_dir):
  sys.stderr.write('image-dir not found: ' + args.image_dir + '\n')
  sys.exit(-1)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print ('staging directory: %s' % args.staging_dir)
print ('publish start date: %s' % args.start_date.strftime(ISO_DATE))

count = 0
for filename in sys.stdin.readlines():
  filename = filename.strip()
  with open(filename) as post_file:
    post = json.load(post_file)
  post_data = parse_post(post)

  if post_data:
    content, link, hashtags, post_date, post_context, images = post_data
    post_dir = get_post_directory(args.staging_dir, post_date, args.start_date, post['url'])

    if not args.refresh and os.path.exists(post_dir):
      continue

    # Avoid exporting posts with stale links.
    if link:
      link = validate_url(link)
      if not link:
        print ('\nURL %s not found, skipping export for %s' % (post_data[1], post_dir))
        continue

    # Output content in Markdown format to staging location.
    if not os.path.exists(post_dir):
      os.makedirs(post_dir)
     
    content_file = io.open(os.path.join(post_dir, 'content.md'), 'w', encoding='utf-8')
    content_file.write(format_content(content, link, hashtags, post_date, post_context))
    content_file.close()

    for i, image in enumerate(images):
      source = os.path.join(args.image_dir, image)
      destination = os.path.join(post_dir, 'img_%d.jpg' % i)
      copy_downscale_image(source, destination, args.image_size)
      
    count += 1
    sys.stdout.write('.')
    sys.stdout.flush()
    
print ('\n%d posts exported to %s' % (count, args.staging_dir))

