Friday, December 28, 2018

The Fallacy of distributed = good

I have recently been looking for an alternative social media platform and started using Diaspora* via the pod. Not unlike the cryptocurrency community, proponents of the various platforms in the Fediverse seem to rather uncritically advocate the distributed nature of these platforms as an inherently positive property, in particular when it comes to privacy and data protection.

I tend to agree with Yuval Harari, who argues in "Sapiens" that empires, or scaled, centralized forms of organization, are one of Homo Sapiens' significant cultural accomplishments. A majority of humans throughout history have lived as part of some sort of empire. Empires can provide prosperity and ensure lasting peace and stability - like the Pax Romana or, in my generation, the Pax Americana. We often have a love/hate relationship with empires - even many protesters who are busy burning American flags during the day secretly hope that their children will someday get into Harvard and have a better life. Libertarians seem to think of a land without much central governance as a place where strong individuals can realize their dreams of freedom and prosperity - like the romanticized frontier world of daytime TV westerns. My cynical self imagines rather a kind of post-apocalyptic Mad Max world, where our school bullies become sadistic local warlords. In European history, the thousand years after the fall of the Roman Empire, which featured highly distributed power structures, are called the dark Middle Ages for good reasons.

In the online world, many of us love to hate the big social media platforms, and yet billions of us return there every month. Maybe because this is where our friends, family and everybody else are as well, or because they generally offer a smooth and polished service and are good at giving us what we want? When it comes to security, the largest platforms can afford to invest more and have impressively competent security and operations teams to protect our data from being compromised. Like the Roman legions, they do not always succeed, and their failures are highly publicized. But more often than not they succeed, and looking at it dispassionately, our data is probably nowhere as safe as with one of the large providers of online or cloud services. Yes, the large platforms are driven by commercial interests, but that also makes them predictable, as they have a lot to lose and tend to follow laws with pedantic sophistication.

I fail to see how a distributed architecture alone should inherently improve privacy or data protection. For most of us in the Fediverse, our pod-admins, rather than we ourselves, are de facto in possession and control of our data. De jure, they don't have to worry much about data protection and privacy laws, because they are too small to be on the radar of any regulatory agency. Pods can disappear from the network at any time without warning, and account migration between pods is generally not trivial, if possible at all (for example, Diaspora* currently allows profile export, but not yet import into another pod).

On the bright side, the Fediverse allows any of us who are tech-savvy and dedicated enough to run our own pod and become the admins of our own data and lords of our own domain. But in reality, how many of us are really doing this?

For the rest of us, what is left to do is to choose carefully which pod to join. Maybe one that is run by more than one person - a cooperative, club or association? Maybe see whether we can contribute to its operation, either financially or through volunteering. And always be nice to our pod-admins, not just because they essentially own our social media persona, but because they generally perform a tedious and thankless labor of love and, on top of that, most likely also bear the brunt of the financial burden.

While architecturally, operationally and/or organizationally distributed systems may be interesting and may have advantages as well as disadvantages, we should not automatically assume that they are better just because they are distributed.

Sunday, December 16, 2018

Google+ Migration - Part VIII: Export to Diaspora*

<- Part VII: Conversion & Staging

The last stage of the process is to finally export the converted posts to Diaspora*, the chosen target system. As we want these posts to appear slowly and close to their original post-date anniversaries, this process is going to be drawn out over at least one year.

While we could do this by hand, it should ideally be done by some automated process. For this to work, we need some kind of server-type machine that is up, running and connected to the Internet frequently enough over the course of a whole year.

The resource requirements are quite small, except for storing the staged data, which for some users could easily amount to multiple gigabytes, depending mostly on the number of posts with images.

Today it is quite easy to get small and cheap virtual server instances from any cloud provider; for example, the micro-sized Compute Engine instances on Google Cloud are even part of the free tier.

I also still have a few of the small, low-power Raspberry Pi boards lying around, one of which has been mirroring my public G+ posts to Twitter since 2012 and is still active today.

An additional challenge is that Diaspora* does not, at this point, offer an official and supported API. The diaspy Python package is essentially "screen-scraping" the callback handler URLs of the corresponding Diaspora* server and might break easily when the server is upgraded to a new version, which happens several times per year on a well maintained pod. For that reason, we also add support for sending error logs, including exception stack traces, to an external email system, so that we can hopefully notice quickly if/when something goes wrong.

I am planning to run the following script about every 3 hours on my network-connected Raspberry Pi server, using cron with the following crontab entry (see instructions for setting up a crontab entry):

42 */3 * * * cd /home/pi/post_bot && ./ --staging-dir=staging --login-info=logins.json --mail-errors

This runs at the 42nd minute of every hour divisible by 3 on every day, assuming there is a directory /home/pi/post_bot containing the following script saved as, a sub-directory staging/ with the data generated by the process described in the previous episode, and a file logins.json containing the login credentials for the Diaspora* pod and, optionally, for an email service to be used for error notifications.
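As a quick sanity check, the run times selected by this schedule can be enumerated with a few lines of Python:

```python
# Enumerate the times of day matched by the crontab schedule "42 */3 * * *":
# minute 42 of every hour divisible by 3, i.e. eight runs per day.
run_times = ['%02d:42' % hour for hour in range(24) if hour % 3 == 0]
print(run_times)
# -> ['00:42', '03:42', '06:42', '09:42', '12:42', '15:42', '18:42', '21:42']
```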

While storing passwords in clear text on a server is a certifiably bad idea, we at least avoid hard-coding them in the script and store them in a separate file instead, using JSON format, since we are already heavily using JSON in this project. The login credentials file has the following format, with the "mail" section being optional:

{
  "diaspora": {
    "pod-url": "<URL for diaspora pod, e.g.>",
    "username": "<username valid on this pod>",
    "password": "<clear text password for diaspora pod account>"
  },
  "mail": {
    "smtp-server": "<SMTP mail server address, e.g.>",
    "username": "<username, typically email-address>",
    "password": "<clear text password for email account>",
    "recipient": "<recipient email address for error messages>"
  }
}

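Before deploying, it can be useful to verify that the credentials file is complete. The following sketch does that for the mandatory "diaspora" section (the helper name load_login_info is my own, not part of the scripts in this series):

```python
import json

# Keys the post-bot expects in the "diaspora" section of logins.json.
REQUIRED_DIASPORA_KEYS = {'pod-url', 'username', 'password'}

def load_login_info(path):
    """Load the credentials file and check that the diaspora section is complete."""
    with open(path) as f:
        login_info = json.load(f)
    if 'diaspora' not in login_info:
        raise ValueError('%s does not contain a diaspora login section' % path)
    missing = REQUIRED_DIASPORA_KEYS - set(login_info['diaspora'])
    if missing:
        raise ValueError('diaspora section is missing: %s' % ', '.join(sorted(missing)))
    return login_info
```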
There are two ways to run this script: a manual testing mode to upload a particular post, e.g. with ./ --staging-dir=testing --login-info=logins.json --test=staging/20181021/20181021_4XBeoKCnV1N/, and the regular production mode to be called periodically, e.g. from cron as ./ --staging-dir=staging --login-info=logins.json --mail-errors, which auto-selects the next eligible post to be sent, if any.

For compatibility with the most recent version of diaspy, we are using Python 3 (e.g. install additionally with sudo apt-get install python3 python3-pip) and install the additional packages with pip3 install python-dateutil requests diaspy-api bs4.

However, the latest released package version of diaspy does not handle image upload properly, so it may be necessary to download the latest version directly from GitHub and copy the contents of the "diaspy" subdirectory into /home/pi/post_bot as a local copy of the module.

As with any of the code snippets in this project, this is merely meant as an inspiration for your own implementations and not as a usable/finished product in any sense.

When posting to an active social media platform, we should also be careful not to overwhelm the stream with archive content, and be ready to engage with readers on automatically posted content as well, since the goal should be to create new connections and conversations.

#!/usr/bin/env python3

import argparse
import datetime
from email.mime.text import MIMEText
import glob
from io import StringIO
import json
import logging
import logging.handlers
import os
import smtplib
import shutil
import sys

import dateutil.parser
import diaspy

ISO_DATE = '%Y%m%d'
TOOL_NAME = 'G+ post-bot'

def send_error_message(txt, email_info):
  """Send a crash/error message to a configured email address."""
  server = smtplib.SMTP(email_info['smtp-server'])
  server.login(email_info['username'], email_info['password'])
  msg = MIMEText(txt)
  msg['From'] = email_info['username'] 
  msg['To'] =  email_info['recipient']
  msg['Subject'] = 'error message from %s on %s' % (TOOL_NAME, os.uname()[1])
  server.sendmail(email_info['username'], email_info['recipient'], msg.as_string())

def post_to_diaspora(post_dir, login_info):
  """Load a post from staging directory and send to diaspora server."""
  cwd = os.getcwd()
  os.chdir(post_dir)
  content = open('').read()
  images = sorted(glob.glob('img_*.jpg'))

  c = diaspy.connection.Connection(pod=login_info['pod-url'],
                                   username=login_info['username'],
                                   password=login_info['password'])
  c.login()
  stream = diaspy.streams.Stream(c)
  if not images:, provider_display_name=TOOL_NAME)
  else:
    # Upload the images first, then attach them to the new post.
    ids = [stream._photoupload(name) for name in images], photos=ids, provider_display_name=TOOL_NAME)
  os.chdir(cwd)

# --------------------
parser = argparse.ArgumentParser(description='Post staged posts from a G+ archive to a diaspora pod')
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--login-info', dest='login_info', action='store', required=True)
parser.add_argument('--test', dest='test_data', action='store')
parser.add_argument('--mail-errors', dest='mail', action='store_true')

args = parser.parse_args()

# Set up logging to both syslog and a memory buffer.
log_buffer = StringIO()
logging.basicConfig(stream=log_buffer, level=logging.INFO)
syslog = logging.handlers.SysLogHandler(address='/dev/log')
syslog.setFormatter(logging.Formatter('diaspora-post-bot: %(levelname)s %(message)s'))
logging.getLogger().addHandler(syslog)

# Load login/authentication data from a separate file.
login_info = json.load(open(args.login_info))

if not 'diaspora' in login_info:
  print('%s does not contain diaspora login section' % args.login_info)
  sys.exit(1)

if args.test_data:
  # Directly load a post staging directory to diaspora.
  post_to_diaspora(args.test_data, login_info['diaspora'])
else:
  # Find next post directory and load to diaspora.
  # Intended to run un-attended from cron-job or similar at periodic intervals (e.g. every 3h)
  try:'starting export from %s' % args.staging_dir)
    dirs = sorted(glob.glob(os.path.join(args.staging_dir, '[0-9]*')))
    if not dirs:'no more data to export')
      sys.exit(0)
    next_dir = dirs[0]

    # Check if post date for next staging directory has been reached.
    if dateutil.parser.parse(os.path.basename(next_dir)) >'next dir not yet ready for export: %s' % os.path.basename(dirs[0]))
      sys.exit(0)'found next active staging directory %s' % next_dir)

    # Find next post in staging directory or delete staging directory when empty.
    posts = sorted(os.listdir(next_dir))
    if not posts:'deleting empty staging directory: %s' % next_dir)
      os.rmdir(next_dir)
      sys.exit(0)

    # Move exported posts to a backup directory.
    completion_dir = os.path.join(args.staging_dir, 'completed')
    if not os.path.exists(completion_dir):
      os.makedirs(completion_dir)

    # Send next post to diaspora server.
    post_dir = os.path.join(next_dir, posts[0])'posting %s...' % post_dir)
    post_to_diaspora(post_dir, login_info['diaspora'])
    shutil.move(post_dir, completion_dir)'post completed')
  except (KeyboardInterrupt, SystemExit):
    raise
  except Exception as e:
    logging.exception('error in main loop')
    if args.mail and 'mail' in login_info:
      send_error_message(log_buffer.getvalue(), login_info['mail'])

Sunday, December 9, 2018

Google+ Migration - Part VII: Conversion & Staging

<- Part VI: Location, Location, Location 

We are now ready to put all the pieces together for exporting to Diaspora*, the new target platform.

If we had some sort of "Minitrue" permissions to rewrite history on the target system, the imported posts could appear to always have been there since their original G+ posting date.

However, since we only have regular user permissions, the only choice is to post them as new posts at some future point in time. The most straightforward way to upload the archive would be to re-post in chronological order, as quickly as possible without causing overload.

If the new account is not only used for archive purposes, we may want to maximize the relevance of the archive posts in the new stream. In that case, a better way would be to post each archive post on the anniversary of its original post date, creating a sort of "this day in history" series. This requires the upload activity to be scheduled over at least a year, causing some operational challenges.

In order to minimize the risk of things going wrong while generating the new posts during this drawn-out, hopefully unattended and automated posting process, we try to do as much of the conversion as possible in a single batch and stage the converted output to be uploaded/posted to the destination system at some planned future time. This also allows for easier inspection of the generated output, or for adapting the process to a different destination system, e.g. a blog.

The following Python script reads a list of post filenames from the takeout archive, extracts the relevant information from the JSON object in each file and generates the new post content in Markdown format. Besides being the input format for Diaspora*, Markdown is widely used and can easily be converted into other formats, including HTML. The list of posts we want to export can be generated using the script from part IV in this series. We also have downloaded the images referenced in any of these posts using the script from part V and stored them in a location like /tmp/images.

Most of my posts are either photo or link sharing, with just a line or two of commentary - more of a Twitter use-case than the long-form posts that G+ would support equally well. The script contains several assumptions optimized for this use-case. For example, HTML links are stripped from the text content, assuming that each post only has one prominent link that is being shared. Many of my photo sharing posts contain location information, which is extracted here into additional hashtags as well as a location link on OpenStreetMap.

Hashtags are a more central concept on Diaspora* than they were on G+. Other than some static pre-defined hashtags to identify the posts as an automated repost from G+, there are additional hashtags that are added based on the type of post - e.g. photo sharing, stripped down re-sharing of another post, sharing of a YouTube video or high level geo location info.
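The resulting hashtag line is then just the concatenation of the static and the per-post tags; a minimal sketch (the helper name hashtag_line is mine):

```python
# Static hashtags marking every post as an automated repost from G+.
STATIC_HASHTAGS = ['repost', 'bot', 'gplusarchive', 'googleplus', 'throwback']

def hashtag_line(extra_tags):
    """Join the static and per-post hashtags into a single Markdown line."""
    return ' '.join('#' + tag for tag in STATIC_HASHTAGS + extra_tags)

print(hashtag_line(['photo', 'photography', 'US', 'UnitedStates', 'NYC']))
# -> #repost #bot #gplusarchive #googleplus #throwback #photo #photography #US #UnitedStates #NYC
```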

Before running the conversion & staging script, we need to decide on which day in the future we want to start posting the archive. Given a staging directory, e.g. /tmp/stage_for_diaspora, the script will create a sub-directory for each day that contains scheduled post activity. In each daily schedule directory, the script creates a unique sub-directory containing a file with the new post text in Markdown as well as any images to be attached. The unique name for each post consists of the date of the original post plus what appears to be a unique ID in the post URL, in the absence of a real unique post ID in the JSON file. For example, a post originally posted on Jul 14, 2018 would be stored in /tmp/stage_for_diaspora/20190714/20180714_C3RUWSDE7X7/ formatted as:

Port Authority Inland Terminal - from freight hub to Internet switching center.

#repost #bot #gplusarchive #googleplus #throwback #photo #photography #US #UnitedStates #NYC

[111 8th Ave](
Originally posted Sat Jul 14, 2018 on Google+ (Alte Städte / Old Towns)

Or the post which shared the link to the first part of this series would be re-formatted as:

Starting to document the process of migrating my public post stream to diaspora*.  
The plan is to process the takeout archive in Python and generate (somewhat) equivalent diaspora* posts using diaspy.  

#repost #bot #gplusarchive #googleplus #throwback

Originally posted Sun Oct 21, 2018 on Google+  (Google+ Mass Migration)

The script also checks the current status of link URLs to avoid sharing a broken link. While we tell our children to be careful since "the Internet never forgets", in reality many links are gone after just a few years - the whole G+ site soon being an example of that.

Since Diaspora* is not particularly well optimized for photo processing, and to help save storage cost on the pod server, the script can also downscale images to a fixed maximum size that is suitable for on-screen display.
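The downscaling only needs to limit the longer edge while preserving the aspect ratio; the arithmetic can be sketched as (the helper name downscaled_size is mine):

```python
def downscaled_size(width, height, max_size=1024):
    """New dimensions with the longer edge limited to max_size, keeping aspect ratio."""
    source_size = max(width, height)
    if not max_size or source_size <= max_size:
        return (width, height)  # already small enough, copy unchanged
    scale = float(max_size) / float(source_size)
    return (int(width * scale), int(height * scale))

print(downscaled_size(4096, 2048))  # -> (1024, 512)
```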

For example, by running the script as
./ --image-dir=/tmp/images --staging-dir=/tmp/stage_for_diaspora --start-date=20191001 --image-size=1024 < /tmp/public_posts.txt
we assume that we want to start publishing on Oct 1, 2019, that images are located in /tmp/images and should be limited to a maximum size of 1024 pixels for publishing, and that the whole output will be staged in /tmp/stage_for_diaspora.

Since this script does not do any posting itself, we can run it as many times as we need to, inspect the output and make adjustments as necessary. Link URL checking and geo-coding (see part VI) require network access from the machine where the script is executed. In principle, we could manually post the generated output to some target system, but in a future episode we will demonstrate a fully automated way of posting to diaspora*.

In addition to what is already included in the Python standard library (2.7), we need some additional packages, which can be installed for example using pip: pip install python-dateutil geopy html2text Pillow pycountry requests

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import argparse
import codecs
import datetime
import io
import json
import os
import sys

import dateutil.parser
import geopy.geocoders
import html2text
import PIL.Image 
import pycountry
import requests

ISO_DATE = '%Y%m%d'

HASHTAGS = ['repost', 'bot', 'gplusarchive', 'googleplus', 'throwback']

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(loc):
  """Return hashtags related to the location of the post: ISO country code, country name, city/town."""
  hashtags = []
  if 'latitude' in loc and 'longitude' in loc:
    addr = geocoder.reverse((loc['latitude'], loc['longitude'])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags

def get_location_link(loc):
  """Return a link to OpenStreetMap for the post location."""
  if 'latitude' in loc and 'longitude' in loc and 'displayName' in loc:
    map_url = (''
               % (loc['latitude'], loc['longitude']))
    return '[%s](%s)' % (loc['displayName'], map_url)
  return None

def validate_url(url):
  """Verify whether a URL still exists, including a potential redirect."""
  user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                 + ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
  try:
    r = requests.get(url, headers=user_agent)
    if r.status_code != 200:
      return None
    return r.url
  except requests.ConnectionError:
    return None

def get_image_name(resource_name):
  """Generate image cache name for media resource-name."""
  return resource_name.replace('media/', '', 1) + '.jpg'

def copy_downscale_image(source, destination, max_size):
  """Copy a downscaled version of the image to the staging location."""
  img =
  source_size = max(img.size[0], img.size[1])
  if not max_size or source_size <= max_size:, 'JPEG')
  else:
    scale = float(max_size) / float(source_size)
    img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)), PIL.Image.LANCZOS), 'JPEG')

def parse_post(post_json):
  """Extract relevant information from a JSON formatted post."""
  post_date = dateutil.parser.parse(post_json['creationTime'])
  content = post_json['content'] if 'content' in post_json else ''
  link = post_json['link']['url'] if 'link' in post_json else ''

  hashtags = HASHTAGS[:] # make a copy
  images = []

  if 'media' in post_json:
    media = post_json['media']
    if media['contentType'] == 'video/*' and 'youtube' in media['url']:
      # If the media is a youtube URL, convert into a link-sharing post.
      link = media['url']
      hashtags = hashtags + ['video', 'YouTube']
    elif media['contentType'] == 'image/*':
      hashtags.extend(['photo', 'photography'])
      images.append(get_image_name(media['resourceName']))
    else:
      return None # unsupported media format

  if 'album' in post_json:
    hashtags = hashtags + ['photo', 'photography']
    for image in post_json['album']['media']:
      if image['contentType'] == 'image/*':
        images.append(get_image_name(image['resourceName']))
    if len(images) == 0:
      return None # no supported image attachment in album

  # If a shared post contains a link, extract that link
  # and give credit to original poster.
  if 'resharedPost' in post_json:
    if 'link' in post_json['resharedPost']:
      link = post_json['resharedPost']['link']['url']
      content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
    else:
      return None # reshare without a link attachment

  acl = post_json['postAcl']
  post_context = {}
  if 'communityAcl' in acl:
    post_context['community'] = acl['communityAcl']['community']['displayName']

  if 'location' in post_json:
    location_link = get_location_link(post_json['location'])
    if location_link:
      post_context['location'] = location_link

  return (content, link, hashtags, post_date, post_context, images)

def format_content(content, link, hashtags, post_date, post_context):
  """Generate a Markdown formatted string from the pieces of a post."""
  output = []
  if content:
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    converter.body_width = 0
    output.append(converter.handle(content))
  if hashtags:
    output.append(' '.join(('#' + tag for tag in hashtags)))
  if 'location' in post_context:
    output.append(post_context['location'])
  if link:
    output.append(link)
  if post_date:
    output.append('Originally posted %s on Google+ %s'
                    % (post_date.strftime('%a %b %d, %Y'),
                       '  (' + post_context['community'] + ')' if 'community' in post_context else ''))
  return u'\n'.join(output)

def get_post_directory(outdir, post_date, start_date, url):
  """Generate staging output directory based on schedule date & post unique ID."""
  post_id = post_date.strftime(ISO_DATE) + '_' + url.split('/')[-1]
  schedule_date = post_date.replace(year=start_date.year, tzinfo=None)
  if schedule_date < start_date:
    schedule_date = schedule_date.replace(year=schedule_date.year + 1)
  return os.path.join(outdir, schedule_date.strftime(ISO_DATE), post_id)

# --------------------
parser = argparse.ArgumentParser(description='Convert posts from a takeout archive into staged Markdown posts')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--image-size', dest='image_size', action='store', type=int)
parser.add_argument('--start-date', dest='start_date', action='store', type=dateutil.parser.parse, required=True)
parser.add_argument('--refresh', dest='refresh', action='store_true')
args = parser.parse_args()

if not os.path.exists(args.image_dir):
  sys.stderr.write('image-dir not found: ' + args.image_dir + '\n')
  sys.exit(1)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print ('staging directory: %s' % args.staging_dir)
print ('publish start date: %s' % args.start_date.strftime(ISO_DATE))

count = 0
for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))
  post_data = parse_post(post)

  if post_data:
    content, link, hashtags, post_date, post_context, images = post_data
    post_dir = get_post_directory(args.staging_dir, post_date, args.start_date, post['url'])

    # Skip posts which have already been staged, unless a refresh is requested.
    if not args.refresh and os.path.exists(post_dir):
      continue

    # Avoid exporting posts with stale links.
    if link:
      link = validate_url(link)
      if not link:
        print ('\nURL %s not found, skipping export for %s' % (post_data[1], post_dir))
        continue

    # Output content in Markdown format to staging location.
    if not os.path.exists(post_dir):
      os.makedirs(post_dir)
    content_file = os.path.join(post_dir, ''), 'w', encoding='utf-8')
    content_file.write(format_content(content, link, hashtags, post_date, post_context))
    content_file.close()

    for i, image in enumerate(images):
      source = os.path.join(args.image_dir, image)
      destination = os.path.join(post_dir, 'img_%d.jpg' % i)
      copy_downscale_image(source, destination, args.image_size)
    count += 1
print ('%d posts exported to %s' % (count, args.staging_dir))