Sunday, December 16, 2018

Google+ Migration - Part VIII: Export to Diaspora*

<- Part VII: Conversion & Staging

The last stage of the process is to export the converted posts to Diaspora*, the chosen target system. As we want these posts to appear gradually and close to the anniversary of their original post date, this process is going to be drawn out over at least one year.

While we could do this by hand, it should ideally be done by some automated process. For this to work, we need some kind of server-type machine that is up and running and connected to the Internet frequently enough during a whole year.

The resource requirements are quite small, except for storing the staged data, which for some users could easily run to multiple gigabytes, mostly depending on the number of posts with images.

Today it is quite easy to get small and cheap virtual server instances from any cloud provider; for example, the micro-sized Compute Engine instances on Google Cloud should even be part of the free tier.

I also still have a few of the small, low-power Raspberry Pi boards lying around, one of which has been mirroring my public G+ posts to Twitter since 2012 and is still active today.

An additional challenge is that Diaspora* does not at this point offer an official and supported API. The diaspy Python API package is essentially "screen-scraping" the callback handler URLs of the corresponding diaspora server and might break easily when the server is upgraded to a new version, which happens several times per year on a well-maintained pod. For that reason, we are also adding support for sending error logs, including exception stack traces, to an external email system so that we can hopefully notice quickly if and when something goes wrong.
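The idea of collecting a log, including the stack trace, in memory so that it can later be mailed can be tried out in isolation. The following is a minimal sketch using only the standard library; the function name run_and_capture and the logger name are hypothetical and not part of the script.

```python
import io
import logging


def run_and_capture(task):
    """Run task() and return everything that was logged, including tracebacks."""
    buffer = io.StringIO()
    logger = logging.getLogger('post_bot_sketch')  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the sketch independent of the root logger
    handler = logging.StreamHandler(buffer)
    logger.addHandler(handler)
    try:
        task()
    except Exception:
        # logging.exception() appends the current stack trace to the message.
        logger.exception('error while running task')
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue()


def failing_task():
    raise ValueError('pod changed')


report = run_and_capture(failing_task)
print(report)
```

The returned text is what would be handed to a mail function as the message body; an empty string means nothing was logged.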

I am planning to run the following script about every 3 hours on my network connected Raspberry Pi server using cron with the following crontab entry (see instructions for setting up a crontab entry):

PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin
42 */3 * * * cd /home/pi/post_bot && ./post_bot.py --staging-dir=staging --login-info=logins.json --mail

This should run at minute 42 of every hour divisible by 3, every day, assuming there is a directory at /home/pi/post_bot containing the following script as post_bot.py, a sub-directory staging/ with the data generated using the process described in the previous episode, and a file logins.json containing the login credentials for the diaspora pod and, optionally, for an email service to be used for error notifications.
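To make the schedule concrete, the daily run times implied by this crontab entry can be listed with a small sketch. The helper cron_run_times is hypothetical and only models the minute and hour fields used here, not cron syntax in general.

```python
def cron_run_times(minute=42, hour_step=3):
    """Return the daily HH:MM run times for a 'minute */hour_step * * *' entry."""
    return ['%02d:%02d' % (hour, minute) for hour in range(0, 24, hour_step)]


print(cron_run_times())
# ['00:42', '03:42', '06:42', '09:42', '12:42', '15:42', '18:42', '21:42']
```

With at most one post sent per run, this gives an upper bound of eight posts per day.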

While storing passwords in clear text on a server is a certifiably bad idea, we at least avoid hard-coding them in the script and store them in a separate file instead, using JSON format since we are already heavily using JSON for this project. The login credentials file has the following format, with the "mail" section being optional:

{
  "diaspora": {
    "pod-url": "<URL for diaspora pod, e.g. https://diasporing.ch>",
    "username": "<username valid on this pod>",
    "password": "<clear text password for diaspora pod account>"
  },
  "mail": {
    "smtp-server": "<SMTP mail server address, e.g. mail.gmx.net>",
    "username": "<username, typically email-address>",
    "password": "<clear text password for email account>",
    "recipient": "<recipient email address for error messages>"
  }
}
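Since a typo in this file would only show up when the bot first tries to log in, it may be worth checking the file once by hand. The following is a sketch under the assumption that every key shown above is required within its section; the helper missing_login_keys and the example URL are hypothetical.

```python
import json

# Keys taken from the format shown above; 'mail' as a whole is optional.
REQUIRED_KEYS = {
    'diaspora': ['pod-url', 'username', 'password'],
    'mail': ['smtp-server', 'username', 'password', 'recipient'],
}


def missing_login_keys(login_info):
    """Return the names of missing sections/keys, e.g. 'diaspora.password'."""
    missing = []
    if 'diaspora' not in login_info:
        missing.append('diaspora')
    for section, keys in REQUIRED_KEYS.items():
        if section not in login_info:
            continue  # optional 'mail' section, or 'diaspora' already reported
        for key in keys:
            if key not in login_info[section]:
                missing.append('%s.%s' % (section, key))
    return missing


example = json.loads(
    '{"diaspora": {"pod-url": "https://pod.example", "username": "me"}}')
print(missing_login_keys(example))
# ['diaspora.password']
```

For a real file, the same function could be applied to the result of json.load() on logins.json.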


There are two ways to run this script. The manual testing mode uploads one particular post, e.g. with ./post_bot.py --staging-dir=testing --login-info=logins.json --test=staging/20181021/20181021_4XBeoKCnV1N/. The regular production mode is meant to be called periodically, e.g. from cron as ./post_bot.py --staging-dir=staging --login-info=logins.json --mail, and auto-selects the next eligible post to be sent, if any. The script defines the flag as --mail-errors; argparse accepts --mail as an unambiguous abbreviation of it.
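Before letting cron loose on the staging data, it can be useful to see which post the production mode would pick next without sending or moving anything. The following read-only sketch assumes the staging layout staging/YYYYMMDD/post-directory/ used in the example path above. The helper next_eligible_post is hypothetical, it parses directory names with datetime.strptime instead of dateutil, and, unlike the script, it skips empty day directories rather than deleting them.

```python
import datetime
import os


def next_eligible_post(staging_dir, now=None):
    """Return the path of the post that would be exported next, or None."""
    now = now or datetime.datetime.now()
    # Day directories start with a digit; this skips e.g. 'completed'.
    day_dirs = sorted(d for d in os.listdir(staging_dir) if d[:1].isdigit())
    for day in day_dirs:
        try:
            post_date = datetime.datetime.strptime(day, '%Y%m%d')
        except ValueError:
            continue  # not a YYYYMMDD directory name
        if post_date > now:
            return None  # the earliest remaining day is still in the future
        posts = sorted(os.listdir(os.path.join(staging_dir, day)))
        if posts:
            return os.path.join(staging_dir, day, posts[0])
    return None


if os.path.isdir('staging'):
    print(next_eligible_post('staging'))
```

Passing an explicit value for now makes it possible to preview what would be selected on a future date.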

For compatibility with the most recent version of diaspy, we are using python3 (e.g. install additionally with sudo apt-get install python3 python3-pip) and then install the additional packages with pip3 install python-dateutil requests diaspy-api bs4.

However, the latest package version of diaspy is already not working properly for image upload, so it may be necessary to download the latest version directly from GitHub and copy the contents of the "diaspy" subdirectory into /home/pi/post_bot as a local copy of the module.
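When both a pip-installed package and a local copy exist, it is easy to lose track of which one Python actually picks up. A small sketch using the standard importlib machinery can show where a module would be loaded from without importing it; the helper module_origin is hypothetical. It should be saved and run as a script in the same directory as post_bot.py, since the directory of the running script comes first on the module search path.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints None if diaspy cannot be found at all.
print(module_origin('diaspy'))
```

If the printed path points into /home/pi/post_bot, the local copy is the one in use.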

As with any of the code snippets in this project, this is merely meant as an inspiration for your own implementations and not as a usable/finished product in any sense.

When posting to an active social media platform, we should also take care not to overwhelm the stream with archive content and be ready to engage with readers on automatically posted content as well, as the goal should be to create new connections and conversations.

#!/usr/bin/env python3

import argparse
import datetime
from email.mime.text import MIMEText
import glob
from io import StringIO
import json
import logging
import logging.handlers
import os
import smtplib
import shutil
import sys

import dateutil.parser
import diaspy

ISO_DATE = '%Y%m%d'
TOOL_NAME = 'G+ post-bot'

def send_error_message(txt, email_info):
  """Send a crash/error message to a configured email address."""
  server = smtplib.SMTP(email_info['smtp-server'])
  server.starttls()
  server.login(email_info['username'], email_info['password'])
  msg = MIMEText(txt)
  msg['From'] = email_info['username'] 
  msg['To'] =  email_info['recipient']
  msg['Subject'] = 'error message from %s on %s' % (TOOL_NAME, os.uname()[1])
  server.sendmail(email_info['username'], email_info['recipient'], msg.as_string())
  server.quit()
  

def post_to_diaspora(post_dir, login_info):
  """Load a post from staging directory and send to diaspora server."""
  cwd = os.getcwd()
  os.chdir(post_dir)
  # Read the post body; the context manager closes the file again.
  with open('content.md') as f:
    content = f.read()
  images = sorted(glob.glob('img_*.jpg'))

  c = diaspy.connection.Connection(pod=login_info['pod-url'],
                                   username=login_info['username'],
                                   password=login_info['password'])
  c.login()
  stream = diaspy.streams.Stream(c)
  if not images:
    stream.post(content, provider_display_name = TOOL_NAME)
  else:
    ids = [stream._photoupload(name) for name in images]
    stream.post(content, photos=ids, provider_display_name=TOOL_NAME)
  os.chdir(cwd)

# --------------------
parser = argparse.ArgumentParser(description='Export staged posts to a diaspora* pod')
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--login-info', dest='login_info', action='store', required=True)
parser.add_argument('--test', dest='test_data', action='store')
parser.add_argument('--mail-errors', dest='mail', action='store_true')

args = parser.parse_args()

# Set up logging to both syslog and a memory buffer.
log_buffer = StringIO()    
logging.basicConfig(stream=log_buffer, level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
syslog = logging.handlers.SysLogHandler(address='/dev/log')
syslog.setFormatter(logging.Formatter('diaspora-post-bot: %(levelname)s %(message)s'))
logging.getLogger().addHandler(syslog)

# Load login/authentication data from a separate file.
login_info = json.load(open(args.login_info))

if 'diaspora' not in login_info:
  print('%s does not contain diaspora login section' % args.login_info)
  sys.exit(1)

if args.test_data:
  # Directly load a post staging directory to diaspora.
  post_to_diaspora(args.test_data, login_info['diaspora'])
else:
  # Find next post directory and load to diaspora.
  # Intended to run un-attended from cron-job or similar at periodic intervals (e.g. every 2h)
  try:
    logging.info('starting export from %s' % args.staging_dir)
    dirs = sorted(glob.glob(os.path.join(args.staging_dir, '[0-9]*')))
    if not dirs:
      logging.info('no more data to export')
      sys.exit(0)
    next_dir = dirs[0]

    # Check if post date for next staging directory has been reached.
    if dateutil.parser.parse(os.path.basename(next_dir)) > datetime.datetime.now():   
      logging.info('next dir not yet ready for export: %s' % os.path.basename(dirs[0]))
      sys.exit(0)
    logging.info('found next active staging directory %s' % next_dir)

    # Find next post in staging directory or delete staging directory when empty.
    posts = sorted(os.listdir(next_dir))
    if not posts:
      logging.info('deleting empty staging directory: %s' % next_dir)
      os.rmdir(next_dir)
      sys.exit(0)

    # Move exported posts to a backup directory.
    completion_dir = os.path.join(args.staging_dir, 'completed')
    if not os.path.exists(completion_dir):
      os.makedirs(completion_dir)

    # Send next post to diaspora server.
    post_dir = os.path.join(next_dir, posts[0])
    logging.info('posting %s...' % post_dir)
    post_to_diaspora(post_dir, login_info['diaspora'])
    shutil.move(post_dir, completion_dir)
    logging.info('post completed')
    sys.exit(0)
  except (KeyboardInterrupt, SystemExit):
    # Pass deliberate exits through unchanged so that sys.exit(0) keeps its status.
    raise
  except Exception as e:
    logging.exception('error in main loop')
    if args.mail and 'mail' in login_info:
      send_error_message(log_buffer.getvalue(), login_info['mail'])
    sys.exit(1)