Saturday, April 20, 2019

Email to Diaspora* Posting Bot

What I still miss most after moving from G+ to Diaspora* for my casual public social-network posting is a well-integrated mobile app for posting on the go.

The main use-case for me is posting photos on the go, which I now mostly take on my cellphone and minimally process with Google Photos.

One of the problems with the mobile app for Diaspora* (Dandelion in the case of Android) is that the size limit for photo uploads is quite small compared to the resolution of today's cellphone cameras. There is also not much point in uploading high-resolution images for purely on-screen consumption to an infrastructure managed by volunteers on a shoestring budget. I also liked the ability to geo-tag mobile posts by explicitly selecting a nearby landmark to slightly obfuscate the current location.

For a few weeks now, I have been sharing my account with a G+ archive bot that is uploading recycled posts from the takeout archive (see here for the first part of the series describing the process). I like the structured formatting and meta-data tags that come from automated processing, and since my bot seems to be getting more likes than I do, why not keep it around?

I am a heavy email user, and email clients are well integrated into the sharing functions of both the Android and iOS mobile platforms. Since the posting bot is already using a free web-mail account for error reporting, it would be easy to use the same account for sending emails to the bot for processing and posting. Only emails originating from my own address(es) should be converted into a post. Thanks to the DKIM domain authentication used by most major email providers today, we can somewhat trust the authenticity of the sender information in the header.

The new bot uses the POP3 protocol to access the inbox of the hosted email account, downloads the emails, checks the senders and extracts in particular the plain-text and image-attachment parts. If available, Exif GPS data is extracted from the images and reverse-geocoded using OpenStreetMap to the rough neighborhood of where the image was taken (see previous post). The images are rotated and scaled to a maximum size for upload. Some simple, hard-coded "business rules" are used to generate additional hashtags for common use-cases - primarily photo sharing or link sharing.

The post is then staged in the same format and directory structure as for the takeout archive processor, so that the same posting bot can be re-used.
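Based on the directory naming in mail_bot.py below, a staged post ends up in a per-day, per-timestamp sub-directory of the staging area, for example (the dates and times here are illustrative):

staging/20190420/20190420_173000/content.md
staging/20190420/20190420_173000/img_1.jpg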

Similarly, we can run the new combination of email processor and Diaspora* exporter from the crontab on a Raspberry Pi or some other Linux-based always-on server platform:
PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin
19 * * * * /home/pi/mail_bot/mail_bot.sh
Where the mail_bot.sh script is as follows:
#!/bin/sh

cd /home/pi/mail_bot
./mail_bot.py --login-info=./logins.json --staging-dir=./staging  --mail-errors
/home/pi/post_bot/post_bot.py --staging-dir=./staging --login-info=./logins.json --mail-errors
The email-processing component is mail_bot.py, shown below. It depends on the module exif2hashtag.py from the previous post as well as on the additional packages dateutil, dkimpy, html2text and PIL/Pillow, which can again be installed with pip3 install python-dateutil dkimpy html2text Pillow.

The mail section in the logins.json file requires two additional fields: 'pop-server' with the name or address of the email account's POP3 service, and 'authorized-senders' with a list of email addresses whose messages will be transformed into Diaspora* posts.
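For reference, the extended "mail" section of logins.json might then look like this (same placeholder style as the example in the Part VIII post below):

  "mail": {
    "smtp-server": "<SMTP mail server address, e.g. mail.gmx.net>",
    "pop-server": "<POP3 mail server address for the same account>",
    "username": "<username, typically email-address>",
    "password": "<clear text password for email account>",
    "recipient": "<recipient email address for error messages>",
    "authorized-senders": ["<email address whose messages may be posted>"]
  }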
#!/usr/bin/env python3

import argparse
import datetime
import email
from email.mime.text import MIMEText
import io
from io import StringIO
from io import BytesIO
import json
import logging
import logging.handlers
import os
import poplib
import shutil
import smtplib
import sys

import dateutil.parser
import dkim
import html2text
import PIL.Image 

import exif2hashtag

ISO_DATE = '%Y%m%d'
ISO_DATETIME = ISO_DATE + '_%H%M%S'

# Extra hashtags for sites whose links I might be sharing from a mobile reader.
SITES = {
  'www.republik.ch' : ['Republik', 'News', 'media', 'lang_de', 'CH', 'Switzerland'],
  'www.tagesanzeiger.ch' : ['Tagesanzeiger', 'news', 'media', 'lang_de', 'CH', 'Switzerland'],
  'www.youtube.com' : ['YouTube'],
  'wikipedia.org' : ['Wikipedia'],
  'blog.kugelfish.com' : ['Blog', 'mywork', 'CC-BY', 'technology', 'programming'],
}

def send_error_message(txt, email_info):
  """Send a crash/error message to a configured email address."""
  server = smtplib.SMTP(email_info['smtp-server'])
  server.starttls()
  server.login(email_info['username'], email_info['password'])
  msg = MIMEText(txt)
  msg['From'] = email_info['username'] 
  msg['To'] =  email_info['recipient']
  msg['Subject'] = 'error message from %s on %s' % ('mail-bot', os.uname()[1])
  server.sendmail(email_info['username'], email_info['recipient'], msg.as_string())
  server.quit()

def validate(authorized_senders, sender, msg):
  """Check DKIM message signature and whether message is from an approved sender."""
  if not dkim.verify(msg):
    return False
  for s in authorized_senders:
    if s in sender:
      return True
  return False

def header_decode(hdr):
  """Decode RFC2047 headers into unicode strings."""
  str, enc = email.header.decode_header(hdr)[0]
  if enc:
    return str.decode(enc)
  else:
    return str

def export_image(img, outdir, num, max_size):
  """Reformat and stage image for posting to diaspora."""
  exif_info = exif2hashtag.get_exif_info(img)
  gps_info = exif2hashtag.get_gps_info(exif_info)
  latlon = exif2hashtag.get_latlon(gps_info)
  orientation = exif_info.get('Orientation', None)
  if orientation:
    if orientation == 3:
      img=img.rotate(180, expand=True)
    elif orientation == 6:
      img=img.rotate(270, expand=True)
    elif orientation == 8:
      img=img.rotate(90, expand=True)

  destination = os.path.join(outdir, 'img_%d.jpg' % num)
  source_size = max(img.size[0], img.size[1])
  if max_size and source_size >= max_size:
    scale = float(max_size) / float(source_size)
    img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)), PIL.Image.LANCZOS)
  img.save(destination, 'JPEG')
  return exif2hashtag.get_location_hashtags(latlon)


def export_message(msg, outdir, image_size):
  """Stage message for posting to diaspora."""
  hashtags = ['mailbot']
  content = []
  title = header_decode(msg.get('Subject'))
  if title:
    content.append('### ' + title)
    content.append('')
  img_count = 0
  for part in msg.walk():
    if part.get_content_type() == 'text/html':
      txt = part.get_payload(decode=True).decode("utf-8")
      for str, tags in SITES.items():
        if str in txt:
          hashtags.extend(tags)
      converter = html2text.HTML2Text()
      converter.ignore_links = True
      converter.body_width = 0
      content.append(converter.handle(txt))
    elif part.get_content_type() == 'text/plain':
      # Assumption: append the decoded plain-text body as post text.
      content.append(part.get_payload(decode=True).decode('utf-8'))
    elif part.get_content_type() == 'image/jpeg':
      img_count += 1
      data = BytesIO()
      data.write(part.get_payload(decode=True))
      data.seek(0)
      img = PIL.Image.open(data)
      for tag in export_image(img, outdir, img_count, image_size):
        if not tag in hashtags:
          hashtags.append(tag)
  
  if img_count > 0:
    hashtags = ['photo', 'photography', 'foto',  'myphoto', 'CC-BY'] + hashtags

  if hashtags:
    content.append(' '.join(('#' + tag for tag in hashtags)))

  content_file = io.open(os.path.join(outdir, 'content.md'), 'w', encoding='utf-8')
  content_file.write('\n'.join(content))
  content_file.close() 

#---------------------
parser = argparse.ArgumentParser(description='Convert incoming emails into staged Diaspora* posts')
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--login-info', dest='login_info', action='store', required=True)
parser.add_argument('--image-size', dest='image_size', action='store', type=int, default=1024)
parser.add_argument('--mail-errors', dest='mail', action='store_true')

args = parser.parse_args()

# Set up logging to both syslog and a memory buffer.
log_buffer = StringIO()    
logging.basicConfig(stream=log_buffer, level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
syslog = logging.handlers.SysLogHandler(address='/dev/log')
syslog.setFormatter(logging.Formatter('diaspora-mail-bot: %(levelname)s %(message)s'))
logging.getLogger().addHandler(syslog)

try:
  # Load login/authentication data from a separate file.
  login_info = json.load(open(args.login_info))
  email_info = login_info['mail']

  pop3 = poplib.POP3_SSL(email_info['pop-server'])
  pop3.user(email_info['username'])
  auth = pop3.pass_(email_info['password'])
  msg_count = pop3.stat()[0]

  logging.info('%d new messages on %s' % (msg_count, email_info['pop-server']))

  for msg_num in range(1, msg_count + 1):
    msg_txt = b'\n'.join(pop3.retr(msg_num)[1])
    msg = email.message_from_bytes(msg_txt)
    sender = msg.get('From')
    subject = msg.get('Subject')

    if not validate(email_info['authorized-senders'], sender, msg_txt):
      logging.info('dropping message from unauthorized sender "%s" - subject: "%s"' % (sender, subject))
      pop3.dele(msg_num)
      continue

    timestamp = dateutil.parser.parse(msg.get('Date'))
    outdir = os.path.join(args.staging_dir, timestamp.strftime(ISO_DATE), timestamp.strftime(ISO_DATETIME))
    if not os.path.exists(outdir):
      os.makedirs(outdir)

    try:
      export_message(msg, outdir, args.image_size)
      pop3.dele(msg_num)
    except:
      logging.info('error exporting msg %d - deleting directory %s' % (msg_num, outdir))
      shutil.rmtree(outdir, ignore_errors=True)
      raise
  pop3.quit()

except (KeyboardInterrupt, SystemExit):
  sys.exit(1) 
except Exception as e:
  logging.exception('error in main loop')
  if args.mail and 'mail' in login_info:
    send_error_message(log_buffer.getvalue(), login_info['mail'])
  sys.exit(1)


Thursday, April 18, 2019

Extracting location information from Photos

Photos exported from digital cameras often contain meta-data in Exif format (Exchangeable Image File Format). For images taken with cellphone cameras, this info typically also includes (GPS) location information of where the photo was taken.

Inspired by this previous post on the mapping of GPS lat/lon coordinates from Google+ location data to a rough description of the location, we could also use the location encoded in the photo itself.

We are again using the reverse-geocoding service from OpenStreetMap to find the names of the country and locality that contain the GPS coordinates.

For the purpose of public posting, reducing the accuracy of the GPS location to the granularity of the city, town or village provides some increased confidentiality about where the picture was taken, compared to the potentially meter-level accuracy of GPS data, which generally allows pinpointing the location down to a building and street address.

In Exif, fractional numbers are represented as ratios of integers. For example, the number 0.5 could be encoded as the tuple (5, 10). The coordinates in the Exif location meta-data are represented in the DMS (degrees, minutes, seconds) format, which needs to be converted into the DD (decimal degrees) format used by most GIS systems, including OpenStreetMap.
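As a small worked sketch of that conversion (the ratio tuples below are illustrative values that happen to correspond to the latitude 40.7414688 used as an example elsewhere in this series):

# Exif stores each DMS component as a (numerator, denominator) ratio.
lat_dms = ((40, 1), (44, 1), (292877, 10000))   # 40 deg, 44 min, 29.2877 sec
d, m, s = [float(num) / float(den) for num, den in lat_dms]
decimal_degrees = d + m / 60.0 + s / 3600.0     # ~= 40.7414688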


#!/usr/bin/env python

import sys

import geopy
import PIL.Image 
import PIL.ExifTags
import pycountry

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(latlon):
  """Reverse geo-code lat/lon coordinates ISO-code / country / municipality names."""
  hashtags = []
  if latlon:
    addr = geocoder.reverse((latlon[0], latlon[1])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags

def get_exif_info(img):
  """Decode Exif data in image."""
  ret = {}
  info = img._getexif()
  if not info:
    return ret
  for tag, value in info.items():
    decoded = PIL.ExifTags.TAGS.get(tag, tag)
    ret[decoded] = value
  return ret

def get_gps_info(info):
  """Decode GPSInfo sub-tags in Exif data."""
  ret = {}
  if not info or not 'GPSInfo' in info:
    return ret
  for tag, value in info['GPSInfo'].items():
    decoded = PIL.ExifTags.GPSTAGS.get(tag, tag)
    ret[decoded] = value
  return ret

def degrees_from_ratios(ratios):
  """Convert from Exif d/m/s array of ratios to floating point representation."""
  f = [(float(r[0]) / float(r[1])) for r in ratios]
  return f[0] + f[1] / 60.0 + f[2] / 3600.0

def get_latlon(gps_info):
  """Extract the GPS coordinates from the GPS Exif data and convert into fractional coordinates."""
  lat = gps_info.get('GPSLatitude', None)
  lat_hemi = gps_info.get('GPSLatitudeRef', None)
  lon = gps_info.get('GPSLongitude', None)
  lon_hemi = gps_info.get('GPSLongitudeRef', None)
  if lat and lat_hemi and lon and lon_hemi:
    return (degrees_from_ratios(lat) * (-1 if lat_hemi == 'S' else 1),
            degrees_from_ratios(lon) * (-1 if lon_hemi == 'W' else 1))
  else:
    return None

def get_camera(info):
  """Get Camera make & model as another example of Exif data."""
  if 'Make' in info and 'Model' in info:
    return '%s %s' % (info['Make'], info['Model'])
  else:
    return None
  
#------------------------------------------------------

for filename in sys.argv[1:]:
  image = PIL.Image.open(filename)
  exif_info = get_exif_info(image)
  gps_info = get_gps_info(exif_info)
  latlon = get_latlon(gps_info)
  print ('%s : %s %s' % (filename, get_camera(exif_info), get_location_hashtags(latlon)))

Friday, December 28, 2018

The Fallacy of distributed = good

I have recently been looking for an alternative social media platform and started using Diaspora* via the diasporing.ch pod. Not unlike the cryptocurrency community, the proponents of the various platforms in the Fediverse seem to rather uncritically advocate the distributed nature of these platforms as an inherently positive property, in particular when it comes to privacy and data protection.

I tend to agree with Yuval Harari, who argues in "Sapiens" that empires, or scaled, centralized forms of organization, are one of Homo Sapiens' significant cultural accomplishments. A majority of humans throughout history have lived as part of some sort of empire. Empires can provide prosperity and ensure lasting peace and stability - like the Pax Romana or, in my generation, the Pax Americana. We often have a love/hate relationship with empires - even many protesters who are busy burning American flags during the day secretly hope that their children will some day get into Harvard and have a better life. Libertarians seem to think of a land without much central governance as a place where strong individuals can realize their dreams of freedom and prosperity - like the romanticized frontier world of daytime TV westerns. My cynical self rather imagines a kind of post-apocalyptic Mad Max world, where our school bullies become sadistic local warlords. In European history, the thousand years after the fall of the Roman Empire, which featured highly distributed power structures, are called the dark Middle Ages for good reasons.

In the online world, many of us love to hate the big social media platforms, and yet billions of us return there every month. Maybe because this is where our friends, family and everybody else are as well, or because they generally offer a smooth and polished service and are good at giving us what we want? When it comes to security, the largest platforms can afford to invest more and have impressively competent security and operations teams to protect our data from being compromised. Like the Roman legions, they do not always succeed, and their failures are highly publicized. But more often than not they succeed, and looking at it dispassionately, our data is probably nowhere as safe as with one of the large providers of online or cloud services. Yes, the large platforms are driven by commercial interests, which also makes them predictable, as they have a lot to lose and tend to follow laws with pedantic sophistication.

I fail to see how a distributed architecture alone should inherently improve privacy or data protection. For most of us in the Fediverse, our pod-admins, rather than we ourselves, are de facto in possession and control of our data. De jure, they don't have to worry much about data protection and privacy laws because they are too small to be on the radar of any regulatory agency. Pods can disappear from the network at any time without warning, and account migration between pods is generally not trivial, if possible at all (for example, Diaspora* currently allows profile export, but not yet import into another pod).

On the bright side, the Fediverse would allow any of us who are tech-savvy and dedicated enough to run our own pod and become the admins of our own data and lords of our own domain. But in reality, how many of us are really doing this?

For the rest of us, what is left to do is to choose carefully which pod to join. Maybe one that is run by more than one person, a cooperative, club or association? Maybe see whether we could contribute to its operation either financially or through volunteering. And always be nice to our pod-admins, not just because they essentially own our social-media persona, but because they generally do a tedious and thankless labor of love and on top of that most likely also bear the brunt of the financial burden.

While architecturally, operationally and/or organizationally distributed systems may be interesting and may have some advantages as well as disadvantages, we should not automatically assume that they are better just because they are distributed.

Sunday, December 16, 2018

Google+ Migration - Part VIII: Export to Diaspora*

<- Part VII: Conversion & Staging

The last stage of the process is to finally export the converted posts to Diaspora*, the chosen target system. As we want these posts to appear slowly and close to their original post-date anniversaries, this process is going to be drawn out over at least one year.

While we could do this by hand, it should ideally be done by some automated process. For this to work, we need some kind of server-type machine that is up and running and connected to the Internet frequently enough during a whole year.

The resource requirements are quite small, except for storing the staged data, which for some users could easily run into multiple gigabytes, mostly depending on the number of posts with images.

Today it is quite easy to get small & cheap virtual server instances from any cloud provider; for example, the micro-sized Compute Engine instances on Google Cloud should even be part of the free tier.

I also still have a few of the small, low-power Raspberry Pi boards lying around, one of which has been mirroring my public G+ posts to Twitter since 2012 and is still active today.

An additional challenge is that Diaspora* does not at this point offer an official and supported API. The diaspy Python API package is essentially "screen-scraping" the callback handler URLs of the corresponding Diaspora* server and might break easily when the server is upgraded to a new version, which happens several times per year on a well-maintained pod. For that reason, we are adding support for sending error logs, including exception stack traces, to an external email system so that we can hopefully notice quickly if/when something goes wrong.

I am planning to run the following script about every 3 hours on my network-connected Raspberry Pi server using cron, with the following crontab entry (see instructions for setting up a crontab entry):

PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin
42 */3 * * * cd /home/pi/post_bot && ./post_bot.py --staging-dir=staging --login-info=logins.json --mail

This runs at minute 42 of every hour divisible by 3, every day, assuming there is a directory at /home/pi/post_bot containing the following script as post_bot.py, a sub-directory staging/ with the data generated using the process described in the previous episode, and a file logins.json containing the login credentials for the Diaspora* pod and optionally an email service to be used for error notifications.

While storing passwords in clear text on a server is a certifiably bad idea, we at least avoid hard-coding them in the script and store them in a separate file instead, using JSON format, since we are already heavily using JSON for this project. The login credentials file has the following format, with the "mail" section being optional:

{
  "diaspora": {
    "pod-url": "<URL for diaspora pod, e.g. https://diasporing.ch>",
    "username": "<username valid on this pod>",
    "password": "<clear text password for diaspora pod account>"
   },
  "mail": {
    "smtp-server": "<SMTP mail server address, e.g. mail.gmx.net>",
    "username": "<username, typically email-address>",
    "password": "<clear text password for email account>",
    "recipient": "<recipient email address for error messages>"
  }
}


There are two ways to run this script: a manual testing mode to upload a particular post, e.g. with ./post_bot.py --staging-dir=testing --login-info=logins.json --test=staging/20181021/20181021_4XBeoKCnV1N/, and the regular production mode to be called periodically from cron, e.g. as ./post_bot.py --staging-dir=staging --login-info=logins.json --mail, which auto-selects the next eligible post to be sent, if any.

For compatibility with the most recent version of diaspy, we are using Python 3 (e.g. install additionally with sudo apt-get install python3 python3-pip) and install the additional packages with pip3 install python-dateutil requests diaspy-api bs4.

However, the latest package version of diaspy is already not working properly for image upload, so it may be necessary to download the latest version directly from GitHub and copy the contents of the "diaspy" subdirectory into /home/pi/post_bot as a local copy of the module.

As with any of the code snippets in this project, this is merely meant as an inspiration for your own implementations and not as a usable/finished product in any sense.

When posting to an active social media platform, we should also be considerate and not overwhelm the stream with archive content, and be ready to engage with readers also on automatically posted content, as the goal should be to create new connections and conversations.

#!/usr/bin/env python3

import argparse
import datetime
from email.mime.text import MIMEText
import glob
from io import StringIO
import json
import logging
import logging.handlers
import os
import smtplib
import shutil
import sys

import dateutil.parser
import diaspy

ISO_DATE = '%Y%m%d'
TOOL_NAME = 'G+ post-bot'

def send_error_message(txt, email_info):
  """Send a crash/error message to a configured email address."""
  server = smtplib.SMTP(email_info['smtp-server'])
  server.starttls()
  server.login(email_info['username'], email_info['password'])
  msg = MIMEText(txt)
  msg['From'] = email_info['username'] 
  msg['To'] =  email_info['recipient']
  msg['Subject'] = 'error message from %s on %s' % (TOOL_NAME, os.uname()[1])
  server.sendmail(email_info['username'], email_info['recipient'], msg.as_string())
  server.quit()
  

def post_to_diaspora(post_dir, login_info):
  """Load a post from staging directory and send to diaspora server."""
  cwd = os.getcwd()
  os.chdir(post_dir)
  content = open('content.md').read()
  images = sorted(glob.glob('img_*.jpg'))

  c = diaspy.connection.Connection(pod=login_info['pod-url'],
                                   username=login_info['username'],
                                   password=login_info['password'])
  c.login()
  stream = diaspy.streams.Stream(c)
  if not images:
    stream.post(content, provider_display_name = TOOL_NAME)
  else:
    ids = [stream._photoupload(name) for name in images]
    stream.post(content, photos=ids, provider_display_name=TOOL_NAME)
  os.chdir(cwd)

# --------------------
parser = argparse.ArgumentParser(description='Post staged content to a Diaspora* pod')
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--login-info', dest='login_info', action='store', required=True)
parser.add_argument('--test', dest='test_data', action='store')
parser.add_argument('--mail-errors', dest='mail', action='store_true')

args = parser.parse_args()

# Set up logging to both syslog and a memory buffer.
log_buffer = StringIO()    
logging.basicConfig(stream=log_buffer, level=logging.INFO)
logging.getLogger().setLevel(logging.INFO)
syslog = logging.handlers.SysLogHandler(address='/dev/log')
syslog.setFormatter(logging.Formatter('diaspora-post-bot: %(levelname)s %(message)s'))
logging.getLogger().addHandler(syslog)

# Load login/authentication data from a separate file.
login_info = json.load(open(args.login_info))

if not 'diaspora' in login_info:
  print('%s does not contain diaspora login section' % args.login_info)
  sys.exit(1)

if args.test_data:
  # Directly load a post staging directory to diaspora.
  post_to_diaspora(args.test_data, login_info['diaspora'])
else:
  # Find next post directory and load to diaspora.
  # Intended to run un-attended from cron-job or similar at periodic intervals (e.g. every 2h)
  try:
    logging.info('starting export from %s' % args.staging_dir)
    dirs = sorted(glob.glob(os.path.join(args.staging_dir, '[0-9]*')))
    if not dirs:
      logging.info('no more data to export')
      sys.exit(0)
    next_dir = dirs[0]

    # Check if post date for next staging directory has been reached.
    if dateutil.parser.parse(os.path.basename(next_dir)) > datetime.datetime.now():   
      logging.info('next dir not yet ready for export: %s' % os.path.basename(dirs[0]))
      sys.exit(0)
    logging.info('found next active staging directory %s' % next_dir)

    # Find next post in staging directory or delete staging directory when empty.
    posts = sorted(os.listdir(next_dir))
    if not posts:
      logging.info('deleting empty staging directory: %s' % next_dir)
      os.rmdir(next_dir)
      sys.exit(0)

    # Move exported posts to a backup directory.
    completion_dir = os.path.join(args.staging_dir, 'completed')
    if not os.path.exists(completion_dir):
      os.makedirs(completion_dir)

    # Send next post to diaspora server.
    post_dir = os.path.join(next_dir, posts[0])
    logging.info('posting %s...' % post_dir)
    post_to_diaspora(post_dir, login_info['diaspora'])
    shutil.move(post_dir, completion_dir)
    logging.info('post completed')
    sys.exit(0)
  except (KeyboardInterrupt, SystemExit):
    sys.exit(1) 
  except Exception as e:
    logging.exception('error in main loop')
    if args.mail and 'mail' in login_info:
      send_error_message(log_buffer.getvalue(), login_info['mail'])
    sys.exit(1)



Sunday, December 9, 2018

Google+ Migration - Part VII: Conversion & Staging

<- Part VI: Location, Location, Location 

We are now ready to put all the pieces together for exporting to Diaspora*, the new target platform.

If we had some sort of "Minitrue" permissions to rewrite history on the target system, the imported posts could appear to always have been there since their original G+ posting date.

However, since we only have regular user permissions, the only choice is to post them as new posts at some future point in time. The most straightforward way to upload the archive would be to re-post in chronological order as quickly as possible without causing overload.

If the new account is not only used for archive purposes, we may want to maximize the relevance of the archive posts in the new stream. In this case, a better way would be to post each archive post on the anniversary of its original post-date, creating some sort of "this day in history" series. This would require the upload activity to be scheduled over at least a year, causing some operational challenges.

In order to minimize the risk of things going wrong while generating the new posts during this drawn-out, hopefully unattended and automated posting process, we try to do as much of the conversion as possible in a single batch and stage the converted output to be uploaded/posted to the destination system at some planned future time. This also allows easier inspection of the generated output, or adapting the process for a different destination system, e.g. a blog.

The following Python script reads a list of post filenames from the takeout archive, extracts the relevant information from the JSON object in each file and generates the new post content in Markdown format. Besides being the input format for Diaspora*, Markdown is widely used and can easily be converted into other formats, including HTML. The list of posts we want to export can be generated using the post_filter.py script from part IV of this series. We have also downloaded the images referenced in any of these posts using the image_cache.py script from part V and stored them in a location like /tmp/images.

Most of my posts are either photo or link sharing, with just a line or two of commentary - more towards a Twitter use-case than the long-form posts that G+ would support equally well. The script contains several assumptions that are optimized for this use-case. For example, HTML links are stripped from the text content, assuming that each post has only one prominent link that is being shared. Many of my photo-sharing posts contain location information, which is extracted here into additional hashtags as well as a location link on OpenStreetMap.

Hashtags are a more central concept on Diaspora* than they were on G+. Besides some static pre-defined hashtags that identify the posts as automated reposts from G+, additional hashtags are added based on the type of post - e.g. photo sharing, stripped-down re-sharing of another post, sharing of a YouTube video or high-level geo-location info.

Before running the conversion & staging script, we need to decide on which day in the future we want to start posting the archive. Given a staging directory, e.g. /tmp/stage_for_diaspora, the script will create a sub-directory for each day that contains scheduled post activity. In each daily schedule directory, the script creates a unique sub-directory containing a content.md file with the new post text in Markdown as well as any images to be attached. The unique name for each post consists of the date of the original post plus what seems to be a unique ID in the post URL, in the absence of a real unique post ID in the JSON file. For example, a post originally posted on Jul 14 2018 would be stored in /tmp/stage_for_diaspora/20190714/20180714_C3RUWSDE7X7/content.md and formatted as:

Port Authority Inland Terminal - from freight hub to Internet switching center.

#repost #bot #gplusarchive #googleplus #throwback #photo #photography #US #UnitedStates #NYC

[111 8th Ave](https://www.openstreetmap.org/?lat=40.7414688&lon=-74.0033873&zoom=17)
Originally posted Sat Jul 14, 2018 on Google+ (Alte Städte / Old Towns)

Or the post which shared the link to the first part of this series would be re-formatted as:

Starting to document the process of migrating my public post stream to diaspora*.  
  
The plan is to process the takeout archive in Python and generate (somewhat) equivalent diaspora* posts using diaspy.  

#repost #bot #gplusarchive #googleplus #throwback

Originally posted Sun Oct 21, 2018 on Google+  (Google+ Mass Migration)

https://blog.kugelfish.com/2018/10/google-migration-part-i-takeout.html

The script also checks the current status of link URLs to avoid sharing a broken link. While we tell our children to be careful since "the Internet never forgets", in reality many links are gone after just a few years - the whole G+ site soon being an example of that.

Since Diaspora* is not particularly well optimized for photo processing, and to help save storage cost on the pod server, the script can also downscale images to a fixed maximum size that is suitable for on-screen display.

For example, by running the script as
./post_transformer.py --image-dir=/tmp/images --staging-dir=/tmp/stage_for_diaspora --start-date=20191001 --image-size=1024 < /tmp/public_posts.txt
we are assuming that we want to start publishing on Oct 1 2019, that the images are located in /tmp/images and should be limited to a maximum size of 1024 pixels for publishing, and that the whole output will be staged in /tmp/stage_for_diaspora.

Since this script does not do any posting itself, we can run it as many times as we need to, inspect the output and make adjustments as necessary. Link URL checking and geo-coding (see part VI) require network access from the machine where the script is being executed. In principle, we could manually post the generated output to some target system, but in a future episode, we will demonstrate a fully automated way of posting to Diaspora*.

In addition to what is already included in the Python standard library (2.7), we need the additional packages python-dateutil, geopy, html2text, Pillow, pycountry and requests, which can be installed for example using pip: pip install python-dateutil geopy html2text Pillow pycountry requests


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import argparse
import codecs
import datetime
import io
import json
import os
import sys

import dateutil.parser
import geopy.geocoders
import html2text
import PIL.Image 
import pycountry
import requests

ISO_DATE = '%Y%m%d'

HASHTAGS = ['repost', 'bot', 'gplusarchive', 'googleplus', 'throwback']

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(loc):
  """Return hashtags related to the location of the post: ISO country code, country name, city/town."""
  hashtags = []
  if 'latitude' in loc and 'longitude' in loc:
    addr = geocoder.reverse((loc['latitude'], loc['longitude'])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags


def get_location_link(loc):
  """Return a link to OpenStreetMap for the post location."""
  if 'latitude' in loc and 'longitude' in loc and 'displayName' in loc:
    map_url = ('https://www.openstreetmap.org/?lat=%s&lon=%s&zoom=17' % (loc['latitude'], loc['longitude']))
    return '[%s](%s)' % (loc['displayName'], map_url)
  else:
    return None


def validate_url(url):
  """Veify whether a URL still exists, including a potential redirect."""
  user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                 + ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
  try:
    r = requests.get(url, headers=user_agent)
    if r.status_code != 200:
      return None
    return r.url
  except requests.ConnectionError:
    return None
  

def get_image_name(resource_name):
  """Generate image cache name for media resource-name."""
  return resource_name.replace('media/', '', 1) + '.jpg'


def copy_downscale_image(source, destination, max_size):
  """Copy a downscaled version of the image to the staging location."""
  img = PIL.Image.open(source)
  source_size = max(img.size[0], img.size[1])
  if not max_size or source_size <= max_size:
    img.save(destination, 'JPEG')
  else:
    scale = float(max_size) / float(source_size)
    img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)), PIL.Image.LANCZOS)
    img.save(destination, 'JPEG')


def parse_post(post_json):
  """Extract relevant information from a JSON formatted post."""
  post_date = dateutil.parser.parse(post_json['creationTime'])
  content = post_json['content'] if 'content' in post_json else ''
  link = post_json['link']['url'] if 'link' in post_json else ''

  hashtags = HASHTAGS[:] # make a copy
  images = []

  if 'media' in post_json:
    media = post_json['media']
    if media['contentType'] == 'video/*' and 'youtube' in media['url']:
    # if the media is a youtube URL, convert into a link-sharing post
      link = media['url']
      hashtags = hashtags + ['video', 'YouTube']
    elif media['contentType'] == 'image/*':
      hashtags.extend(['photo', 'photography'])
      images.append(get_image_name(media['resourceName']))
    else:
      return None # unsupported media format

  if 'album' in post_json:
    hashtags = hashtags + ['photo', 'photography']
    for image in post_json['album']['media']:
      if image['contentType'] == 'image/*':
        images.append(get_image_name(image['resourceName']))
    if len(images) == 0:
      return None # no supported image attachment in album

  # If a shared post contains a link, extract that link
  # and give credit to original poster.
  if 'resharedPost' in post_json:
    if 'link' in post_json['resharedPost']:
      link = post_json['resharedPost']['link']['url']
      content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
      hashtags.append('reshared')
    else:
      return None # reshare without a link attachment

  acl = post_json['postAcl']
  post_context = {}
  if 'communityAcl' in acl:
    post_context['community'] = acl['communityAcl']['community']['displayName']

  if 'location' in post_json:
    hashtags.extend(get_location_hashtags(post_json['location']))
    location_link = get_location_link(post_json['location'])
    if location_link:
      post_context['location'] = location_link

  return (content, link, hashtags, post_date, post_context, images)


def format_content(content, link, hashtags, post_date, post_context):
  """Generated a Markdown formatted string from the pieces of a post."""
  output = []
  if content:
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    converter.body_width = 0
    output.append(converter.handle(content))
  if hashtags:
    output.append(' '.join(('#' + tag for tag in hashtags)))
    output.append('')
  if 'location' in post_context:
    output.append(post_context['location'])
  if post_date:
    output.append('Originally posted %s on Google+ %s' 
                    % (post_date.strftime('%a %b %d, %Y'),
                       '  (' + post_context['community'] + ')' if 'community' in post_context else ''))
    output.append('')
  if link:
    output.append(link)
    output.append('')
  return u'\n'.join(output)


def get_post_directory(outdir, post_date, start_date, url):
  """Generate staging output directory based on schedule date & post unique ID."""
  post_id = post_date.strftime(ISO_DATE) + '_' + url.split('/')[-1]
  schedule_date = post_date.replace(year=start_date.year, tzinfo=None)
  if schedule_date < start_date:
    schedule_date = schedule_date.replace(year=schedule_date.year + 1)
  return os.path.join(outdir, schedule_date.strftime(ISO_DATE), post_id)
  

# --------------------
parser = argparse.ArgumentParser(description='Convert and stage posts from a G+ takeout archive')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
parser.add_argument('--staging-dir', dest='staging_dir', action='store', required=True)
parser.add_argument('--image-size', dest='image_size', action='store', type=int)
parser.add_argument('--start-date', dest='start_date', action='store', type=dateutil.parser.parse, required=True)
parser.add_argument('--refresh', dest='refresh', action='store_true')
args = parser.parse_args()

if not os.path.exists(args.image_dir):
  sys.stderr.write('image-dir not found: ' + args.image_dir + '\n')
  sys.exit(-1)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print ('staging directory: %s' % args.staging_dir)
print ('publish start date: %s' % args.start_date.strftime(ISO_DATE))

count = 0
for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))
  post_data = parse_post(post)

  if post_data:
    content, link, hashtags, post_date, post_context, images = post_data
    post_dir = get_post_directory(args.staging_dir, post_date, args.start_date, post['url'])

    if not args.refresh and os.path.exists(post_dir):
      continue

    # Avoid exporting posts with stale links.
    if link:
      link = validate_url(link)
      if not link:
        print ('\nURL %s not found, skipping export for %s' % (post_data[1], post_dir))
        continue

    # Output content in Markdown format to staging location.
    if not os.path.exists(post_dir):
      os.makedirs(post_dir)
     
    content_file = io.open(os.path.join(post_dir, 'content.md'), 'w', encoding='utf-8')
    content_file.write(format_content(content, link, hashtags, post_date, post_context))
    content_file.close()

    for i, image in enumerate(images):
      source = os.path.join(args.image_dir, image)
      destination = os.path.join(post_dir, 'img_%d.jpg' % i)
      copy_downscale_image(source, destination, args.image_size)
      
    count += 1
    sys.stdout.write('.')
    sys.stdout.flush()
    
print ('%d posts exported to %s' % (count, args.staging_dir))    


Thursday, November 29, 2018

Google+ Migration - Part VI: Location, Location, Location!

<- Part V: Image Attachments

Before we focus on putting all the pieces together, here is a small, optional excursion into how to make use of the location information contained in G+ posts.

We should consider carefully if and how we want to include geo-location information, as there might be privacy and safety implications. Where this is a concern, it can make sense to choose the point of a nearby landmark or add some random noise to the location coordinates.
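As a minimal sketch of the latter idea (not used by the scripts in this series; the radius of roughly 0.005 degrees, i.e. around 500 m, is an arbitrary choice):

import random

def fuzz_location(lat, lon, radius_deg=0.005):
  """Shift coordinates by a random offset of up to radius_deg in each direction."""
  return (lat + random.uniform(-radius_deg, radius_deg),
          lon + random.uniform(-radius_deg, radius_deg))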

Many of my public photo-sharing posts contain the location of a spot near where the photos were taken. Diaspora* posts can contain a location tag as well, but it does not seem to be very informative, and the diaspy API currently does not support adding a location to a post.

Instead, we can process the location information contained in the takeout JSON files of the posts and extract some information which we can use when formatting the new posts.

In particular, we want to include a link to the corresponding location on OpenStreetMap as well as generate some additional hashtags from the location information, e.g. which country or city the post relates to.

Using the longitude & latitude coordinates from the location info, we can directly link to the corresponding location on OpenStreetMap or other online mapping services.

"location": {
    "latitude": 40.7414688,
    "longitude": -74.0033873,
    "displayName": "111 8th Ave",
    "physicalAddress": "111 8th Ave, New York, NY 10011, USA"
  }

In order to extract hierarchical location information like the country or city, we call the reverse-geocoding API of OpenStreetMap with the coordinates to find the nearest recorded address for that point. To simplify calling the web API, we can use the geopy library (install for example with pip install geopy).

From various components of the address, we can generate location hashtags that help define the context of the post. The use of the additional pycountry module, which contains a library of canonical country names keyed by ISO-3166 country codes, is entirely optional but helps to create more consistent labels.

For the location record above, we can generate the following additional content snippets:

#US #UnitedStates #NYC
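Together with the Markdown-formatted location link that the get_location_link() function below produces for the same record (the same link appears in the staged example post in Part VII above):

[111 8th Ave](https://www.openstreetmap.org/?lat=40.7414688&lon=-74.0033873&zoom=17)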


#!/usr/bin/env python

import codecs
import geopy.geocoders
import json
import pycountry
import sys

geocoder = geopy.geocoders.Nominatim(user_agent='gplus_migration', timeout=None)

def get_location_hashtags(loc):
  hashtags = []
  if 'latitude' in loc and 'longitude' in loc:
    addr = geocoder.reverse((loc['latitude'], loc['longitude'])).raw
    if 'address' in addr:
      addr = addr['address']
      cc = addr['country_code'].upper()
      hashtags.append(cc)
      hashtags.append(pycountry.countries.get(alpha_2=cc).name.replace(' ',''))
      for location in ['city', 'town', 'village']:
        if location in addr:
          hashtags.append(addr[location].replace(' ', ''))
  return hashtags

def get_location_link(loc):
  if 'latitude' in loc and 'longitude' in loc and 'displayName' in loc:
    map_url = ('https://www.openstreetmap.org/?lat=%s&lon=%s&zoom=17' % (loc['latitude'], loc['longitude']))
    return '[%s](%s)' % (loc['displayName'], map_url)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))  
  if 'location' in post:
    print(' '.join(('#' + tag for tag in get_location_hashtags(post['location']))))
    print(get_location_link(post['location']))

Tuesday, November 27, 2018

Google+ Migration - Part V: Image Attachments

<- Part IV: Visibility Scope & Filtering

Google+ has always been rather good at dealing with photos - the photo functions were built on the foundation of Picasa and later spun out as Google Photos. Not surprisingly, the platform was popular with photographers, and many posts contain photos.
In the takeout archive, photo and media file attachments to posts are rather challenging to handle. In addition to the .json files containing each of the posts, the Takeout/Google+ Stream/Posts directory also includes two files for each image attached to a post. The basename is the originally uploaded filename, with a .jpg extension for the image file itself and a .jpg.metadata.csv extension for some additional information about the image.

If we originally attached an image cat.jpg to a post, there should now be a cat.jpg and cat.jpg.metadata.csv file in the post directory. However, if over the years we have been unimaginative in naming files and uploaded several cat.jpg images, there is now a name clash that the takeout archive resolves by arbitrarily naming the files cat.jpg, cat(1).jpg, cat(2).jpg and so on.

The main challenge for reconstituting posts is to identify which image file is being referenced from which post. The section of the JSON object which describes an image attachment looks like the example below. There is no explicit reference to the image filename in the archive, nor does the metadata file contain the resourceName indicated here. There is a URL in the metadata file as well, but unfortunately it does not seem to match. The only heuristic left to try would be to take the last part of the URL path as an indication of the original filename and try to find a file with the same name. However, this runs into the filename de-duplication issue above, where possibly the wrong photo would be linked to a post. For users with a combination of public and private posts, such mix-ups could lead to very unintended consequences.


"media": {
      "url": "https://lh3.googleusercontent.com/-_liTfYo1Wys/W9SR4loPEyI/AAAAAAACBxA/wD82E3TKRdYBfEXwkExPkUOj0MY5lKCKQCJoC/w900-h1075/cat.jpg",
      "contentType": "image/*",
      "width": 900,
      "height": 1075,
      "resourceName": "media/CixBRjFRaXBPQ21aY2tlQ3h1OFVpamZJMDNpa0lqa1BsSmZ3b1ZNOWRvZlp2Qg\u003d\u003d"
    }

It appears that at this time, we are unable to reliably reconstruct the post-to-image-file reference from the contents of the archive. The alternative is to download each of the URLs referenced in the post data from the Google static content server for as long as these resources are still available.

Fortunately, with the given URLs this is rather simple to do in Python. We can process the JSON files once again, find all the image references and download the images to a local cache, where they are stored with filenames derived from the (presumably) unique resource names. For further re-formatting of the posts, we can then refer to the downloaded images by their new unique names.
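As a small illustration, the media record shown above would end up in the image cache under a name derived from its resourceName, following the get_image_name() helper in the script below:

resource_name = 'media/CixBRjFRaXBPQ21aY2tlQ3h1OFVpamZJMDNpa0lqa1BsSmZ3b1ZNOWRvZlp2Qg=='
cache_name = resource_name.replace('media/', '', 1) + '.jpg'
# cache_name == 'CixBRjFRaXBPQ21aY2tlQ3h1OFVpamZJMDNpa0lqa1BsSmZ3b1ZNOWRvZlp2Qg==.jpg'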

We can use the filter command from the previous blog post to select which posts we are interested in (again, all public posts in this case) and pipe the output into this script to build up the image cache:

ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py --public --id communities/113390432655174294208 --id communities/103604153020461235235 --id communities/112164273001338979772 | ./image_cache.py --image-dir=./images


#!/usr/bin/env python

import argparse
import codecs
import json
import os
import sys
import urllib

def get_image_name(resource_name):
  return resource_name.replace('media/', '', 1) + '.jpg'

def process_image(media, image_dir):
  url = media['url']
  id = media['resourceName']
  if media['contentType'] != 'image/*':
    return
  if not url.startswith('http'): # patch for broken URLs...
    url = 'https:' + url
  target_name = os.path.join(image_dir, get_image_name(id))

  if os.path.exists(target_name):
    sys.stdout.write('.')
    sys.stdout.flush()
  else:
    print('retrieving %s as %s' % (url, target_name))
    urllib.urlretrieve(url, target_name)

# --------------------
parser = argparse.ArgumentParser(description='Collect post images referenced from a set of posts')
parser.add_argument('--image-dir', dest='image_dir', action='store', required=True)
args = parser.parse_args()

if not os.path.exists(args.image_dir):
  os.makedirs(args.image_dir)

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))  
  if 'media' in post:
    process_image(post['media'], args.image_dir)
  elif 'album' in post:
    for image in post['album']['media']:
      process_image(image, args.image_dir)