Sunday, November 18, 2018

Google+ Migration - Part IV: Visibility Scope & Filtering

<- Part III: Content Transformation

Circles, and with them the ability to share different content with different sets of people, were one of the big differentiators of Google+ over other platforms at the time, which typically had a fixed sharing model and visibility scope.

Circles were based on the observation that most people in real life interact with several "social circles" and often would not want these circles to mix. The idea of Google+ was that it should be possible to manage all these different circles under a single online identity (which should also match the "real name" identity of our governments' civil registries).

It turns out that while the observation of disjoint social circles was correct, most users prefer to use different platforms and online identities to make sure these circles don't inadvertently mix. Google+ tried hard to make sharing scopes obvious and unsurprising, but the model remained complex and hard to understand, and accidents were only ever one or two mouse-clicks away.

Nevertheless, many takeout archives may contain posts that were intended for very different audiences, and their visibility may still matter deeply to users. We present here a tool that can help analyze the sharing scopes present in a takeout archive and partition its content by selecting any subset of them.

The access control list (ACL) section of each post has grown even more complex over time with the introduction of communities and collections. In particular, there seem to be the following distinct ways of defining the visibility of a post (some of which can be combined):
  • Public
  • Shared with all my circles
  • Shared with my extended circles (presumably users in all my circles and their circles)
  • Shared with a particular circle
  • Shared with a particular user
  • Part of a collection (private or public)
  • Part of a community (closed or public)
Since my archive does not contain all these combinations, the code for processing the JSON definition of the post sharing and visibility scope is based on an inferred schema. Please report if you encounter any exceptions to this structure.
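
To inspect the raw ACL of an individual post (for example before reporting such an exception), a quick one-off helper along these lines can be used; the script name show_acl.py is just a suggestion:

#!/usr/bin/env python
# Hypothetical helper (show_acl.py): pretty-print the raw postAcl object of a
# single post file to see which of the ACL variants listed above it uses.
import json
import sys

post = json.load(open(sys.argv[1]))
print(json.dumps(post.get('postAcl', {}), indent=2, sort_keys=True))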

After saving the following Python code in a file, e.g. post_filter.py, and making it executable (chmod +x post_filter.py), we can start by analyzing the visibility scopes present in a list of post archive files:

$ ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py
1249 - PUBLIC 
227 - CIRCLE (Personal): circles/117832126248716550930-4eaf56378h22b473 
26 - ALL CIRCLES 
20 - COMMUNITY (Alte Städte / Old Towns): communities/103604153020461235235 
15 - EXTENDED CIRCLES 
9 - COMMUNITY (Raspberry Pi): communities/113390432655174294208 
1 - COMMUNITY (Google+ Mass Migration): communities/112164273001338979772 
1 - COMMUNITY (Free Open Source Software in Schools): communities/100565566447202592471

For my own purposes, I would consider all public posts as well as posts to public communities as essentially public, and any posts that were restricted to particular circles as essentially private. By carefully copying the community IDs from the output above, we can create the following filter condition to select only the filenames of these essentially public posts from the archive:

ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py --public --id communities/113390432655174294208 --id communities/103604153020461235235 --id communities/112164273001338979772

We can then use the resulting list of filenames to only process posts which are meant to be public, as sketched after the script below. In a similar way, we could also extract posts that were shared with a particular circle or community, e.g. to assist in building a joint post archive for a particular community across its members.

#!/usr/bin/env python

import argparse
import codecs
import json
import sys

class Visibility:
  PUBLIC = 'PUBLIC'
  CIRCLES = 'ALL CIRCLES'
  EXTENDED = 'EXTENDED CIRCLES'
  CIRCLE = 'CIRCLE'
  COLLECTION = 'COLLECTION'
  COMMUNITY = 'COMMUNITY'
  USER = 'USER'
  EVENT = 'EVENT'

def parse_acl(acl):
  result = []

  # Post is public or has a visibility defined by circles and/or users.
  if 'visibleToStandardAcl' in acl:
    if 'circles' in acl['visibleToStandardAcl']:
      for circle in acl['visibleToStandardAcl']['circles']:
        if circle['type'] == 'CIRCLE_TYPE_PUBLIC':
          result.append((Visibility.PUBLIC, None, None))
        elif circle['type'] == 'CIRCLE_TYPE_YOUR_CIRCLES':
          result.append((Visibility.CIRCLES, None, None))
        elif circle['type'] == 'CIRCLE_TYPE_EXTENDED_CIRCLES':
          result.append((Visibility.EXTENDED, None, None))
        elif circle['type'] == 'CIRCLE_TYPE_USER_CIRCLE':
          result.append((Visibility.CIRCLE, circle['resourceName'], circle.get('displayName', '')))
    if 'users' in acl['visibleToStandardAcl']:
      for user in acl['visibleToStandardAcl']['users']:
        result.append((Visibility.USER, user['resourceName'], user.get('displayName', '-')))

  # Post is part of a collection (could be public or private).
  if 'collectionAcl' in acl:
    collection = acl['collectionAcl']['collection']
    result.append((Visibility.COLLECTION, collection['resourceName'], collection.get('displayName', '-')))

  # Post is part of a community (could be public or closed).
  if 'communityAcl' in acl:
    community = acl['communityAcl']['community']
    result.append((Visibility.COMMUNITY, community['resourceName'], community.get('displayName', '-')))
    if 'users' in acl['communityAcl']:
      for user in acl['communityAcl']['users']:
        result.append((Visibility.USER, user['resourceName'], user.get('displayName', '-')))

  # Post is part of an event.
  if 'eventAcl' in acl:
    event = acl['eventAcl']['event']
    result.append((Visibility.EVENT, event['resourceName'], event.get('displayName', '-')))

  return result


#---------------------------------------------------------
parser = argparse.ArgumentParser(description='Filter G+ post JSON file by visibility')
parser.add_argument('--public', dest='scopes', action='append_const', const=Visibility.PUBLIC)
parser.add_argument('--circles', dest='scopes', action='append_const', const=Visibility.CIRCLES)
parser.add_argument('--ext-circles', dest='scopes', action='append_const', const=Visibility.EXTENDED)
parser.add_argument('--id', dest='scopes', action='append')

args = parser.parse_args()
scopes = frozenset(args.scopes) if args.scopes is not None else frozenset()

stats = {}
for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))  
  acls = parse_acl(post['postAcl'])
  for acl in acls:
    if len(scopes) == 0:
      stats[acl] = stats.get(acl, 0) + 1
    else:
      if acl[0] in (Visibility.PUBLIC, Visibility.CIRCLES, Visibility.EXTENDED) and acl[0] in scopes:
        print (filename)
      elif acl[1] in scopes:
        print (filename)
          
if len(scopes) == 0:
  sys.stdout = codecs.getwriter('utf8')(sys.stdout)
  for item in sorted(stats.items(), reverse=True, key=lambda x: x[1]):
    if item[0][0] in (Visibility.PUBLIC, Visibility.CIRCLES, Visibility.EXTENDED):
      print ('%d - %s' % (item[1], item[0][0]))
    else:
      print ('%d - %s (%s):\t %s' % (item[1], item[0][0], item[0][2], item[0][1]))
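
As a sketch of such a downstream step, the filtered filename list can simply be piped into another small script. The following hypothetical example (e.g. list_public.py, reading filenames from stdin just like post_filter.py does) merely prints the creation time of each selected post, but could be extended to feed each file into the content transformation from Part III:

#!/usr/bin/env python
# Hypothetical downstream consumer (list_public.py): read the filenames
# selected by post_filter.py from stdin and print a one-line summary per post.
import json
import sys

for filename in sys.stdin.readlines():
  filename = filename.strip()
  post = json.load(open(filename))
  print('%s  %s' % (post.get('creationTime', '?'), filename))

Invoked, for example, as:

$ ls ~/Desktop/Takeout/Google+\ Stream/Posts/*.json | ./post_filter.py --public | ./list_public.py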




Sunday, November 11, 2018

Google+ Migration - Part III: Content Transformation

<- Part II: Understanding the takeout archive 

After we have had a look at the structure of the takeout archive, we can build some scripts to translate the content of the JSON post description into a format that is suitable for import into the target system, which in our case is Diaspora*.

The following script is a proof-of-concept conversion of a single post file from the takeout archive into a text string that is suitable for upload to a Diaspora* server using the diaspy API.

Images are more challenging and will be handled separately in a later episode. There is also no verification on whether the original post had public visibility and should be re-posted publicly.

The main focus is on the parse_post and format_post methods. The purpose of the parse_post method is to extract the desired information from the JSON representation of a post, while the format_post method uses this data to format the input text needed to create a more or less equivalent post.

While the post content text in the Google+ takeout archive is formatted in pseudo-HTML, Diaspora* posts are formatted in Markdown. In order to convert the HTML input to Markdown output, we can use the html2text Python library.
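
As a minimal sketch of how html2text is used here (the settings mirror the ones in the script below; the sample input string is made up):

#!/usr/bin/env python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop link targets, keep the anchor text
converter.body_width = 0        # disable hard line wrapping

html = 'Hello <b>world</b>, see <a href="https://example.com/">this page</a>'
# Should yield something like: "Hello **world**, see this page"
print(converter.handle(html))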

Given the difference in formatting and conventions, there is really no right or wrong way to reformat each post; it is a matter of choice.

The choices made here are:

  • If the original post contained text, the text is included at the top of the post with minimal formatting and any URL links stripped out. Google+ posts may include +<username> references which may look odd. Hashtags should be automatically re-hashtagified on the new system, as long as it uses the hashtag convention.
  • The post includes a series of static hashtags which identify it as an archived post, re-posted from G+. Additional hashtags can be generated during the parsing process, e.g. to identify re-shares or photos.
  • The original post date and optional community or collection names are included with each post, as we intend to make it obvious that this is a re-posted archive and not a transparent migration.
  • Link attachments are added at the end and should be rendered as a proper link attachment with a preview snippet and image if supported - presumably by using something like the OpenGraph markup annotations of the linked page.
  • We deliberately do not include any data which results from post activity by other users, including likes or re-shares. The only exception is that if a re-shared post includes an external link, this link is included in the post with a "hat tip" to the original poster, using their G+ display name at the time of export.

The functionality to post to Diaspora* is included at this time merely as a demonstration that this can indeed work and is not intended to be used without additional operational safeguards.

#!/usr/bin/env python

import datetime
import json
import sys

import dateutil.parser
import diaspy
import html2text

SERVER = '<your diaspora server URL>'
USERNAME = '<your diaspora username>'
PASSWORD = '<not really a good idea...>'

TOOL_NAME = 'G+ repost'
HASHTAGS = ['repost', 'gplusarchive', 'googleplus', 'gplusrefugees', 'plexodus']


def post_to_diaspora(content, filenames=[]):
  c = diaspy.connection.Connection(pod = SERVER,
                                   username = USERNAME,
                                   password = PASSWORD)
  c.login()
  stream = diaspy.streams.Stream(c)
  stream.post(content, provider_display_name = TOOL_NAME)


def format_post(content, link, hashtags, post_date, post_context):
    output = []

    if content:
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        converter.body_width = 0
        output.append(converter.handle(content))
    
    if hashtags:
        output.append(' '.join(('#' + tag for tag in hashtags)))
        output.append('')

    if post_date:
        output.append('Originally posted on Google+ on %s%s'
                      % (post_date.strftime('%a %b %d, %Y'),
                         ' (' + post_context + ')' if post_context else ''))
        output.append('')

    if link:
        output.append(link)

    return '\n'.join(output)


def parse_post(post_json):
    post_date = dateutil.parser.parse(post_json['creationTime'])
    content = post_json['content'] if 'content' in post_json else ''
    link = post_json['link']['url'] if 'link' in post_json else ''

    hashtags = list(HASHTAGS)  # copy, so appends below do not modify the global list

    # TODO: Dealing with images later...
    if 'album' in post_json or 'media' in post_json:
        hashtags = hashtags + ['photo', 'photography']

    # If a shared post contains a link, extract that link
    # and give credit to original poster.
    if 'resharedPost' in post_json and 'link' in post_json['resharedPost']:
        link = post_json['resharedPost']['link']['url']
        content = content + ' - H/t to ' + post_json['resharedPost']['author']['displayName']
        hashtags.append('reshared')

    acl = post_json['postAcl']
    post_context = ''
    if 'communityAcl' in acl:
        post_context = acl['communityAcl']['community']['displayName']

    return format_post(content, link, hashtags, post_date, post_context)


# ----------------------
filename = sys.argv[1]
post_json = json.load(open(filename))
print(parse_post(post_json))

if len(sys.argv) > 2 and sys.argv[2] == 'repost':
    print ('posting to %s as %s' % (SERVER, USERNAME))
    post_to_diaspora(parse_post(post_json))


Sunday, October 28, 2018

Google+ Migration - Part II: Understanding the Takeout Archive

<- Part I: Takeout

Once the takeout archive has been successfully generated, we can download and extract it to our local disk. At that point we should find a new directory called Takeout, with the Google+ posts located in the following directory: Takeout/Google+ Stream/Posts.

This posts directory contains 3 types of files:
  • A JSON file containing the data for each post
  • Media files of images or videos uploaded and attached to posts, for example in JPG format
  • A metadata file for each media file, in CSV format, with the additional extension .metadata.csv
The filenames are generated as part of the takeout archive generation process with the following conventions: the post filenames are structured as a date in YYYYMMDD format followed by a snippet of the post text, or the word "Post" if there is no text. The media filenames seem to be close to the original names of the files when they were uploaded.

Whenever a filename is not unique, an additional count is added like in these examples:

20180808 - Post(1).json
20180808 - Post(2).json
20180808 - Post(3).json
20180808 - Post.json
cake.jpg
cake(1).jpg


Filenames which contain unicode characters outside the basic ASCII set may not be correctly represented on all platforms and in particular appear corrupted in the .tgz archive. For the cases which I have been able to test, the default .zip encoded archive seems to handle unicode filenames correctly.

Each of the .json post files contains a JSON object with different named sub-objects which can themselves again be objects, lists of objects or elementary types like strings or numbers.

Based on the data which I have been able to analyze from my post archive, the post JSON object contains the following relevant sub-objects:
  • author - information about the user who created the post. In a personal takeout archive, this is always the same user.
  • creationTime and updateTime - timestamps of when the post was originally created or last updated, respectively
  • content - text of the post in HTML like formatting
  • link, album, media or resharedPost etc. - post attachments of a certain type
  • location - location tag associated with post
  • plusOnes - record of users who have "plussed" the post
  • reshares - records of users who have shared the post
  • comments - record of comments, including comment author info as well as comment content
  • resourceName - unique ID of the post (also available for users, media and other objects)
  • postAcl - visibility of Post - e.g. public, part of a community or visible only to some circles or users.
In particular, this list is missing the representation of collections or of other post attachments like polls or events, as there are no examples of these among my posts.

An example JSON for a very simple post consisting of a single unformatted text line, an attached photo and a location tag - without any recorded post interactions:

{
  "url": "https://plus.google.com/+BernhardSuter/posts/hWpzTm3uYe3",
  "creationTime": "2018-07-15 20:43:51+0000",
  "updateTime": "2018-07-15 20:43:51+0000",
  "author": {
   ... 
  },
  "content": "1 WTC",
  "media": {
    "url": "https://lh3.googleusercontent.com/-IkMqxKEbkxs/W0uyBzAdMII/AAAAAAACAUs/uA8EmZCOdZkKGH5PN5Ct_Xj4oaY2ZNX3ACJoC/
                  w2838-h3785/gplus4828396160699980513.jpg",
    "contentType": "image/*",
    "width": 2838,
    "height": 3785,
    "description": "1 WTC",
    "resourceName": "media/CixBRjFRaXBNRUpjaXh6QTRyckdjNE5Nbmx5blVwTTBjd2lIblh3VWFCek0zMA\u003d\u003d"
  },
  "location": {
    "latitude": 40.7331168,
    "longitude": -74.0108977,
    "displayName": "Hudson River Park Trust",
    "physicalAddress": "353 West St, New York, NY 10011, USA"
  },
  "resourceName": "users/117832126248716550930/posts/UgjyOA5tNBvbgHgCoAEC",
  "postAcl": {
    "visibleToStandardAcl": {
      "circles": [{
        "type": "CIRCLE_TYPE_PUBLIC"
      }]
    }
  }
}

JSON is a simple standard data format that can easily be processed programmatically and many supporting libraries already exist. The Python standard library contains a module to parse JSON files and expose the data as native Python data objects to the code for further inspection and processing.

For example, the simple Python program below can be used to determine whether a post has public visibility or not:

#!/usr/bin/python

import json
import sys

def is_public(acl):
  """Return True, if access control object contains the PUBLIC pseudo-circle."""
  if ('visibleToStandardAcl' in acl
      and 'circles' in acl['visibleToStandardAcl']):
    for circle in acl['visibleToStandardAcl']['circles']:
      if circle['type'] == 'CIRCLE_TYPE_PUBLIC':
        return True
  return False

# filter out only the posts which have public visibility.
for filename in sys.argv[1:]:
  post = json.load(open(filename))
  if is_public(post['postAcl']):
    print (filename)


Running this as ./public_posts.py ~/Download/Takeout/Google+\ Stream/Posts/*.json would return the list of only those filenames which contain publicly visible posts. By successfully parsing all the .json files in the archive (i.e. without throwing any errors), we can also convince ourselves that the archive contains data in syntactically valid JSON format.
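
A minimal sketch of such a sanity check could look like this (the script name check_json.py and passing the posts directory as an argument are just suggestions):

#!/usr/bin/env python
# Hypothetical helper (check_json.py): try to parse every post file in the
# given directory and report any file that is not valid JSON.
import glob
import json
import sys

errors = 0
for filename in sorted(glob.glob(sys.argv[1] + '/*.json')):
  try:
    json.load(open(filename))
  except ValueError as e:
    errors += 1
    print('%s: %s' % (filename, e))
print('%d file(s) failed to parse' % errors)

For example: ./check_json.py ~/Download/Takeout/Google+\ Stream/Posts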

Sunday, October 21, 2018

Google+ Migration - Part I: Takeout



For the last 7 years, I have been using Google+ as my primary social sharing site - with
automated link-sharing to Twitter. With Google+ going away, I am looking to migrate my public postings to a new site, where they can be presented in a similar way. As the target for the migration, I have chosen a local community-operated pod of the diaspora* network.

Migrating social media data is particularly challenging. It is by definition an amalgamation of data from different sources: links, re-shares, likes, comments etc. - all potentially created by different users of the original social sharing platform. Also, unlike other data sets (e.g. contact lists, calendars or spreadsheets), there are no established, standardized data formats for exchanging social networking site activity in a platform independent way.

Without being an expert in copyright and data protection law, I am taking a very conservative approach to ownership and consent. Users of the original Google+ site were explicitly ok with the following cross-user interactions from the perspective of my post-stream:

  • re-sharing posts of other users (while respecting the original scope)
  • other users liking ("plussing") my posts
  • other users commenting on my posts

Since none of these other users have ever granted me explicit permission to replicate this content in a new form on another platform, I will only replicate my own original content without any interactions; in addition to public posts, I will also include posts to communities, which I consider public. The purpose of this tutorial is to present some tools and methods that could be used to process and select the data in a different way to implement a different policy.

For the technicalities of the migration, I am making the following assumptions:

  • as input, only rely on data that is contained in the takeout archive. This way the migration could be repeated after the original Google+ site is no longer accessible.
  • use the Python programming language for parsing, processing and re-formatting of the data.
  • use a bot with a Python API-library for diaspora* to repost (very slowly!) to the new target system.
While Python is highly portable, any examples and instructions in this tutorial assume a Unix-like operating system and have been tested in particular on a current Debian GNU/Linux based system.

Ordering Takeout

For over 10 years, a Google team calling itself the Data Liberation Front has been working on the promise that users should be able to efficiently extract any of the data they create online with Google services and take it elsewhere. The resulting service is takeout.google.com.

In order to get an archive suitable for processing, we need to request a takeout archive of the Google+ stream data in JSON format. Here are some basic instructions on how to request a new takeout archive.

For the purpose of this migration, we only need to select "Google+ Stream" in the data selection. However, we need to open the extension panel and select JSON format instead of the default HTML. While the HTML export only contains the information necessary to display each post, the JSON export contains additional metadata like access rights in an easily machine readable format.

Given the high load on the service right now, archive creation for large streams can take a while or be incomplete. We should expect this process to become more reliable again in the next few weeks.


The next step will be to understand the structure of the data in the takeout archive.

Sunday, September 3, 2017

The Internship

During the summer months, our offices are buzzing with young, enthusiastic people from all over the world - a sign that it's intern season.

Internships are the closest that academic professions have to the apprenticeship model still common in Germanic countries. Students get to experience professional life for a few months during semester breaks and learn some practical skills that might improve their perspectives of employment, while employers get to build relationships with some of the most promising students before they officially enter the job market.

Some employers complain that students don't leave university with the exact skillset that they are currently looking for in their entry level applicants. However the most important skills a good university should teach are the ability to reason, to learn and to understand the underlying scientific foundations of a given field. Many of the practical skills needed to excel in a certain profession are best acquired on the job.

Some traditional academic professions like medicine or law have explicit and formal post-graduate training requirements before somebody is allowed to practice independently. For other professions such post-graduate on-the-job training may be voluntary and informal, but it is no less important for the solid mastery of a given profession.

The internship programs offered by most top tier tech companies are an important step in that direction. Most internship programs offer the ability to try out a particular field, industry or employer for a limited time (typically 3-6 months), working under the close guidance and mentorship of a seasoned professional.

For students, internships are a great way to figure out what they want to do after they graduate, get a foot in the door with a potential employer, or live and work for a few months in a different or exotic place, all expenses paid.

And in the end, internships are a great way to smooth the transition from university to professional life as many surveys show internships as the leading source for landing the first job after graduation.

I started my career 25 years ago with an internship opportunity at AT&T Bell Labs, the legendary research lab where many important inventions had been made and where many of my most admired professional celebrities and role models had worked or were still working. Being overly pragmatic and down to earth, I would never have considered applying for jobs overseas or at such an illustrious institution - but for a 6 month internship, certainly why not! That 6 month internship led to a full-time job at Bell Labs and over a decade of working in the US.


Sunday, July 30, 2017

Startup Scene: Why are there no Unicorns in Switzerland?

Since the term was coined a few years ago, unicorns have become the mythical creature of the venture capital industry: privately held (tech) startups with a valuation of more than a billion dollars.

What is a Startup?

While indeed extremely rare (about 200 globally), the concept of unicorns helps to clarify what people instinctively mean when they say "startup" - especially when used as an anglicism in other languages.

The most literal definition is a new company. But by that measure, most new companies are restaurants, gas stations and other small businesses. Or maybe being high-risk? By that definition restaurants qualify as well, as many fail within a year. What about being innovative? Most successful innovation is created by large established organizations which have large R&D budgets and which can often attract the top talent in a given field.

My favorite definition of a "Startup" is the one by Steve Blank:
A startup is an organization formed to search for a repeatable and scalable business model.
This definition emphasizes the potential for significant and potentially very rapid growth, which also implicitly requires taking aim at a market large enough to allow for such growth. Venture capital investors typically look for opportunities to make a 20-30x return on their investment in less than 10 years, to compensate for the failed ventures in their portfolio. Hence most venture capital backed companies have, at least theoretically, a big growth potential.

Swiss Startup Landscape

With Switzerland at the top of the UN Global Innovation Index for the last 7 years, home to a global financial market and some of the world's leading universities, some wonder why its startup scene does not rival other global hot-spots in fame or fortune.

Critics of the Swiss startup landscape often point to an extreme form of European culture which neither values risk taking nor forgives failure, a safe, comfortable and expensive life with plenty of attractive employment opportunities, an overly aggressive wealth taxation, or a lack of B+ round growth capital from venture capital funds.

While there may be truth to all of this, the Swiss startup scene might also be, for better or for worse, an image of the overall Swiss economy, which mostly excels at producing highly specialized global niche-players. The classic success story is a company that is export oriented, with a high-margin, high-value-add product that is prohibitively hard to replicate at the same level of quality. Often a small to mid-sized company can be the world-market leader in its specific narrow domain. Hyper-scaling and reaching for the stars is rarely part of that DNA.

An overvalued currency, high cost structure and small domestic market renders almost any other activity noncompetitive at a global scale - a strange twist on the Dutch disease.

For a mainstream consumer product, the Swiss domestic market is barely the size of New York City or the San Francisco Bay Area and fragmented into several language regions and 26 very much sovereign jurisdictions. For somebody growing up thinking that Zürich or Genève is a large city, it is maybe possible to intellectually rationalize a market of a billion consumers, but intuitively grasping what this could mean is a whole different story.

With many founders, employees, experts, advisers, mentors, role-models or investors coming from a business culture which values focused excellence and reliability, Swiss startups are more likely to tackle hard science-y problems with a highly specialized and limited application domain rather than chasing after the mainstream consumer zeitgeist. They are also more likely to deliver on what they promise.

(image: wikipedia)

Observers of the Swiss startup scene may have to accept that startups around here are a bit sturdier, a bit more down to earth and bit less mythical than they might like. But instead of trying to copy Silicon Valley culture to the letter, it might be helpful to consider what particular environmental advantages Swiss startups have at their disposal, that they can leverage into their own version of success - even if that doesn't involve a billion dollar valuation.

And despite all this, there are - at least according to some lists - currently two Swiss members in the global unicorn club of about 200. Pro-rated to a population of eight million, that's quite a bit above average.


Sunday, May 15, 2016

When you come to a fork() in the code, take it!

Linux is a multi-user, multi-tasking system, which means that even a computer as small as the Raspberry Pi can be used by multiple users simultaneously, with multiple processes executing (seemingly) all at once. For example, here are all the processes currently running for the user pi:

pi@raspberrypi ~ $ ps -fu pi
UID        PID  PPID  C STIME TTY          TIME CMD
pi        4792  4785  0 Mar11 ?        00:00:04 sshd: pi@pts/0   
pi        4793  4792  0 Mar11 pts/0    00:00:04 -bash
pi        6137  6130  0 00:30 ?        00:00:00 sshd: pi@pts/1   
pi        6138  6137  1 00:30 pts/1    00:00:01 -bash
pi        6185  4793  0 00:32 pts/0    00:00:00 tail -f /var/log/messages
pi        6186  6138  0 00:32 pts/1    00:00:00 ps -fu pi

Using a time sharing CPU scheduler and virtual memory, each process on Linux is led to believe that it has the whole computer all to itself, even if in reality the Linux operating system kernel is busy managing resources in the background to maintain this illusion.

Processes are among the most important concepts in Linux. A process is essentially a container for volatile resources like memory, network connections, open file handles etc. and is also associated with at least one thread of program execution. Much of the robustness of Linux is thanks to the containment and isolation which processes provide: when a program crashes, only its process is terminated and cleaned up, and it doesn't bring down the whole system.

Process Management

But how do we create such a process? Well, technically we don't - we fork() it. This means that a new process appears by an existing process making a replica of itself, using the fork() system call. After the fork, the user-space state of both processes is identical, except for the return value of fork(), which indicates whether a process is the original or the copy - called the parent and the child process respectively.

If we have a look at the following example program, fork.c :

#include <stdio.h>
#include <unistd.h>

int main()
{
  int x = 42;

  switch (fork()) {
  case -1:
    perror("fork failed");
    return 1;
    break;
  case 0:
    x = 123;
    printf("this is a new child process:\n");
    printf("  pid=%d, value of x=%d @ memory address 0x%lx\n\n"
, getpid(), x, &x);
    break;
  default:
    sleep(2);
    printf("this is the original parent process:\n");
    printf("  pid=%d, value of x=%d @ memory address 0x%lx\n",
getpid(), x, &x);
    break;
  }
  return 0;
}

Which we can compile with gcc -o fork fork.c and get the following execution:

 pi@raspberrypi ~ $ ./fork
this is a new child process:
  pid=6103, value of x=123 @ memory address 0xbee006d4

this is the original parent process:
  pid=6102, value of x=42 @ memory address 0xbee006d4
pi@raspberrypi ~ $ 

What we can see is that 2 different branches of the switch statement have been executed, but each in its own process. Only the parent process entered the fork() call, but two processes have returned from it. Based on the return code of fork(), they can identify themselves as either the original parent process or a new child copy of it and take different actions based on that.

We can also see that the variable x, which existed before the fork() in the parent, now exists in both processes, even at exactly the same address location in memory! But changes to the variable in one process are not reflected in the other one - even though they appear to share the same memory, they are in fact separate and isolated from each other.

The example below shows the “family tree” of all the processes for user pi at this moment:

pi@raspberrypi ~ $ ps fx
  PID TTY      STAT   TIME COMMAND
 7983 ?        S      0:00 sshd: pi@pts/1   
 7984 pts/1    Ss     0:01  \_ -bash
 8044 pts/1    R+     0:00      \_ ps fx
 7961 ?        S      0:00 sshd: pi@pts/0   
 7962 pts/0    Ss     0:01  \_ -bash
 8042 pts/0    S+     0:00      \_ ./fork
 8043 pts/0    Z+     0:00          \_ [fork] <defunct>

We can see the 2 processes from the fork example, with the child having already exited and being in "zombie" state, waiting for its return code to be collected by the parent. The parent of our fork-parent is a bash shell (see previous tutorial). In fact, bash runs other programs by forking itself and then replacing the executable image of the child with the new command (using the exec() system call), as sketched below. Some processes are attached to a terminal for an interactive user session, still named TTY from the days when most terminal sessions were teletype printer terminals. Others, like the sshd processes, are background processes, also called servers or daemons.
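
As a minimal sketch of this fork-then-exec pattern (written in Python here rather than C, purely for brevity; the command being run is just an example):

#!/usr/bin/env python
import os

# The parent forks a child; the child replaces its executable image with a new
# command; the parent waits for the child to finish - just like a shell does.
pid = os.fork()
if pid == 0:
  # child: replace this process image with the "ls -l" command
  os.execvp('ls', ['ls', '-l'])
else:
  # parent: wait for the child and report its exit status
  _, status = os.waitpid(pid, 0)
  print('child %d exited with status %d' % (pid, os.WEXITSTATUS(status)))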

CPU Time-sharing

We can also see that only one process is ready to run right now - the ps tool itself. All others are sleeping and waiting for some sort of event, for example user input, a timeout or some system resource to become available. Many processes on Linux spend the vast majority of their time waiting for something without using any CPU resources.

pi@raspberrypi ~ $ ps fx
  PID TTY      STAT   TIME COMMAND
  7961 ?        S      0:00 sshd: pi@pts/0   
 7962 pts/0    Ss     0:02  \_ -bash
 8170 pts/0    R+     0:12      \_ yes
 8171 pts/0    R+     0:13      \_ gzip

The above is a nonsensical example of a CPU intensive job, created by running yes | gzip > /dev/null. In this case, there are now 2 processes actively competing for the CPU, which means that the Linux kernel will alternately let each of them execute for a bit before interrupting it and allowing some other active process to take a turn.

For a more dynamic view of the process state, we can also use the top command, which, while running, periodically queries the state of all processes and ranks them by CPU usage or some other metric:

top - 22:29:59 up 18 days,  8:12,  2 users,  load average: 1.70, 1.46, 0.91
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s): 91.6 us,  8.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:    220592 total,   202920 used,    17672 free,    24444 buffers
KiB Swap:   102396 total,       48 used,   102348 free,    95828 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 
 8171 pi        20   0  2244  816  404 R  52.0  0.4   6:49.68 gzip
 8170 pi        20   0  3156  496  428 S  45.9  0.2   5:58.95 yes
 8185 pi        20   0  4652 1432 1028 R   1.6  0.6   0:06.81 top
 7983 pi        20   0  9852 1636  996 S   0.6  0.7   0:02.77 sshd
    1 root      20   0  1840  668  572 S   0.0  0.3   1:04.67 init
...                                                                                                                                        
         
There are currently 5 processes more or less active: yes & gzip doing the busy work, top periodically displaying the process state and sshd sending that output data over SSH to a remote computer.

Virtual Memory

Besides time-sharing the CPU between all the processes which compete for it, the Linux operating system kernel also manages another important resource: main memory.

As we remember from the fork example, both processes seem to access the same address in main memory, but find different values there! What seems like magic is the concept of virtual memory, a crucial component of a multi-process system.

With the help of the Memory Management Unit (MMU), a special component in the CPU hardware, the operating system maps a virtual address space for each process onto the real available memory, creating the illusion that each process has 4 gigabytes of memory (the full range of a 32-bit address) at its disposal, when in reality the entire Raspberry Pi only has 512 megabytes of physical main memory. Given that there were 77 processes in our system, how can 77 times 4 gigabytes add up to 512 megabytes? The trick is: does memory really have to be there if nobody is accessing it?

The system partitions the 4GB addressable memory space into thousands of small segments, called pages. When a process tries to access a particular address, the hardware intercepts the access and lets the OS intervene and quickly put some real memory there, if there isn't any already. This procedure is called a page fault. Depending on what is supposed to be on this page, the operating system has a few options for how to do this. If the page is supposed to be part of the executable binary stored on disk, the OS can simply get an empty page of memory from its pool and fill it with the corresponding data from disk. If the process needs more memory for its dynamic data (e.g. for the heap or stack of the executing program), it just gets an empty page. Things get more tricky when the operating system runs out of empty pages. In this case it will try to take away some rarely used ones from another process - if they were mapped from a file, it can simply throw away the data as it already exists on disk anyway; if it was dynamic data, it has to write the data to a special file, the system swap-file, used for swapping data in and out of main memory.

Swapping is a last resort and often degrades the performance of a system beyond being useful, as disk is so much slower than main memory. But it prevents the system from crashing and allows the administrator to somehow reduce the load.

Fortunately, most processes use a lot less memory than their 4GB address space. Each process contains the static executable code and data mapped from the program file on disk, some regions where it stores its dynamic data (e.g. that variable “x”) and some space to map in shared libraries and other resources. For the rest, the address space can be as empty as outer space.

top or ps can be used to look at the memory state of a process. In the example output of top above, we can see that gzip is currently using 2'244KB of its 4GB address space in some way. Of this, only 816KB are currently mapped into real physical memory, plus another 404KB of memory shared with other processes, e.g. for common shared system libraries.

We can also use ps to show many possible output fields, in particular here major and minor page-faults. Major faults require loading from disk, while for minor ones the data is either volatile or still in memory (e.g. from a previous execution of the same command).

pi@raspberrypi ~ $ ps x -o vsz,rsz,%mem,%cpu,maj_flt,min_flt,cmd
   VSZ   RSZ %MEM %CPU  MAJFL  MINFL CMD
  9852   364  0.1  0.0     32    727 sshd: pi@pts/0   
  6336  1244  0.5  0.0     54  10353 -bash
  9852   356  0.1  0.0     76   1167 sshd: pi@pts/1   
  6292  1264  0.5  0.0    156  11943 -bash
  3172   500  0.2  0.1      4    315 cat /dev/random
  2244   588  0.2  0.1      2    333 gzip
  3508   796  0.3  0.1      3    400 grep --color=auto asdf
  4092   932  0.4  0.0      0    358 ps x -o vsz,rsz,%mem,%cpu,maj_flt,min_flt,cmd
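
The same counters can also be read programmatically from within a process. Here is a minimal Python sketch using the standard resource module (the exact numbers will of course vary from run to run):

#!/usr/bin/env python
import resource

# ru_minflt: minor page faults (no disk I/O), ru_majflt: major page faults.
usage = resource.getrusage(resource.RUSAGE_SELF)
print('minor faults: %d, major faults: %d' % (usage.ru_minflt, usage.ru_majflt))

# Touching a large, freshly allocated buffer typically triggers additional
# minor faults, as new pages are only mapped in on first access.
buf = bytearray(16 * 1024 * 1024)
usage = resource.getrusage(resource.RUSAGE_SELF)
print('after allocating 16MB: minor faults: %d, major faults: %d'
      % (usage.ru_minflt, usage.ru_majflt))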

If we are interested in a summary of process performance metrics of a particular executable, we can also use time (install with sudo apt-get install time). Because it is shadowed by a built-in bash function with the same name, we need to run it with its fully qualified path:

pi@raspberrypi ~ $ /usr/bin/time -v gcc -o fork fork.c
Command being timed: "gcc -o fork fork.c"
User time (seconds): 0.60
System time (seconds): 0.20
Percent of CPU this job got: 53%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6624
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 85
Minor (reclaiming a frame) page faults: 4907
Voluntary context switches: 194
Involuntary context switches: 214
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

We can see that this command only reaches about 50% CPU utilization due to waiting for disk I/O - partially caused by the 85 major page faults which require reading executable code from disk. Running the same command a second time yields about 95% CPU utilization without any major page faults, as the kernel hasn't yet reused those pages since the last run.

A Host by any other Name

In the previous two episodes about IP networking, we have seen a lot of raw addresses and port numbers, because that is how the networking stack operates internally. But this is not how we interact with the Internet in real life. Except for trouble-shooting, we don't typically use raw addresses and IDs but rather names. For example, instead of http://173.194.113.115:80, we would enter http://www.google.com.

In the earliest days of the Internet, people kept a list of name to IP address mappings on each computer connected to the network, similar to everyone having their own copy of a phone book. The remnants of this file still exist today on Linux in /etc/hosts, for some special local default addresses.

pi@raspberrypi ~ $ cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback

127.0.1.1 raspberrypi

Beyond that, it is hardly used for name management except for the smallest networks with only up to a few hosts with static IP addresses.

Domain Name System (DNS)

As the early Internet grew rapidly, maintaining and distributing this static list of addresses to all hosts became too cumbersome and was replaced around 1984 with a more automated system, the Domain Name System (DNS).

There are two Linux tools commonly used to test and troubleshoot DNS issues: host and dig. They are in many ways fairly similar, with host often producing a more terse and to-the-point output, while dig provides more options and output that is closer to the internal DNS data format. For this article we will generally use host whenever possible, even though it is said that real network administrators prefer dig.

pi@raspberrypi ~ $ host www.themagpi.com
www.themagpi.com has address 74.208.151.6
www.themagpi.com has IPv6 address 2607:f1c0:1000:3016:ca5a:fd42:5e1e:9032
www.themagpi.com mail is handled by 10 mx00.1and1.com.
www.themagpi.com mail is handled by 10 mx01.1and1.com.

DNS is essentially a hierarchical and distributed database for names, addresses and a bunch of other resources on the Internet. The DNS system consists of a potentially replicated tree of authoritative name-servers, each of which is responsible for a particular subdomain or sub-organization of the network. Fully qualified DNS hostnames reflect that hierarchy by chaining a list of sub-names separated by dots. For example, www.themagpi.com represents a host called "www" owned by an organization with the sub-domain "themagpi" within the "com" top-level domain initially created for US commercial use.

pi@raspberrypi ~ $ dig any +nostats themagpi.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> any +nostats themagpi.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7925
;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;themagpi.com. IN ANY

;; ANSWER SECTION:
themagpi.com. 85673 IN MX 10 mx01.1and1.com.
themagpi.com. 85673 IN SOA ns51.1and1.com. hostmaster.1and1.com. 2014022701 28800 7200 604800 86400
themagpi.com. 85673 IN MX 10 mx00.1and1.com.
themagpi.com. 85673 IN NS ns52.1and1.com.
themagpi.com. 85673 IN NS ns51.1and1.com.
themagpi.com. 85673 IN A 74.208.151.6
themagpi.com. 85673 IN AAAA 2607:f1c0:1000:3016:ca5a:fd42:5e1e:9032

This example shows a few common DNS resource types for hosts and sub-domains: IPv4 address (A), IPv6 address (AAAA), authoritative name-server (NS), designated email exchange (MX) or zone master information (SOA).

Or, for a more complicated sub-domain hierarchy, consider a host aptly named enlightenment at Christ Church, a constituent college of the University of Oxford, which is part of the British academic and research network under the .uk top-level domain.

pi@raspberrypi ~ $ host enlightenment.chch.ox.ac.uk
enlightenment.chch.ox.ac.uk has address 129.67.123.166
enlightenment.chch.ox.ac.uk mail is handled by 9 oxmail.ox.ac.uk.

At the root of the DNS hierarchy is a set of currently 13 root nameservers which contain information about all the top-level domains in the Internet. The authoritative master for this data is currently operated by the Internet Corporation for Assigned Names and Numbers (ICANN).

In order to look up any hostname in the DNS system, a client only needs to know the address of one or more of the root servers to start the resolution. The query starts at one of the root servers, which returns the addresses of the name servers which are in turn the authoritative source of information about the next sub-domain in the name, until one is reached which finally knows the address of the host we are looking for. In the case of enlightenment.chch.ox.ac.uk we need to ask 4 different servers until we finally reach the one which knows the address (SOA stands for start of authority, the identity of a new authoritative zone):

pi@raspberrypi ~ $ host -t SOA  .
. has SOA record a.root-servers.net. nstld.verisign-grs.com. 2014030701 1800 900 604800 86400
pi@raspberrypi ~ $ host -t SOA  uk
uk has SOA record ns1.nic.uk. hostmaster.nic.uk. 1394217217 7200 900 2419200 172800
pi@raspberrypi ~ $ host -t SOA  ac.uk
ac.uk has SOA record ns0.ja.net. operations.ja.net. 2014030760 28800 7200 3600000 14400
pi@raspberrypi ~ $ host -t SOA  ox.ac.uk
ox.ac.uk has SOA record nighthawk.dns.ox.ac.uk. hostmaster.ox.ac.uk. 2014030772 3600 1800 1209600 900
pi@raspberrypi ~ $ host -t SOA chch.ox.ac.uk
chch.ox.ac.uk has no SOA record

The dig command has a +trace option which allows us to find all the authoritative nameservers in the resolution path:

pi@raspberrypi ~ $ dig +trace www.themagpi.com

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> +trace www.themagpi.com
;; global options: +cmd
. 3599979 IN NS j.root-servers.net.
. 3599979 IN NS b.root-servers.net.
. 3599979 IN NS m.root-servers.net.
. 3599979 IN NS e.root-servers.net.
. 3599979 IN NS g.root-servers.net.
. 3599979 IN NS h.root-servers.net.
. 3599979 IN NS c.root-servers.net.
. 3599979 IN NS i.root-servers.net.
. 3599979 IN NS l.root-servers.net.
. 3599979 IN NS k.root-servers.net.
. 3599979 IN NS a.root-servers.net.
. 3599979 IN NS d.root-servers.net.
. 3599979 IN NS f.root-servers.net.
;; Received 241 bytes from 62.2.17.60#53(62.2.17.60) in 238 ms

com. 172800 IN NS i.gtld-servers.net.
com. 172800 IN NS j.gtld-servers.net.
com. 172800 IN NS d.gtld-servers.net.
com. 172800 IN NS h.gtld-servers.net.
com. 172800 IN NS f.gtld-servers.net.
com. 172800 IN NS e.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS l.gtld-servers.net.
com. 172800 IN NS g.gtld-servers.net.
com. 172800 IN NS k.gtld-servers.net.
com. 172800 IN NS m.gtld-servers.net.
com. 172800 IN NS c.gtld-servers.net.
;; Received 494 bytes from 192.112.36.4#53(192.112.36.4) in 279 ms

themagpi.com. 172800 IN NS ns51.1and1.com.
themagpi.com. 172800 IN NS ns52.1and1.com.
;; Received 110 bytes from 192.5.6.30#53(192.5.6.30) in 198 ms

www.themagpi.com. 86400 IN A 74.208.151.6
;; Received 50 bytes from 217.160.81.164#53(217.160.81.164) in 37 ms

DNS resolution itself happens over UDP or TCP (port 53), and as we can imagine from the previous article, this would require quite a bit of work and messages sent all around the Internet, just to find out the IP address of the host we actually want to connect to.

Fortunately this isn’t usually as complicated and expensive in real life. There are plenty of non-authoritative, caching & recursive-resolution name-servers deployed all around the edge of the Internet, which will do the work for us and remember the result for some time in case somebody asks again.

Most networking applications on Linux are linked against a standard library which contains the name resolver client. This resolver will usually start by looking up a name in the good old /etc/hosts file and otherwise continue by asking the name-servers listed in /etc/resolv.conf.
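
This system resolver is what we implicitly use from any higher-level language as well. As a minimal Python sketch (the hostname is just an example; the lookup goes through /etc/hosts and the configured name-servers exactly like any other application):

#!/usr/bin/env python
import socket

# Resolve a hostname through the system resolver and print each unique address.
seen = set()
for family, _, _, _, sockaddr in socket.getaddrinfo('www.themagpi.com', 80):
  address = sockaddr[0]
  if address in seen:
    continue
  seen.add(address)
  if family == socket.AF_INET:
    print('IPv4 address: %s' % address)
  elif family == socket.AF_INET6:
    print('IPv6 address: %s' % address)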

As we can imagine, a slow or flaky name-server can severely degrade the performance of our Internet experience. We can have a look at the time it takes to resolve certain names, and compare query times from different name-servers - e.g. our default nameserver vs. Google's public DNS nameserver reachable at 8.8.8.8:

pi@raspberrypi ~ $ dig  +stats +noquestion +nocomment www.themagpi.com
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> +stats +noquestion +nocomment www.themagpi.com
;; global options: +cmd
www.themagpi.com. 80817 IN A 74.208.151.6
;; Query time: 36 msec
;; SERVER: 62.2.17.60#53(62.2.17.60)
;; WHEN: Sat Mar  8 22:08:11 2014
;; MSG SIZE  rcvd: 50

pi@raspberrypi ~ $ dig  @8.8.8.8 +stats +noquestion +nocomment www.themagpi.com
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> @8.8.8.8 +stats +noquestion +nocomment www.themagpi.com
; (1 server found)
;; global options: +cmd
www.themagpi.com. 20103 IN A 74.208.151.6
;; Query time: 27 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Mar  8 22:08:51 2014
;; MSG SIZE  rcvd: 50

Dynamic Host Configuration Protocol (DHCP)

As we have seen so far, in order to properly use the Internet, we need an IP address for our local Ethernet interface, we need to know the IP address of the IP gateway to the Internet on our local LAN and we need to know the IP address of at least one name-server willing to provide name resolution.

Most of us who are using a Raspberry Pi with a standard Raspbian image have not configured any of this ourselves and probably didn't even know what these settings were before we started poking around. The system which is commonly used to provide this essential configuration to hosts on a local network is called the Dynamic Host Configuration Protocol (DHCP). The Ethernet interface in the standard Raspbian distribution is configured to run dhclient, a DHCP client implementation for Linux.

Whenever a host is newly connected to a network, it sends out calls for help on a well defined Ethernet broadcast address. If there is a DHCP server listening on the same network, it will respond with the necessary information about how this new host should configure its core network settings. These settings, in particular the address assignment, are only valid for a certain period of time and then need to be renewed, potentially resulting in a different configuration. In DHCP-speak this is called a "lease":

pi@raspberrypi ~ $ cat /var/lib/dhcp/dhclient.eth0.leases 
lease {
  interface "eth0";
  fixed-address 192.168.1.136;
  option subnet-mask 255.255.255.0;
  option routers 192.168.1.1;
  option dhcp-lease-time 86400;
  option dhcp-message-type 5;
  option domain-name-servers 62.2.17.60,62.2.24.162;
  option dhcp-server-identifier 192.168.1.1;
  option domain-name "mydomain.net";
  renew 6 2014/03/08 01:00:42;
  rebind 6 2014/03/08 10:35:34;
  expire 6 2014/03/08 13:35:34;
}

Using DHCP, a network administrator can configure an entire network through a central server instead of having to configure each host as it is connected to the network. Similar to host and domain-names, IP addresses are managed in a distributed and hierarchical fashion, where certain network operators are assigned certain blocks of addresses, which they in turn hand out in smaller blocks to the administrators of sub-networks. Since each address must only exist once in the public Internet, address allocation requires a lot of careful planning, for which protocols like DHCP help administrators to more easily manage addresses at the host level.

Running a local name-server

We have seen that for a typical home network, using the default name-server of the Internet access provider can easily add 10s to 100s of milliseconds of additional latency to each connection setup.

There are many choices of DNS servers on Linux, but probably the best choice for a local cache or a small local network is dnsmasq. It is very easy to administer, has small resource usage and can also act as a DHCP server, which makes it an easy integrated network administration tool for small networks, like a home network with just a few hosts and an Internet connection.

Configuring dnsmasq as a simple local caching name-server is as simple as installing it with sudo apt-get install dnsmasq and testing it:

pi@raspberrypi ~ $ dig  @localhost +stats +noquestion +nocomment www.themagpi.com
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> @localhost +stats +noquestion +nocomment www.themagpi.com
; (2 servers found)
;; global options: +cmd
www.themagpi.com. 82234 IN A 74.208.151.6
;; Query time: 8 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sat Mar  8 23:16:48 2014
;; MSG SIZE  rcvd: 50

And we get sub-10ms query times for cached addresses. In its default configuration, dnsmasq forwards all requests it has not yet cached to the default name-servers configured in /etc/resolv.conf, which in our case are set by the DHCP client. We can now enable the local DNS cache to be used as the new default for the local resolver by adding the line prepend domain-name-servers 127.0.0.1 to the dhclient config file in /etc/dhcp/dhclient.conf. This will put our local server in the first and default position in /etc/resolv.conf, and dnsmasq is smart enough to ignore itself as a forwarder in order not to create an infinite forwarding loop.

Conclusion

As we have seen, name resolution at Internet scale requires a complex machinery which kicks into action each time we type a URL into the browser navigation bar. The Domain Name System is a critical and sometimes political part of the Internet infrastructure. Invisible to the user, a slow or flaky DNS server can severely degrade the performance we experience on the Internet. Sometimes it is not the download itself that is slow, but resolving the name of the server before the download can even start. Relying on the DNS infrastructure also requires a great deal of trust, as compromised DNS servers could easily redirect traffic to a completely different server.