Sunday, May 15, 2016

When you come to a fork() in the code, take it!

Linux is a multi-user, multi-tasking system, which means that even a computer as small as the Raspberry Pi can be used by multiple users simultaneously, and there can be multiple processes executing (seemingly) all at once. For example, here are all the processes currently running for the user pi:

pi@raspberrypi ~ $ ps -fu pi
pi        4792  4785  0 Mar11 ?        00:00:04 sshd: pi@pts/0   
pi        4793  4792  0 Mar11 pts/0    00:00:04 -bash
pi        6137  6130  0 00:30 ?        00:00:00 sshd: pi@pts/1   
pi        6138  6137  1 00:30 pts/1    00:00:01 -bash
pi        6185  4793  0 00:32 pts/0    00:00:00 tail -f /var/log/messages
pi        6186  6138  0 00:32 pts/1    00:00:00 ps -fu pi

Using a time sharing CPU scheduler and virtual memory, each process on Linux is led to believe that it has the whole computer all to itself, even if in reality the Linux operating system kernel is busy managing resources in the background to maintain this illusion.

Processes are among the most important concepts in Linux. A process is essentially a container for volatile resources like memory, network connections, open file handles etc. and is also associated with at least one thread of program execution. Much of the robustness of Linux is thanks to the containment and isolation which processes provide: when a program crashes, only its process is terminated and cleaned up, and it doesn’t bring down the whole system.

Process Management

But how do we create such a process? Well, technically we don’t - we fork() it. This means that a new process appears when an existing process makes a replica of itself using the fork() system call. After the fork, the user-space state of both processes is identical, except for the return value of fork, which indicates whether a process is the original or the copy. These are called the parent and child process respectively.

If we have a look at the following example program, fork.c :

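#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int x = 42;  /* exists before the fork, so both processes get a copy */

  switch (fork()) {
  case -1:  /* fork failed, no child was created */
    perror("fork failed");
    return 1;
  case 0:   /* fork returned 0: we are the child */
    x = 123;
    printf("this is a new child process:\n");
    printf("  pid=%d, value of x=%d @ memory address %p\n\n",
           getpid(), x, (void *)&x);
    break;
  default:  /* fork returned the child's pid: we are the parent */
    printf("this is the original parent process:\n");
    printf("  pid=%d, value of x=%d @ memory address %p\n",
           getpid(), x, (void *)&x);
    break;
  }
  return 0;
}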

We can compile it with gcc -o fork fork.c and get the following execution:

pi@raspberrypi ~ $ ./fork
this is a new child process:
  pid=6103, value of x=123 @ memory address 0xbee006d4

this is the original parent process:
  pid=6102, value of x=42 @ memory address 0xbee006d4
pi@raspberrypi ~ $ 

What we can see is that 2 different branches of the switch statement have been executed, but each in its own process. Only one process, the parent, entered the fork call, but two processes returned from it. Based on the return value of fork(), they can identify themselves as either the original parent process or a new child copy of it and take different actions based on that.

We can also see that the variable x, which existed before the fork() in the parent, now exists in both processes, even at exactly the same address location in memory! But changes to the variable in one process are not reflected in the other one - even though they appear to share the same memory, they are in fact separate and isolated from each other.

The example below shows the “family tree” of all the processes for user pi at this moment:

pi@raspberrypi ~ $ ps fx
 7983 ?        S      0:00 sshd: pi@pts/1   
 7984 pts/1    Ss     0:01  \_ -bash
 8044 pts/1    R+     0:00      \_ ps fx
 7961 ?        S      0:00 sshd: pi@pts/0   
 7962 pts/0    Ss     0:01  \_ -bash
 8042 pts/0    S+     0:00      \_ ./fork
 8043 pts/0    Z+     0:00          \_ [fork] <defunct>

We can see the 2 processes from the fork example, with the child having already exited and being in “zombie” state, waiting for its return code to be collected by the parent. The parent of our fork-parent is a bash shell (see previous tutorial). In fact, bash runs other programs by forking itself and then replacing the executable image of the child with the new command (using the exec() system call). Some processes are attached to a terminal for an interactive user session, still named TTY from the days when most terminal sessions ran on teletype printer terminals. Others, like the sshd processes, are background processes, also called servers or daemons.
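The fork-then-exec pattern described above, together with collecting the child's return code so that it does not linger as a zombie, can be sketched roughly as follows. This is a minimal illustration and not how bash is actually implemented; the function name run_command is made up for this example.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a program roughly the way a shell does: fork, exec in the child,
 * wait in the parent. Returns the child's exit status, or -1 on error.
 * run_command is a hypothetical helper name used for illustration. */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork failed");
        return -1;
    }
    if (pid == 0) {
        /* Child: replace this process image with the new program. */
        execvp(argv[0], argv);
        /* execvp only returns if it failed. */
        perror("exec failed");
        _exit(127);
    }
    /* Parent: collect the exit status so the child does not stay a zombie. */
    int status;
    if (waitpid(pid, &status, 0) == -1) {
        perror("waitpid failed");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called from a main() function with an argument vector such as {"ls", "-l", NULL}, this runs ls as a child process and returns only after the child has exited and been cleaned up.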

CPU Time-sharing

We can also see that only one process is ready to run right now - the ps tool itself. All others are sleeping and waiting for some sort of event, for example user input, a timeout or some system resource to become available. Many processes on Linux spend the vast majority of their time waiting for something without using any CPU resources.

pi@raspberrypi ~ $ ps fx
  7961 ?        S      0:00 sshd: pi@pts/0   
 7962 pts/0    Ss     0:02  \_ -bash
 8170 pts/0    R+     0:12      \_ yes
 8171 pts/0    R+     0:13      \_ gzip

The above is a nonsensical example of a CPU-intensive job, created by running yes | gzip > /dev/null. In this case, there are now 2 processes actively competing for the CPU, which means that the Linux kernel will alternately let them execute for a bit before interrupting them and allowing some other active process to take a turn.

For a more dynamic view of the process state, we can also use the top command, which while running periodically queries the state of all processes and ranks them by top CPU usage or some other metric:

top - 22:29:59 up 18 days,  8:12,  2 users,  load average: 1.70, 1.46, 0.91
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s): 91.6 us,  8.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:    220592 total,   202920 used,    17672 free,    24444 buffers
KiB Swap:   102396 total,       48 used,   102348 free,    95828 cached

 8171 pi        20   0  2244  816  404 R  52.0  0.4   6:49.68 gzip
 8170 pi        20   0  3156  496  428 S  45.9  0.2   5:58.95 yes
 8185 pi        20   0  4652 1432 1028 R   1.6  0.6   0:06.81 top
 7983 pi        20   0  9852 1636  996 S   0.6  0.7   0:02.77 sshd
    1 root      20   0  1840  668  572 S   0.0  0.3   1:04.67 init

There are currently 4 processes more or less active: yes and gzip doing busy work, top periodically displaying the process state and sshd sending that output over SSH to a remote computer.

Virtual Memory

Besides time-sharing the CPU between all the processes which compete for it, the Linux operating system kernel also manages another important resource: main memory.

As we remember from the fork example, both processes seem to access the same address in main memory, but find different values there! What seems like magic is the concept of virtual memory, a crucial component of a multi-process system.

With the help of the Memory Management Unit (MMU), a special component in the CPU hardware, the operating system maps a virtual address space for each process to the real available memory, creating the illusion that each process has 4 gigabytes of memory (the full range of a 32-bit address) at its disposal, when in reality the entire Raspberry Pi only has 512 megabytes of physical main memory. Given there were 77 processes in our system, how can 77 times 4 gigabytes add up to 512 megabytes? The trick is, does memory really have to be there if nobody is accessing it?

The system partitions the 4GB addressable memory space into about a million small chunks of 4KB each, called pages. When a process tries to access an address on a page which has no real memory behind it yet, the hardware intercepts the access and lets the OS intervene and quickly put some real memory there. This event is called a page fault. Depending on what is supposed to be on this page, the operating system has a few options for how to do this. If the page is supposed to be part of the executable binary stored on disk, the OS can simply get an empty page of memory from its pool and fill it with the corresponding data from disk. If the process needs more memory for its dynamic data (e.g. for the heap or stack of the executing program), it just gets an empty page.

Things get more tricky when the operating system runs out of empty pages. In this case it will try to take away some rarely used ones from another process. If they were mapped from a file, it can simply throw away the data as it already exists on disk anyway. If they hold dynamic data, it has to write the data to a special file, called the system swap-file, which is used for swapping data in and out of main memory.
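To make the page arithmetic concrete, here is a small sketch which splits a 32-bit virtual address into a page number and an offset within that page. It assumes a page size of 4096 bytes, which is the value /usr/bin/time reports on this system; the function names are made up for this example.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size in bytes */

/* Index of the page which contains a 32-bit virtual address. */
uint32_t page_number(uint32_t addr)
{
    return addr / PAGE_SIZE;
}

/* Offset of the address within its page. */
uint32_t page_offset(uint32_t addr)
{
    return addr % PAGE_SIZE;
}

/* Number of pages in a full 32-bit address space. */
uint64_t pages_in_address_space(void)
{
    return (UINT64_C(1) << 32) / PAGE_SIZE;
}
```

For the address 0xbee006d4 of the variable x in the fork example, this gives page number 0xbee00 and offset 0x6d4, and the full 4GB address space works out to 1,048,576 pages.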

Swapping is a last resort and often degrades the performance of a system beyond being useful, as disk is so much slower than main memory. But it prevents the system from crashing and allows the administrator to somehow reduce the load.

Fortunately, most processes use a lot less memory than their 4GB address space. Each process contains the static executable code and data mapped from the program file on disk, some regions where it stores its dynamic data (e.g. that variable “x”) and some space to map in shared libraries and other resources. For the rest, the address space can be as empty as outer space.

top or ps can be used to look at the memory state of a process. In the example output of top above, we can see that gzip is currently using in some way 2,244KB of its 4GB address space, of which only 816KB are currently mapped into real physical memory. Of those, 404KB can be shared with other processes, e.g. for using common shared system libraries.

We can also use ps to show many possible output fields, in particular here major and minor page-faults. Major faults require loading from disk, while for minor ones the data is either volatile or still in memory (e.g. from a previous execution of the same command).

pi@raspberrypi ~ $ ps x -o vsz,rsz,%mem,%cpu,maj_flt,min_flt,cmd
  9852   364  0.1  0.0     32    727 sshd: pi@pts/0   
  6336  1244  0.5  0.0     54  10353 -bash
  9852   356  0.1  0.0     76   1167 sshd: pi@pts/1   
  6292  1264  0.5  0.0    156  11943 -bash
  3172   500  0.2  0.1      4    315 cat /dev/random
  2244   588  0.2  0.1      2    333 gzip
  3508   796  0.3  0.1      3    400 grep --color=auto asdf
  4092   932  0.4  0.0      0    358 ps x -o vsz,rsz,%mem,%cpu,maj_flt,min_flt,cmd

If we are interested in a summary of process performance metrics of a particular executable, we can also use time (install with sudo apt-get install time). Because it is shadowed by a bash keyword of the same name, we need to run it with its fully qualified path:

pi@raspberrypi ~ $ /usr/bin/time -v gcc -o fork fork.c
Command being timed: "gcc -o fork fork.c"
User time (seconds): 0.60
System time (seconds): 0.20
Percent of CPU this job got: 53%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6624
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 85
Minor (reclaiming a frame) page faults: 4907
Voluntary context switches: 194
Involuntary context switches: 214
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

We can see that this command only reaches about 50% CPU utilization due to waiting for disk I/O - partially caused by the 85 major page faults which require reading executable code in from disk. Running the same command a second time yields 95% CPU utilization without any major page faults, as the kernel hasn’t yet reused the pages from the last time.
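The CPU percentage in the output above can be reproduced from the other figures: it is the time spent on the CPU (user plus system) divided by the elapsed wall-clock time. A minimal sketch of that arithmetic, with a made-up function name:

```c
/* Share of wall-clock time a job spent on the CPU, in percent.
 * cpu_percent is a hypothetical helper name used for illustration. */
double cpu_percent(double user_s, double system_s, double elapsed_s)
{
    return 100.0 * (user_s + system_s) / elapsed_s;
}
```

With the values from the gcc run, (0.60 + 0.20) / 1.49 gives about 53.7%, which matches the reported 53% once the fraction is dropped. The remaining time was spent waiting, mostly for disk I/O.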

A Host by any other Name

In the previous two episodes about IP networking, we have seen a lot about raw addresses and port-numbers, because that is how the networking stack operates internally. But this is not how we interact with the Internet in real life. Except for trouble-shooting, we don’t typically use raw addresses and IDs but rather names. For example, instead of, we would enter

In the earliest days of the Internet, people kept a list of name to IP address mappings on each computer connected to the network, similar to everybody having a copy of the phone book. The remnants of this file still exist today on Linux in /etc/hosts for some special local default addresses.

pi@raspberrypi ~ $ cat /etc/hosts localhost
::1 localhost ip6-localhost ip6-loopback raspberrypi

Beyond that, it is hardly used for name management except for the smallest networks with only up to a few hosts with static IP addresses.

Domain Name System (DNS)

As the early Internet grew rapidly, maintaining and distributing this static list of addresses to all hosts became too cumbersome and was replaced around 1984 with a more automated system, the Domain Name System (DNS).

There are two Linux tools commonly used to test and troubleshoot DNS issues: host and dig. They are fairly similar in many ways, with host often having terser and more to-the-point output, while dig provides more options and its output is closer to the internal DNS data format. For this article we will generally use host whenever possible, even though it is said that real network administrators prefer dig.

pi@raspberrypi ~ $ host
has address
has IPv6 address 2607:f1c0:1000:3016:ca5a:fd42:5e1e:9032
mail is handled by 10
mail is handled by 10

DNS is essentially a hierarchical and distributed database for names, addresses and a bunch of other resources on the Internet. The DNS consists of a potentially replicated tree of authoritative name-servers, each of which is responsible for a particular subdomain or sub-organization of the network. Fully qualified DNS hostnames reflect that hierarchy by chaining a list of sub-names separated by dots. For example, represents a host called “www” owned by an organization with sub-domain “themagpi” within the top-level domain initially created for US commercial use.

pi@raspberrypi ~ $ dig any +nostats

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> any +nostats
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7925
;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 0


;; ANSWER SECTION:
85673 IN MX 10
85673 IN SOA 2014022701 28800 7200 604800 86400
85673 IN MX 10
85673 IN NS
85673 IN NS
85673 IN A
85673 IN AAAA 2607:f1c0:1000:3016:ca5a:fd42:5e1e:9032

This example shows a few common DNS resource types for hosts and sub-domains: IPv4 address (A), IPv6 address (AAAA), authoritative name-server (NS), designated email exchange (MX) and zone master information (SOA).

Or for a more complicated sub-domain hierarchy with a host aptly named enlightenment at Christ Church, a constituent college of the University of Oxford, which is part of the British academic and research network under the .uk top-level domain.

pi@raspberrypi ~ $ host
has address
mail is handled by 9

At the root of the DNS hierarchy is a set of currently 13 root nameservers which contain information about all the top-level domains in the Internet. The authoritative master for this data is currently operated by the Internet Corporation for Assigned Names and Numbers (ICANN).

In order to look up any hostname in the DNS, a client only needs to know the address of one or more of the root servers to start the resolution. The query starts at one of the root servers, which returns the addresses of the name servers which are in turn the authoritative source of information about the next sub-domain in the name, until one is reached which finally knows the address of the host we are looking for. In the case of we need to ask 4 different servers until we finally reach the one which knows the address (SOA stands for start of authority, the identity of a new authoritative zone):
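The chain of zones visited during such a resolution follows directly from the dots in the name. As a rough sketch, the helper below returns the part of a hostname made up of its last few labels, which is the zone a resolver would be asking about at each step. Both the function name zone_suffix and the hostname www.example.com are hypothetical and only used for illustration, and the name is assumed to be written without a trailing dot.

```c
#include <string.h>

/* Return the suffix of `name` made of its last `labels` dot-separated
 * labels (labels >= 1), e.g. 2 labels of "www.example.com" give
 * "example.com". Returns `name` itself if it has no more than `labels`
 * labels. zone_suffix is a hypothetical helper name. */
const char *zone_suffix(const char *name, int labels)
{
    const char *p = name + strlen(name);

    /* Walk backwards and count the dots separating the labels. */
    while (p > name) {
        p--;
        if (*p == '.' && --labels == 0)
            return p + 1;
    }
    return name;
}
```

Starting from the root and calling this with 1, 2 and 3 labels for www.example.com yields com, example.com and www.example.com, the same top-down order in which the SOA queries are issued.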

pi@raspberrypi ~ $ host -t SOA  .
. has SOA record 2014030701 1800 900 604800 86400
pi@raspberrypi ~ $ host -t SOA  uk
uk has SOA record 1394217217 7200 900 2419200 172800
pi@raspberrypi ~ $ host -t SOA
has SOA record 2014030760 28800 7200 3600000 14400
pi@raspberrypi ~ $ host -t SOA
has SOA record 2014030772 3600 1800 1209600 900
pi@raspberrypi ~ $ host -t SOA
has no SOA record

The dig command has a +trace option which allows us to find all the authoritative nameservers in the resolution path:

pi@raspberrypi ~ $ dig +trace

; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> +trace
;; global options: +cmd
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
. 3599979 IN NS
;; Received 241 bytes from in 238 ms

com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
com. 172800 IN NS
;; Received 494 bytes from in 279 ms

172800 IN NS
172800 IN NS
;; Received 110 bytes from in 198 ms

86400 IN A
;; Received 50 bytes from in 37 ms

DNS resolution itself happens over UDP or TCP (port 53) and, as we can imagine from the previous article, this would require quite a bit of work and many messages sent all around the Internet, just to find out the IP address of the host we actually want to connect to.

Fortunately this isn’t usually as complicated and expensive in real life. There are plenty of non-authoritative, caching & recursive-resolution name-servers deployed all around the edge of the Internet, which will do the work for us and remember the result for some time in case somebody asks again.

Most networking applications on Linux are linked to a standard library which contains the name resolver client. This resolver will usually start by looking in the good old /etc/hosts file for a name and otherwise continue by asking the name-servers listed in /etc/resolv.conf.
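From a program's point of view, this resolver is usually reached through the standard getaddrinfo() library call. The following is a minimal sketch, restricted to IPv4 to keep it short, of how an application might turn a name into an address. The function name resolve_ipv4 is made up for this example.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolve `name` through the system resolver and write the first IPv4
 * address found into `buf` as dotted-decimal text. Returns 0 on success,
 * -1 on failure. resolve_ipv4 is a hypothetical helper name. */
int resolve_ipv4(const char *name, char *buf, size_t buflen)
{
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return -1;

    /* Take the first result and convert it to text. */
    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    const char *ok = inet_ntop(AF_INET, &sin->sin_addr, buf, (socklen_t)buflen);
    freeaddrinfo(res);
    return ok ? 0 : -1;
}
```

The caller provides a buffer of at least INET_ADDRSTRLEN bytes. Whether the answer comes from /etc/hosts or from a name-server is decided by the resolver configuration and is invisible to the application.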

As we can imagine, a slow or flaky name-server can severely degrade the performance of our Internet experience. We can have a look at the time it takes to resolve certain names, and compare query times from different name-servers - e.g. our default nameserver vs. a Google public DNS nameserver reachable at

pi@raspberrypi ~ $ dig  +stats +noquestion +nocomment
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> +stats +noquestion +nocomment
;; global options: +cmd
80817 IN A
;; Query time: 36 msec
;; WHEN: Sat Mar  8 22:08:11 2014
;; MSG SIZE  rcvd: 50

pi@raspberrypi ~ $ dig  @ +stats +noquestion +nocomment
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> @ +stats +noquestion +nocomment
; (1 server found)
;; global options: +cmd
20103 IN A
;; Query time: 27 msec
;; WHEN: Sat Mar  8 22:08:51 2014
;; MSG SIZE  rcvd: 50

Dynamic Host Configuration Protocol (DHCP)

As we have seen so far, in order to properly use the Internet, we need an IP address for our local Ethernet interface, we need to know the IP address of the IP gateway to the Internet on our local LAN and we need to know the IP address of at least one name-server willing to provide name resolution.

Most of us who are using a Raspberry Pi with a standard Raspbian image have not configured all these by ourselves and probably didn’t even know what they are before we started poking around. The system which is commonly used to provide the essential configuration to hosts on a local network is called Dynamic Host Configuration Protocol (DHCP). The Ethernet interface in the standard Raspbian distribution is configured to run dhclient, a DHCP client implementation for Linux.

Whenever a host is newly connected to a network, it sends out calls for help on a well-defined Ethernet broadcast address. If there is a DHCP server listening on the same network, it will respond with the necessary information about how this new host should configure its core network settings. These settings, in particular the address assignment, are only valid for a certain period of time and then need to be renewed, potentially resulting in a different configuration. In DHCP-speak this is called a “lease”:

pi@raspberrypi ~ $ cat /var/lib/dhcp/dhclient.eth0.leases 
lease {
  interface "eth0";
  option subnet-mask;
  option routers;
  option dhcp-lease-time 86400;
  option dhcp-message-type 5;
  option domain-name-servers,;
  option dhcp-server-identifier;
  option domain-name "";
  renew 6 2014/03/08 01:00:42;
  rebind 6 2014/03/08 10:35:34;
  expire 6 2014/03/08 13:35:34;
}

Using DHCP, a network administrator can configure an entire network through a central server instead of having to configure each host as it is connected to the network. Similar to host and domain-names, IP addresses are managed in a distributed and hierarchical fashion, where certain network operators are assigned certain blocks of addresses, which they in turn hand out in smaller blocks to the administrators of sub-networks. Since each address must only exist once in the public Internet, address allocation requires a lot of careful planning, and protocols like DHCP help administrators to more easily manage addresses at the host level.

Running a local name-server

We have seen that for a typical home network, using the default name-server of the Internet access provider can easily add tens to hundreds of milliseconds of additional latency to each connection setup.

There are many choices of DNS servers on Linux but probably the best choice for a local cache or a small local network would be dnsmasq. It is very easy to administer, has a small resource usage and can also act as a DHCP server, which makes it an easy integrated network administration tool for small networks, like a home network with just a few hosts and an Internet connection.

Configuring dnsmasq as a simple local caching name-server is as simple as installing it with sudo apt-get install dnsmasq and testing it:

pi@raspberrypi ~ $ dig  @localhost +stats +noquestion +nocomment
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> @localhost +stats +noquestion +nocomment
; (2 servers found)
;; global options: +cmd
82234 IN A
;; Query time: 8 msec
;; WHEN: Sat Mar  8 23:16:48 2014
;; MSG SIZE  rcvd: 50

And we get sub-10ms query times for cached addresses. In its default configuration, dnsmasq forwards all requests it has not yet cached to the default name-server configured in /etc/resolv.conf, which in our case is set by the DHCP client. We can now enable the local DNS cache to be used as the new default for the local resolver by adding the line prepend domain-name-servers to the dhclient config file in /etc/dhcp/dhclient.conf. This will put our local server in first and default position in /etc/resolv.conf, and dnsmasq is smart enough to ignore itself as a forwarder in order not to create an infinite forwarding loop.


As we have seen, name resolution at Internet scale requires a complex machinery which kicks into action each time we type a URL into the browser navigation bar. The Domain Name System is a critical and sometimes political part of the Internet infrastructure. Invisible to the user, a slow or flaky DNS server can severely degrade the performance we experience on the Internet. Sometimes it is not a download itself that is slow, but resolving the name of the server before the download can even start. Relying on the DNS infrastructure also requires a great deal of trust, as compromised DNS servers could easily redirect traffic to a completely different server.

A Matter of Protocol

Today, we are taking a look at the TCP/IP Internet protocols by using a few commands which allow us to see what is going on in the Linux kernel, all the way down to the Ethernet wire.

For example, the basic ping command we have seen in a previous post might result in the following sequence of packets sent on the Ethernet port of the Raspberry Pi:

ARP, Request who-has tell, length 28
ARP, Reply is-at c4:2c:03:1c:f2:2e, length 46
IP > ICMP echo request, id 17785, seq 1, length 64
IP > ICMP echo reply, id 17785, seq 1, length 64
IP > ICMP echo request, id 17785, seq 2, length 64
IP > ICMP echo reply, id 17785, seq 2, length 64

But before we try to understand what is going on here, let us take a brief excursion into the history and theory of packet networking.

In the 1960s, researchers were starting to think of packet switching as a more economical and resilient alternative to connecting computers directly to each other with wires. The idea was to break up communications into small blocks or “packets” and send them loosely and independently across a packet-switched network to a destination which then interprets them again as a stream of communication. This is similar to how the postal mail system worked for centuries, and one could imagine that if the postal system were fast enough, it would be possible to hold a conversation in real time by sending a series of short letters back and forth really fast. In fact, nearly all telephone conversations are transmitted today in such a way over packet-switched networks, without any noticeable impact to the user.

When many mutually incompatible implementations of packet-based computer networks started to emerge through the 1970s, the International Organization for Standardization (ISO) attempted to define an open and global architecture that should enable a single world-wide computer network - similar to the already existing telephone system.

Over the following decades, the growing popularity of the much simpler Internet Protocol (IP) suite made this dream a reality in the form of the global Internet. Network engineers nonetheless use the comprehensive 7 layer Open Systems Interconnection (OSI) reference model to reason and talk about computer network architecture.

The OSI model defines a 7 layer stack of protocols, each communicating logically with peers at its layer and each providing a specific function:

  1. The Physical Layer deals with defining electrical signals over wires, modulations, radio-frequency spectrum, properties of lasers and fiber optic cables and so on.
  2. The Link Layer is concerned with creating a local network connection between 2 or more systems using a single physical medium. The most popular families of link-layers are point-to-point (e.g. for fiber optic cables) and shared or multiple-access networks like Ethernet, which is supported by the Raspberry Pi model B. Many of those are defined in the 802.X family of standards issued by the Institute of Electrical and Electronics Engineers (IEEE).
  3. The purpose of the Network Layer is to provide a uniform and universal way to address and reach any connected node globally no matter where it is and what kind of (link-layer) network technology it is connected to. It fuses several heterogeneous and mutually incompatible networks into a single “meta” or “Inter”-network. The Internet Protocol (IP) proposed in 1980 for this purpose is what gave the Internet its name.
  4. The Transport Layer is responsible for getting a logical stream of data somewhat reliably across the underlying network. The most popular implementations are the Transmission Control Protocol (TCP), providing a reliable, byte-stream service, and the User Datagram Protocol (UDP), which provides more direct access to the unreliable, packet-based service of an IP network.
  5. The Session Layer could be combined with the transport layer, e.g. in the case of TCP which requires a strong session concept. Some cases of session management and multiplexing on top of a single transport stream, e.g. HTTP/1.1 or SPDY, or explicit session management and control protocols for real-time streaming like SIP or RTCP, could be considered as session layer functions.
  6. The Presentation Layer deals with data representation and encoding. This could be things like ASCII or Unicode text encoding, HTML, XML, JSON etc.
  7. The Application Layer provides the use-case and purpose of an entire specialized protocol stack. This could for example be file-transfer (FTP or RSYNC), remote login (telnet or ssh), email, web service and many more.

In the simplified view of the TCP/IP or Internet protocol suite, layers 2, 3 & 4 are clearly separated and part of the Linux kernel itself, while the upper layers are much more fuzzy and potentially part of complex user-space applications - e.g. a web-server or browser implementation.

The trace at the beginning of the article was generated with a tool called tcpdump. It allows us to capture and decode all traffic that is sent and received by a link-layer interface in the kernel.

WARNING: Intercepting traffic from other users on shared networks might be considered unlawful in some countries or against policy in some organizations (schools, universities or companies). If this situation might apply to you, please check with your network administrator first before starting to play with tcpdump!

The tcpdump tool is typically not pre-installed in Raspbian, but we can easily get it with sudo apt-get install tcpdump.

The above trace is then generated by running

pi@raspberrypi ~ $ sudo tcpdump -nt  host and not tcp port 22

in one window and for example

pi@raspberrypi ~ $ ping

in another. The command line options -n and -t suppress name resolution and timestamps respectively and just help to make the output a bit more compact for the example. tcpdump supports many types of filters which allow us to restrict what is being captured and decoded to a particular protocol type, application type or address range.

In particular, when running tcpdump over an ssh session on the same network we are trying to capture from, we could easily create a nasty feedback loop which generates an infinite amount of output. Restricting the capture to our local interface address and excluding any traffic on port 22, which is the ssh protocol, allows us to look at all network traffic generated by the Raspberry Pi except for the ssh sessions.

But how do we find what our own network addresses are? For that, the ifconfig command provides a lot of useful information about the layer 1-3 state of all our network interfaces:

pi@raspberrypi ~ $ ifconfig -a
eth0      Link encap:Ethernet  HWaddr b8:27:eb:13:f7:57
          inet addr:  Bcast:  Mask:
          RX packets:17038947 errors:8 dropped:7 overruns:0 frame:7
          TX packets:11258008 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3037900297 (2.8 GiB)  TX bytes:1335045310 (1.2 GiB)

lo        Link encap:Local Loopback
          inet addr:  Mask:
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:66130 errors:0 dropped:0 overruns:0 frame:0
          TX packets:66130 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19241026 (18.3 MiB)  TX bytes:19241026 (18.3 MiB)

From that we can learn that:

  - we have 2 network interfaces: eth0 and lo
  - at the link layer, eth0 is an Ethernet interface (802.3), while lo is a virtual dummy/loopback interface

Focusing a bit more on the Ethernet interface we are interested in:

  - it has a HWaddr of b8:27:eb:13:f7:57, which is the MAC (Media Access Control) address used by hosts on IEEE 802.X multi-access networks (like Ethernet, Wi-Fi and others) to communicate with each other on the local L2 network. This address is assigned to each 802.X compatible interface by the hardware manufacturer as a kind of globally unique serial number. The prefix B8:27:EB is assigned to the Raspberry Pi Foundation to generate all its necessary addresses for the model B Ethernet ports.
  - the MTU or Maximum Transmission Unit on this interface is 1500, which is the default for Ethernet. This means that no IP packet larger than 1500 bytes (including the IP and higher-layer headers, but not the Ethernet frame header) can be sent on this interface.
  - and finally the answer we were looking for: the L3/IP network address of this interface is - with a network mask of, which means that any address between and is assumed to be a host on the same Ethernet L2 network.
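The rule behind the network mask can be written down in a few lines: two addresses belong to the same local network if they are identical in all the bits selected by the mask. The sketch below uses the standard inet_pton() call to parse the addresses; the function name same_subnet is made up for this example, and any addresses passed to it (such as 192.168.1.10 with a mask of 255.255.255.0) are hypothetical example values, not the ones from the output above.

```c
#include <arpa/inet.h>
#include <sys/socket.h>

/* Returns 1 if both dotted-decimal IPv4 addresses fall into the same
 * subnet under `mask`, 0 if not, and -1 if any argument fails to parse.
 * same_subnet is a hypothetical helper name. */
int same_subnet(const char *a, const char *b, const char *mask)
{
    struct in_addr ia, ib, im;

    if (inet_pton(AF_INET, a, &ia) != 1 ||
        inet_pton(AF_INET, b, &ib) != 1 ||
        inet_pton(AF_INET, mask, &im) != 1)
        return -1;

    /* Two hosts share a subnet when their network parts are identical. */
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}
```

With a mask of 255.255.255.0, for instance, 192.168.1.10 and 192.168.1.200 are on the same network, while 192.168.1.10 and 192.168.2.10 are not.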

Now back to the packet trace we captured from observing the ping command:

ARP, Request who-has tell, length 28
ARP, Reply is-at c4:2c:03:1c:f2:2e, length 46
IP > ICMP echo request, id 17785, seq 1, length 64
IP > ICMP echo reply, id 17785, seq 1, length 64
IP > ICMP echo request, id 17785, seq 2, length 64
IP > ICMP echo reply, id 17785, seq 2, length 64

The first 2 lines represent an exchange of the ARP or Address Resolution Protocol, which helps to associate IP network layer addresses with the corresponding link-layer MAC address.  Based on the network interface configuration above, the Linux kernel knows that the IP address should be located somewhere on the L2 network attached to the Ethernet port eth0, where it broadcasts an ARP request for that address, hoping that the host with that address will answer. Once the two IP layer endpoints know how to reach each other via the Ethernet network, they start exchanging ICMP request & reply packets. The mappings between MAC addresses and IP addresses on local interfaces are kept for a few minutes in the ARP cache of the kernel and then refreshed again when needed. We can use the arp command to see which IP to MAC address associations are currently active:

pi@raspberrypi ~ $ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
                         ether   c4:2c:03:1c:f2:2e   C                     eth0
                         ether   58:6d:8f:d7:77:2c   C                     eth0

Besides the address which we just used, there is also another association in the ARP cache, which we have never used directly, but if we also look at the IP routing table in the Linux kernel, we can understand where this address is coming from:

pi@raspberrypi ~ $ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
                                                UG        0 0          0 eth0
                                                U         0 0          0 eth0

At the IP network layer, we can connect to many hosts outside our local L2 network, in fact to any host on the public Internet. Similar to postal codes, IP addresses are assigned in a logical and hierarchical fashion to make it easy to route packets to any destination using IP routers as layer-3 gateways between different layer-2 networks. Our routing table here contains two packet forwarding rules for how to reach different ranges of IP addresses: any address between and can be directly reached on the Ethernet L2 network on port eth0, while packets for all other destinations are sent to to be forwarded to the final destination. And since is itself in the address range of the eth0 local network, we can find its MAC address mapping in the ARP cache. If several address ranges overlap like in this case, the most specific one (i.e. the range with the fewest addresses) is chosen. This system of hierarchical address assignment and routing is also called Classless Inter-Domain Routing (CIDR) and has been the foundation of Internet routing since the early 1990s.
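The rule of picking the most specific matching range can be sketched as a small lookup function in C. This is a simplified model and not how the Linux kernel actually stores its routes; the struct layout, the IPV4 macro and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack four octets into a 32-bit address. */
#define IPV4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* One line of a simplified routing table (all values in host byte order). */
struct route {
    uint32_t dest;      /* network address of the range */
    uint32_t mask;      /* netmask (Genmask) of the range */
    uint32_t gateway;   /* next hop, 0 if directly connected */
    const char *iface;  /* outgoing interface */
};

/* Number of network bits in a mask: more bits means fewer addresses. */
int prefix_len(uint32_t mask)
{
    int bits = 0;
    while (mask != 0) {
        bits += (int)(mask & 1u);
        mask >>= 1;
    }
    return bits;
}

/* Return the most specific route that contains addr, or NULL if none does. */
const struct route *route_lookup(const struct route *table, size_t n,
                                 uint32_t addr)
{
    const struct route *best = NULL;

    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((addr & table[i].mask) != table[i].dest)
            continue;  /* addr is outside this range */
        if (best == NULL || prefix_len(table[i].mask) > prefix_len(best->mask))
            best = &table[i];
    }
    return best;
}
```

A default route is simply a range with destination 0.0.0.0 and mask 0.0.0.0: it matches every address, but loses against any more specific entry such as the one for the local Ethernet network.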

The following packet trace is the result of downloading the little “favicon” icon from an address corresponding to Instead of a real web-browser, we use the wget command to trigger the http protocol exchange, but the low-level protocol flow is the same.

pi@raspberrypi ~ $ wget
--2014-02-22 01:08:26--
Connecting to connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [image/x-icon]

The download causes a TCP transport layer connection to be opened between (the IP address of our Raspberry Pi) and a remote host somewhere in the Internet. Any TCP connection is additionally identified by a source and destination port number pair (47598 and 80 respectively), which allows multiple applications to open many parallel sessions between the same two hosts. Destination port numbers can also act as the well-known access point for reaching a particular service, for example TCP port 80 is the default for the HTTP protocol.

IP > Flags [S], seq 2323672093, win 14600, options [mss 1460,sackOK,TS val 261765911 ecr 0,nop,wscale 2], length 0
IP > Flags [S.], seq 1259406109, ack 2323672094, win 42540, options [mss 1430,sackOK,TS val 659723559 ecr 261765911,nop,wscale 6], length 0
IP > Flags [.], ack 1, win 3650, options [nop,nop,TS val 261765912 ecr 659723559], length 0
IP > Flags [P.], seq 1:133, ack 1, win 3650, options [nop,nop,TS val 261765913 ecr 659723559], length 132
IP > Flags [.], ack 133, win 663, options [nop,nop,TS val 659723582 ecr 261765913], length 0
IP > Flags [.], seq 1:1419, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418
IP > Flags [.], ack 1419, win 4374, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP > Flags [.], seq 1419:2837, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418

IP > Flags [.], ack 2837, win 5098, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP > Flags [.], seq 2837:4255, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418
IP > Flags [.], ack 4255, win 5822, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP > Flags [.], seq 4255:5673, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418
IP > Flags [.], ack 5673, win 6546, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP > Flags [P.], seq 5673:5813, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 140
IP > Flags [.], ack 5813, win 7255, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP > Flags [F.], seq 133, ack 5813, win 7255, options [nop,nop,TS val 261765919 ecr 659723593], length 0
IP > Flags [F.], seq 5813, ack 134, win 663, options [nop,nop,TS val 659723646 ecr 261765919], length 0
IP > Flags [.], ack 5814, win 7255, options [nop,nop,TS val 261765921 ecr 659723646], length 0

The main purpose of TCP is to provide a connection to transfer a stream of bytes reliably and in order across a potentially lossy L3 IP network. To achieve that, the stream of bytes is broken up into small enough segments that each fit into an IP packet to be sent across the network. If we continue the postal service analogy for IP, then TCP is like having a single conversation through a series of letters sent by registered mail with return receipt.
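The splitting of the byte stream can be illustrated with a small C sketch. In the trace above, the server's 5812 byte response arrives as four segments of 1418 bytes followed by one of 140 bytes (the 1418 most likely being the negotiated MSS of 1430 minus 12 bytes of TCP timestamp option). The function and struct names below are assumptions for this example, and a real TCP implementation takes many more factors into account when sizing segments.

```c
#include <assert.h>
#include <stddef.h>

/* One piece of the byte stream: where it starts and how long it is. */
struct segment {
    size_t offset;
    size_t length;
};

/* Split a stream of `total` bytes into segments of at most `max_seg` bytes.
 * Up to `cap` segments are written to `out`; the return value is the number
 * of segments needed, which may be larger than `cap`. */
size_t segment_stream(size_t total, size_t max_seg,
                      struct segment *out, size_t cap)
{
    size_t count = 0;

    assert(max_seg > 0);
    for (size_t off = 0; off < total; off += max_seg) {
        size_t remaining = total - off;
        size_t len = remaining < max_seg ? remaining : max_seg;

        if (count < cap) {
            out[count].offset = off;
            out[count].length = len;
        }
        count++;
    }
    return count;
}
```

Calling segment_stream(5812, 1418, ...) yields five segments. Their offsets start at 0, while tcpdump prints relative sequence numbers starting at 1, so offset 0 corresponds to "seq 1:1419" in the trace.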

The TCP protocol adds a control header to each IP packet which allows it to reassemble the byte stream again at the receiver. To make sure that in the end no packets are missing, the receiver acknowledges (ACK) how much of the byte stream it has received so far, based on the sequence numbers (SEQ) in the data packets. If the sender does not get an acknowledgement for a particular sequence in time, it can retransmit the potentially missing pieces. As part of ensuring reliable transport of the data, TCP needs to negotiate a connection between the two parties of the conversation. The S & F flags (SYN and FIN respectively) in the packet trace above show the handshake by which sender and receiver negotiate the beginning and end of the TCP connection.

TCP connections are always bi-directional. In this example of an HTTP GET request, a small request is first sent from our local client, to which the remote server responds by sending the image data over the same connection in the reverse direction. Traditionally we call the initiator of the connection the client and the target of the connection the server.

UDP, the other common transport layer protocol of the Internet protocol suite, is not much more than a simple wrapper around IP to provide similar multiplexing of application sessions using port numbers as we have seen for TCP.

In Linux based systems, the transport layer is typically the boundary of what is provided as a service by the kernel and what is implemented by some application in user space. We can use netstat to list all the active transport layer connections and services, including which processes own them (only as root).

pi@raspberrypi ~ $ sudo netstat -nap --inet
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0    *               LISTEN      1279/apache2  
tcp        0      0    *               LISTEN      1704/sshd      
tcp        0      0   *               LISTEN      1547/cupsd    
tcp        0    336     ESTABLISHED 2509/sshd: pi [priv
tcp        0      0      TIME_WAIT   -              
udp        0      0 *                           1387/dhclient  
udp        0      0  *                           1510/avahi-daemon:
udp        0      0    *                           1387/dhclient  
udp        0      0*                           1623/ntpd      
udp        0      0 *                           1623/ntpd      
udp        0      0   *                           1623/ntpd      
udp        0      0 *                           1510/avahi-daemon:

We can see that there are only 2 concrete TCP connections, an ssh session and the expiring state of the connection to the webserver used in the example above. All the other entries are service endpoints, which a particular process has created in order to tell the kernel that it would be ready to handle incoming TCP or UDP transport connections for a particular port number - e.g. port 22 is the standard service port for ssh or port 80 for http.
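How a process creates such a service endpoint can be sketched with the standard BSD socket calls in C. This is a minimal example under a few assumptions: the function name is made up, the endpoint is bound to the loopback address only, and no connections are actually accepted.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP service endpoint on the loopback address.
 * Passing port 0 asks the kernel to pick any free port.
 * Returns the listening socket, or -1 on error; the port actually
 * in use is stored in *bound_port. */
int open_listener(uint16_t port, uint16_t *bound_port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int fd;

    assert(bound_port != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* bind() claims the port, listen() marks the socket as a service
     * endpoint, getsockname() tells us which port we ended up with. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *bound_port = ntohs(addr.sin_port);
    return fd;
}
```

While such a socket is open, it should show up in the netstat output as an additional tcp entry in state LISTEN. A real server would go on to call accept() in a loop to handle incoming connections.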

During this tour through the lowest levels of the Linux networking stack, we have been dealing exclusively with numerical addresses and port numbers, while as regular users of the Internet, we are accustomed to descriptive names instead. In one of the following episodes of Linux Toolshed, we will be exploring how hosts on the network get their names and addresses.

Give me a ping, Vasili!

The Linux command line has a rich set of powerful tools. Today we are looking at some examples of commands which allow us to troubleshoot networking issues.

Let’s picture a situation where our network connection to the Internet was working fine yesterday, but suddenly seems to be down or at least has become very unreliable. We also know the IP address of our firewall/router connecting to the Internet, as well as the address of an arbitrary point out on the Internet, for example the access address of the Google public DNS service.

One of the first tools an experienced network administrator might reach for in such a situation is the ping utility. It lets us test the connectivity and measure delay and packet loss to any host in the network. First things first, let’s see if we can still reach the router on our local network:

pi@raspberrypi ~ $ ping -c3
PING ( 56(84) bytes of data.
64 bytes from icmp_req=1 ttl=64 time=0.854 ms
64 bytes from icmp_req=2 ttl=64 time=0.782 ms
64 bytes from icmp_req=3 ttl=64 time=0.778 ms

--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.778/0.804/0.854/0.047 ms

The ping command works by sending probe packets to the destination and waiting for them to be sent back. For that it uses the special echo request and echo reply messages defined by the Internet Control Message Protocol (ICMP). A communication protocol is a well specified convention on how two systems can communicate with each other. In the case of ICMP, this convention is documented as an Internet standard in RFC 792.

Since ICMP echo request/reply processing is typically part of the lowest level of networking support in each properly implemented Internet host, it is a very reliable way to determine if a host can be reached over the network.
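To give an idea of what such a probe looks like on the wire, here is a sketch in C that assembles an ICMP echo request following the layout in RFC 792: an 8 byte header (type 8, code 0, checksum, identifier, sequence number) followed by the payload. The function names are assumptions for this example, and the code only builds the bytes in memory; actually sending them would require a raw socket and root privileges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum: one's complement of the one's complement sum of all
 * 16-bit big-endian words, with an odd trailing byte padded with zero. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Write an ICMP echo request into buf. Returns the message length,
 * or 0 if the buffer is too small. */
size_t build_echo_request(uint8_t *buf, size_t cap, uint16_t id, uint16_t seq,
                          const uint8_t *payload, size_t payload_len)
{
    size_t total = 8 + payload_len;
    uint16_t csum;

    assert(buf != NULL);
    if (cap < total)
        return 0;
    buf[0] = 8;                             /* type: echo request */
    buf[1] = 0;                             /* code: always 0 */
    buf[2] = 0;                             /* checksum, filled in below */
    buf[3] = 0;
    buf[4] = (uint8_t)(id >> 8);
    buf[5] = (uint8_t)(id & 0xff);
    buf[6] = (uint8_t)(seq >> 8);
    buf[7] = (uint8_t)(seq & 0xff);
    if (payload_len > 0)
        memcpy(buf + 8, payload, payload_len);
    csum = inet_checksum(buf, total);
    buf[2] = (uint8_t)(csum >> 8);
    buf[3] = (uint8_t)(csum & 0xff);
    return total;
}
```

With 56 bytes of payload this produces a 64 byte ICMP message, which corresponds to the "64 bytes from" lines in the ping output; adding the 20 byte IP header gives the 84 in "56(84) bytes of data". A receiver can validate the message by running inet_checksum() over all of it and checking that the result is 0.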

The ping utility is nearly as old as the Internet itself and is named after the eerie, sharp, metallic sound of an acoustic sonar probe, which we might be familiar with from submarine movies.

From the output above we can see that our Internet gateway still exists on the network and that we can fairly reliably reach it in less than a millisecond round-trip time. Out of 3 probe packets we sent, 3 responses were received and the fluctuation in the response time is quite low.
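The summary lines at the end of the ping output are simple arithmetic over the individual probes, which can be sketched in C as follows. The struct and function names are assumptions for this example, and the mdev value that ping also prints is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one ping run; round-trip times are in milliseconds. */
struct ping_stats {
    size_t sent;
    size_t received;
    double loss_percent;
    double rtt_min;
    double rtt_avg;
    double rtt_max;
};

/* Summarize a run in which `sent` probes went out and `received` replies
 * came back with the round-trip times listed in rtt_ms. */
struct ping_stats summarize_pings(const double *rtt_ms, size_t received,
                                  size_t sent)
{
    struct ping_stats s = { sent, received, 0.0, 0.0, 0.0, 0.0 };
    double sum = 0.0;

    assert(received <= sent);
    if (sent > 0)
        s.loss_percent = 100.0 * (double)(sent - received) / (double)sent;
    if (received == 0)
        return s;                 /* no replies: no timing information */

    s.rtt_min = s.rtt_max = rtt_ms[0];
    for (size_t i = 0; i < received; i++) {
        if (rtt_ms[i] < s.rtt_min)
            s.rtt_min = rtt_ms[i];
        if (rtt_ms[i] > s.rtt_max)
            s.rtt_max = rtt_ms[i];
        sum += rtt_ms[i];
    }
    s.rtt_avg = sum / (double)received;
    return s;
}
```

Fed with the three round-trip times from the output above (0.854, 0.782 and 0.778 ms), this gives a minimum of 0.778, a maximum of 0.854 and an average of just under 0.805 ms with 0% packet loss.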

Instead of using ping -c 3 , we could also just use ping in which case the program runs forever until interrupted by the user, sending a request every second. This lets us observe the state of network connectivity over time, for example as we wiggle network cables or plug and unplug devices.

As we can see from man ping, there are many more options which can be specified. Some particularly interesting ones are:
  • -c count : only send <count> probes and then stop
  • -i interval : send a probe approximately every <interval> seconds (default 1 second)
  • -s size : send probe packets with <size> bytes of data (default 56, which gives 64 bytes including the 8-byte ICMP header)
  • -n : don’t try to translate numeric IP addresses into hostnames
In another example

pi@raspberrypi ~ $ ping -c 3
PING ( 56(84) bytes of data.

--- ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

shows that the host with IP address is currently not reachable from our network. Since we can still reach our router, but not an arbitrary address outside, we now suspect that our local area network is working fine, but that there might be a problem with the connection of our router to the Internet.

Besides cases with either 100% success or 100% failure, there can be situations with irregular packet loss, which might be caused by flaky network cables, loose connectors, or unstable or overloaded gateways in the network. Even with little or no packet loss, high delays or a high variation of delay might degrade the performance of higher level protocols like HTTP or ssh.

Like any proper Internet host, the Linux kernel in our own Raspberry Pi contains a responder for ICMP echo requests, so we can effectively test our own networking stack and how fast our kernel can process small packets:

pi@raspberrypi ~ $ ping -c 5  localhost
PING localhost ( 56(84) bytes of data.
64 bytes from localhost ( icmp_req=1 ttl=64 time=0.153 ms
64 bytes from localhost ( icmp_req=2 ttl=64 time=0.155 ms
64 bytes from localhost ( icmp_req=3 ttl=64 time=0.163 ms
64 bytes from localhost ( icmp_req=4 ttl=64 time=0.146 ms
64 bytes from localhost ( icmp_req=5 ttl=64 time=0.201 ms

--- localhost ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.146/0.163/0.201/0.024 ms

While ping is a quick and easy way to determine if we can reach a destination or not, traceroute allows us to find out more about each hop our traffic is taking towards the destination. For example, in the situation above, we can use traceroute to see where exactly the traffic stops going through and where the problem might be:

pi@raspberrypi ~ $ traceroute
traceroute to (, 30 hops max, 60 byte packets
 1 (  0.885 ms  0.800 ms  0.838 ms
 2  * * *
 3 (  35.432 ms  35.328 ms  35.082 ms
 4 (  34.854 ms  34.642 ms  34.429 ms
 5 (  33.977 ms  33.460 ms  33.244 ms
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *

This appears to be a problem a few hops away from our Internet connection itself. And indeed, a few moments later, the service is restored and we can again successfully reach the destination:

pi@raspberrypi ~ $ traceroute
traceroute to (, 30 hops max, 60 byte packets
 1 (  0.965 ms  0.906 ms  1.072 ms
 2  * * *
 3 (  11.581 ms  17.977 ms  17.604 ms
 4 (  20.844 ms  20.765 ms  20.367 ms
 5 (  15.715 ms  15.642 ms  15.240 ms
 6 (  20.088 ms  13.548 ms (  13.250 ms
 7 (  18.371 ms (  21.770 ms (  20.267 ms
 8 (  20.250 ms (  20.315 ms (  19.785 ms
 9 (  20.204 ms (  31.628 ms (  19.167 ms
10  * * *
11 (  19.840 ms  19.633 ms  19.524 ms

The traceroute command shows the addresses and hostnames of all the routers which packets pass through on their way from our Raspberry Pi to the destination. For that, traceroute takes advantage of another ICMP feature, the time-to-live expiry message. Every packet sent through the Internet carries a limit on how many times it can be passed on by routers. This limit is decremented at each hop to prevent packets from going around forever if they can’t find their destination. When the limit reaches zero, the router discards the packet and sends an ICMP message back to alert the sender that the packet has been discarded.

In order to discover a network path, traceroute sends out a series of packets (default 3) with a time-to-live limited to 1, just to see who will send back an ICMP time exceeded message and then repeats this process with increasing time-to-live values until it reaches the destination.
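The discovery loop can be sketched as a small simulation in C. Nothing is sent over a real network here: the path is modelled as just a number of routers in front of the destination, the function names are made up, and only one probe per time-to-live value is sent instead of three.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate one probe travelling along a path with `routers_on_path`
 * routers in front of the destination. Every router decrements the
 * time-to-live and discards the probe when it reaches zero.
 * Returns the 1-based position of the router that reported "time
 * exceeded", or 0 if the probe made it to the destination. */
size_t send_probe(size_t routers_on_path, unsigned ttl)
{
    assert(ttl >= 1);
    for (size_t hop = 1; hop <= routers_on_path; hop++) {
        ttl--;
        if (ttl == 0)
            return hop;
    }
    return 0;
}

/* Probe with time-to-live 1, 2, 3, ... until the destination answers or
 * max_ttl is reached. The reporting router of each probe is stored in
 * expired_at (up to cap entries); returns the number of routers found. */
size_t discover_path(size_t routers_on_path, unsigned max_ttl,
                     size_t *expired_at, size_t cap)
{
    size_t found = 0;

    for (unsigned ttl = 1; ttl <= max_ttl; ttl++) {
        size_t hop = send_probe(routers_on_path, ttl);

        if (hop == 0)
            break;                /* destination reached: path is complete */
        if (found < cap)
            expired_at[found] = hop;
        found++;
    }
    return found;
}
```

In this model a probe with time-to-live n is always reported by the n-th router, so the path is revealed one hop at a time. Routers that do not send the ICMP message at all are what shows up as "* * *" in the real traceroute output.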

There are many more useful commands to look at the state of the network or test its performance, but the main advantage of tools like ping and traceroute is that they work directly with support deep in the operating system kernel. This can sometimes mean that a computer is still responding to ping requests, even if it otherwise appears to be completely stuck or has no networking applications running.

When network applications like ssh or a web browser are not working properly, tools like ping and traceroute are great for figuring out whether there is a low-level networking problem or whether the problem is with the application itself.

In a future episode, we will take a closer look at the layers of networking support in the Linux kernel and some tools to look at them. Until then, can you find out how we can determine the address of our local Internet gateway? Hint: have a look at the netstat command.

A similar version of this article appeared in The MagPi Magazine issue 21.