Sunday, May 15, 2016

A Matter of Protocol

Today, we are taking a look at the TCP/IP Internet protocols by using a few commands which allow us to see what is going on in the Linux kernel, all the way down to the Ethernet wire.

For example, the basic ping command we have seen in a previous post might result in the following sequence of packets sent on the Ethernet port of the Raspberry Pi:

ARP, Request who-has 192.168.1.143 tell 192.168.1.148, length 28
ARP, Reply 192.168.1.143 is-at c4:2c:03:1c:f2:2e, length 46
IP 192.168.1.148 > 192.168.1.143: ICMP echo request, id 17785, seq 1, length 64
IP 192.168.1.143 > 192.168.1.148: ICMP echo reply, id 17785, seq 1, length 64
IP 192.168.1.148 > 192.168.1.143: ICMP echo request, id 17785, seq 2, length 64
IP 192.168.1.143 > 192.168.1.148: ICMP echo reply, id 17785, seq 2, length 64

But before we try to understand what is going on here, let us take a brief excursion into the history and theory of packet networking.

In the 1960s, researchers started to think of packet switching as a more economical and resilient alternative to connecting computers directly to each other with wires. The idea was to break up communications into small blocks or “packets” and send them independently across a packet-switched network to a destination, which then reassembles them into a stream of communication. This is similar to how the postal mail system has worked for centuries: if the postal system were fast enough, it would be possible to hold a conversation in real time by sending a series of short letters back and forth. In fact, nearly all telephone conversations today are transmitted this way over packet-switched networks, without any noticeable impact on the user.

When many mutually incompatible implementations of packet-based computer networks started to emerge through the 1970s, the International Organization for Standardization (ISO) attempted to define an open and global architecture that would enable a single worldwide computer network, similar to the already existing telephone system.

Over the following decades, the growing popularity of the much simpler Internet Protocol (IP) suite made this dream a reality in the form of the global Internet. Network engineers nonetheless still use the comprehensive 7-layer Open Systems Interconnection (OSI) reference model to reason and talk about computer network architecture.



The OSI model defines a 7-layer stack of protocols, each communicating logically with peers at its layer and each providing a specific function:


  1. The Physical Layer deals with defining electrical signals over wires, modulations, radio-frequency spectrum, properties of lasers and fiber optic cables and so on.
  2. The Link Layer is concerned with creating a local network connection between 2 or more systems using a single physical medium. The most popular families of link layers are point-to-point (e.g. for fiber optic cables) and shared or multiple-access networks like the Ethernet supported by the Raspberry Pi model B. Many of those are defined in the 802.X family of standards issued by the Institute of Electrical and Electronics Engineers (IEEE).
  3. The purpose of the Network Layer is to provide a uniform and universal way to address and reach any connected node globally no matter where it is and what kind of (link-layer) network technology it is connected to. It fuses several heterogeneous and mutually incompatible networks into a single “meta” or “Inter”-network. The Internet Protocol (IP) proposed in 1980 for this purpose is what gave the Internet its name.
  4. The Transport Layer is responsible for getting a logical stream of data somewhat reliably across the underlying network. The most popular implementations are the Transmission Control Protocol (TCP), which provides a reliable byte-stream service, and the User Datagram Protocol (UDP), which provides more direct access to the unreliable, packet-based service of an IP network.
  5. The Session Layer can be seen as combined with the transport layer, e.g. in the case of TCP, which requires a strong session concept. Session management and multiplexing on top of a single transport stream (e.g. HTTP/1.1 or SPDY), or explicit session management and control protocols for real-time streaming like SIP or RTCP, could also be considered session layer functions.
  6. The Presentation Layer deals with data representation and encoding. This could be things like ASCII or Unicode text encoding, HTML, XML, JSON etc.
  7. The Application Layer provides the use case and purpose of an entire specialized protocol stack. This could for example be file transfer (FTP or rsync), remote login (telnet or ssh), email, web services and many more.
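The layering in the list above can be pictured as nested envelopes: each layer wraps whatever the layer above hands down in its own header. The following Python sketch only illustrates the sizes involved. It assumes minimal headers without options (14 bytes for Ethernet, 20 for IPv4, 20 for TCP) and fills them with placeholder bytes instead of real header fields.

```python
ETH_HEADER_LEN = 14   # destination MAC (6) + source MAC (6) + EtherType (2)
IPV4_HEADER_LEN = 20  # minimal IPv4 header, no options (assumption)
TCP_HEADER_LEN = 20   # minimal TCP header, no options (assumption)


def encapsulate(payload: bytes) -> dict:
    """Wrap an application payload in placeholder headers, layer by layer."""
    segment = b"T" * TCP_HEADER_LEN + payload   # layer 4: TCP segment
    packet = b"I" * IPV4_HEADER_LEN + segment   # layer 3: IP packet
    frame = b"E" * ETH_HEADER_LEN + packet      # layer 2: Ethernet frame
    return {"payload": payload, "segment": segment, "packet": packet, "frame": frame}


layers = encapsulate(b"GET /favicon.ico HTTP/1.1\r\n\r\n")
for name, data in layers.items():
    print(f"{name:8s} {len(data):4d} bytes")
```

Under these assumptions every application payload picks up 54 bytes of overhead on its way down to the wire.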


In the simplified view of the TCP/IP or Internet protocol suite, layers 2, 3 and 4 are clearly separated and part of the Linux kernel itself, while the upper layers are much fuzzier and potentially part of complex user-space applications, e.g. a web server or browser implementation.


The trace at the beginning of the article was generated with a tool called tcpdump.
It allows us to capture and decode all traffic that is sent and received by a link-layer interface in the kernel.

WARNING: Intercepting traffic from other users on shared networks might be considered unlawful in some countries or against policy in some organizations (schools, universities or companies). If this situation might apply to you, please check with your network administrator first before starting to play with tcpdump!

The tcpdump tool is typically not pre-installed in Raspbian, but we can easily get it with sudo apt-get install tcpdump.

The above trace is then generated by running

pi@raspberrypi ~ $ sudo tcpdump -nt  host 192.168.1.148 and not tcp port 22

in one window and for example
pi@raspberrypi ~ $ ping 192.168.1.143

in another. The command line options -n and -t suppress name resolution and timestamps respectively and just help to make the output a bit more compact for the example. tcpdump supports many types of filter expressions, which allow us to restrict what is being captured and decoded to a particular protocol type, application type or address range.

In particular, when running tcpdump over an ssh session on the same network we are capturing from, we could easily create a nasty feedback loop: every line of output travels over ssh and produces new packets, which tcpdump captures and reports in turn. Restricting the capture to our local interface address and excluding any traffic on TCP port 22 (the ssh protocol) allows us to look at all network traffic to and from the Raspberry Pi, not including the ssh sessions.
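The logic of the filter expression "host 192.168.1.148 and not tcp port 22" can be sketched as a simple predicate over packets. This is only an illustration of the matching rule, not how tcpdump works internally; the Packet record and the sample packets are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Hypothetical, heavily simplified view of a captured packet."""
    src: str
    dst: str
    proto: str                   # "tcp", "udp", "icmp", ...
    sport: Optional[int] = None
    dport: Optional[int] = None


def matches(pkt: Packet, host: str = "192.168.1.148", excluded_tcp_port: int = 22) -> bool:
    """Mimic 'host HOST and not tcp port PORT'."""
    involves_host = host in (pkt.src, pkt.dst)
    is_excluded = pkt.proto == "tcp" and excluded_tcp_port in (pkt.sport, pkt.dport)
    return involves_host and not is_excluded


samples = [
    Packet("192.168.1.148", "192.168.1.143", "icmp"),                  # our ping
    Packet("192.168.1.141", "192.168.1.148", "tcp", 54757, 22),        # our ssh session
    Packet("192.168.1.148", "173.194.113.115", "tcp", 47598, 80),      # a web request
    Packet("192.168.1.141", "192.168.1.1", "udp", 40000, 53),          # somebody else's traffic
]
for pkt in samples:
    print(matches(pkt), pkt)
```

Only the ping and the web request pass the filter; the ssh session and the traffic between other hosts are dropped.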

But how do we find what our own network addresses are? For that, the ifconfig command provides a lot of useful information about layer 1-3 state of all our network interfaces:

pi@raspberrypi ~ $ ifconfig -a
eth0      Link encap:Ethernet  HWaddr b8:27:eb:13:f7:57
          inet addr:192.168.1.148  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17038947 errors:8 dropped:7 overruns:0 frame:7
          TX packets:11258008 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3037900297 (2.8 GiB)  TX bytes:1335045310 (1.2 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:66130 errors:0 dropped:0 overruns:0 frame:0
          TX packets:66130 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19241026 (18.3 MiB)  TX bytes:19241026 (18.3 MiB)

From that we can learn that

  - we have 2 network interfaces, eth0 and lo
  - at the link layer, eth0 is an Ethernet interface (802.3), while lo is a virtual dummy/loopback interface

Focusing a bit more on the Ethernet interface we are interested in:

  - it has a HWaddr of b8:27:eb:13:f7:57, which is the MAC (Media Access Control) address used by hosts on IEEE 802.X multi-access networks (like Ethernet, WiFi and others) to communicate with each other on the local L2 network. This address is assigned to each 802.X compatible interface by the hardware manufacturer as a kind of globally unique serial number. The prefix B8:27:EB is assigned to the Raspberry Pi Foundation to generate all its necessary addresses for the model B Ethernet ports.
  - the MTU or Maximum Transmission Unit on this interface is 1500, which is the default for Ethernet. This means that no IP packet larger than 1500 bytes (including the IP and higher-layer headers, but not the Ethernet frame header) can be sent on this interface.
  - and finally the answer we were looking for: the L3/IP network address of this interface is 192.168.1.148 with a network mask of 255.255.255.0, which means that all addresses between 192.168.1.1 and 192.168.1.254 are assumed to be hosts on the same Ethernet L2 network.
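The address arithmetic behind these observations can be reproduced with Python's standard ipaddress module. The sketch below takes the address, netmask and MAC address from the ifconfig output above and derives the network, the broadcast address, the usable host range and the manufacturer prefix.

```python
import ipaddress

# Address and netmask exactly as reported by ifconfig for eth0.
iface = ipaddress.ip_interface("192.168.1.148/255.255.255.0")
network = iface.network
hosts = list(network.hosts())  # usable host addresses, without network and broadcast address

print("network:   ", network)
print("broadcast: ", network.broadcast_address)
print("host range:", hosts[0], "-", hosts[-1])

# Is another address on the same L2 network according to the mask?
for other in ("192.168.1.143", "173.194.113.115"):
    print(other, "is local:", ipaddress.ip_address(other) in network)

# The first three bytes of the MAC address identify the manufacturer.
mac = "b8:27:eb:13:f7:57"
oui = ":".join(mac.upper().split(":")[:3])
print("manufacturer prefix:", oui)
```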

Now back to the packet trace we captured from observing the ping command:

ARP, Request who-has 192.168.1.143 tell 192.168.1.148, length 28
ARP, Reply 192.168.1.143 is-at c4:2c:03:1c:f2:2e, length 46
IP 192.168.1.148 > 192.168.1.143: ICMP echo request, id 17785, seq 1, length 64
IP 192.168.1.143 > 192.168.1.148: ICMP echo reply, id 17785, seq 1, length 64
IP 192.168.1.148 > 192.168.1.143: ICMP echo request, id 17785, seq 2, length 64
IP 192.168.1.143 > 192.168.1.148: ICMP echo reply, id 17785, seq 2, length 64

The first 2 lines represent an exchange of the ARP or Address Resolution Protocol, which helps to associate IP network layer addresses with the corresponding link-layer MAC addresses. Based on the network interface configuration above, the Linux kernel knows that the IP address 192.168.1.143 should be located somewhere on the L2 network attached to the Ethernet port eth0, so it broadcasts an ARP request for that address there, hoping that the host with that address will answer. Once the two IP layer endpoints know how to reach each other via the Ethernet network, they start exchanging ICMP echo request and reply packets. The mappings between MAC addresses and IP addresses on local interfaces are kept for a few minutes in the ARP cache of the kernel and refreshed when needed. We can use the arp command to see which IP to MAC address associations are currently active:

pi@raspberrypi ~ $ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.1.143            ether   c4:2c:03:1c:f2:2e   C                     eth0
192.168.1.1              ether   58:6d:8f:d7:77:2c   C                     eth0
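To make the ARP request from the trace a bit more tangible, here is a sketch that builds such a message by hand with Python's struct module, following the standard ARP layout for Ethernet and IPv4. The result is 28 bytes long, matching the "length 28" in the trace. The reply's length of 46 is probably the same kind of message padded up to the Ethernet minimum payload size; that is an interpretation, not something the trace states.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'b8:27:eb:13:f7:57' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Build the ARP payload (without the Ethernet frame header)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # operation: 1 = request, 2 = reply
        mac_to_bytes(sender_mac),     # who is asking (MAC)
        socket.inet_aton(sender_ip),  # who is asking (IP): "tell ..."
        bytes(6),                     # target MAC is unknown, so all zeros
        socket.inet_aton(target_ip),  # the address we are looking for: "who-has ..."
    )


request = build_arp_request("b8:27:eb:13:f7:57", "192.168.1.148", "192.168.1.143")
print(len(request), request.hex())
```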

Besides the address which we just used, there is another association in the ARP cache which we have never used directly. If we look at the IP routing table in the Linux kernel, we can understand where this address comes from:

pi@raspberrypi ~ $ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         192.168.1.1     0.0.0.0         UG        0 0          0 eth0
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0

At the IP network layer, we can connect to many hosts outside our local L2 network, in fact to any host on the public Internet. Similar to postal codes, IP addresses are assigned in a logical and hierarchical fashion to make it easy to route packets to any destination using IP routers as layer-3 gateways between different layer-2 networks. Our routing table here contains two packet forwarding rules for how to reach different ranges of IP addresses: any address between 192.168.1.0 and 192.168.1.255 can be reached directly on the Ethernet L2 network on port eth0, while packets for all other destinations are sent to 192.168.1.1 to be forwarded to their final destination. And since 192.168.1.1 is itself in the address range of the eth0 local network, we can find its MAC address mapping in the ARP cache. If several address ranges overlap, as in this case, the most specific one (i.e. the range with the fewest addresses) is chosen. This system of hierarchical address assignment and routing is also called Classless Inter-Domain Routing (CIDR) and has been used as the foundation of Internet routing since the early 1990s.
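The "most specific range wins" rule is usually called longest-prefix matching. A minimal sketch with the two routes from the table above could look like this; the ROUTES list and the lookup function are illustrative names, and the real kernel uses far more efficient data structures.

```python
import ipaddress

# (destination network, gateway or None if directly connected, interface)
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1", "eth0"),   # default route
    (ipaddress.ip_network("192.168.1.0/24"), None, "eth0"),       # local network
]


def lookup(destination: str):
    """Return (next hop, interface) for a destination using longest-prefix match."""
    dest = ipaddress.ip_address(destination)
    candidates = [route for route in ROUTES if dest in route[0]]
    # The longest prefix is the range with the fewest addresses.
    network, gateway, iface = max(candidates, key=lambda route: route[0].prefixlen)
    # Without a gateway, the destination itself is the next hop on the L2 network.
    next_hop = gateway if gateway is not None else destination
    return next_hop, iface


print(lookup("192.168.1.143"))     # delivered directly on the local network
print(lookup("173.194.113.115"))   # handed to the default gateway
```

The next hop is the address that then has to be resolved to a MAC address via ARP, which is why 192.168.1.1 shows up in the ARP cache.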

The following packet trace is the result of downloading the little “favicon” icon from an address corresponding to www.google.com. Instead of a real web browser, we use the wget command to trigger the HTTP exchange, but the low-level protocol flow is the same.

pi@raspberrypi ~ $ wget http://173.194.113.115/favicon.ico
--2014-02-22 01:08:26--  http://173.194.113.115/favicon.ico
Connecting to 173.194.113.115:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [image/x-icon]

The download results in a TCP transport layer connection being opened between 192.168.1.148 (the IP address of our Raspberry Pi) and a remote host 173.194.113.115 somewhere in the Internet. Any TCP connection is additionally identified by a source and destination port number pair (47598 and 80 respectively), which allows multiple applications to open many parallel sessions between the same two hosts. Destination port numbers can also act as the well-known access point for reaching a particular service; for example, TCP port 80 is the default for the HTTP protocol.

IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [S], seq 2323672093, win 14600, options [mss 1460,sackOK,TS val 261765911 ecr 0,nop,wscale 2], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [S.], seq 1259406109, ack 2323672094, win 42540, options [mss 1430,sackOK,TS val 659723559 ecr 261765911,nop,wscale 6], length 0
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 1, win 3650, options [nop,nop,TS val 261765912 ecr 659723559], length 0
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [P.], seq 1:133, ack 1, win 3650, options [nop,nop,TS val 261765913 ecr 659723559], length 132
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [.], ack 133, win 663, options [nop,nop,TS val 659723582 ecr 261765913], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [.], seq 1:1419, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 1419, win 4374, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [.], seq 1419:2837, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418

IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 2837, win 5098, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [.], seq 2837:4255, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 4255, win 5822, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [.], seq 4255:5673, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 1418
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 5673, win 6546, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [P.], seq 5673:5813, ack 133, win 663, options [nop,nop,TS val 659723593 ecr 261765913], length 140
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 5813, win 7255, options [nop,nop,TS val 261765916 ecr 659723593], length 0
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [F.], seq 133, ack 5813, win 7255, options [nop,nop,TS val 261765919 ecr 659723593], length 0
IP 173.194.113.115.80 > 192.168.1.148.47598: Flags [F.], seq 5813, ack 134, win 663, options [nop,nop,TS val 659723646 ecr 261765919], length 0
IP 192.168.1.148.47598 > 173.194.113.115.80: Flags [.], ack 5814, win 7255, options [nop,nop,TS val 261765921 ecr 659723646], length 0
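Each line of the trace names a connection by its two address and port pairs. How such a 4-tuple lets a host keep several parallel connections to the same server apart can be sketched with a plain dictionary. The table contents and owner labels below are made up for the illustration; the kernel's real connection table is of course more involved.

```python
from typing import Dict, Optional, Tuple

# (local address, local port, remote address, remote port)
ConnectionKey = Tuple[str, int, str, int]

# Two hypothetical downloads from the same web server, told apart only by the local port.
connections: Dict[ConnectionKey, str] = {
    ("192.168.1.148", 47598, "173.194.113.115", 80): "first download",
    ("192.168.1.148", 50229, "173.194.113.115", 80): "second download",
}


def demultiplex(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> Optional[str]:
    """Find the owner of an incoming segment, or None if no connection matches."""
    # For an incoming segment, the destination is our local side.
    return connections.get((dst_ip, dst_port, src_ip, src_port))


print(demultiplex("173.194.113.115", 80, "192.168.1.148", 47598))
print(demultiplex("173.194.113.115", 80, "192.168.1.148", 50229))
print(demultiplex("173.194.113.115", 80, "192.168.1.148", 12345))
```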

The main purpose of TCP is to provide a connection to transfer a stream of bytes reliably and in order across a potentially lossy L3 IP network. To achieve that, the stream of bytes is broken up into small enough segments that each fit into an IP packet to be sent across the network. If we continue the postal service analogy for IP, then TCP is like having a single conversation through a series of letters sent by registered mail with return receipt.

The TCP protocol adds a control header to each IP packet which allows it to reassemble the byte stream at the receiver. To make sure that in the end no packets are missing, the receiver acknowledges (ACK) how much of the byte stream it has received so far, based on the sequence numbers (SEQ) in the data packets. If the sender does not get an acknowledgement for a particular sequence number in time, it can retransmit the potentially missing pieces. As part of ensuring reliable transport of the data, TCP needs to negotiate a connection between the two parties of the conversation. The S and F flags (SYN and FIN respectively) in the packet trace above show the handshake by which sender and receiver negotiate the beginning and end of the TCP connection.
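The interplay of sequence numbers and acknowledgements can be sketched in a few lines of Python. The example splits a 5812-byte stream (the size of the server response in the trace) into 1418-byte segments as seen in the trace, delivers them in scrambled order and lets a toy receiver compute the cumulative ACK after each arrival. It leaves out retransmission, timers, windows and the handshake, so it is an illustration of the idea and not of a real TCP implementation.

```python
import random

SEGMENT_SIZE = 1418  # payload bytes per segment, as observed in the trace


def segment(stream: bytes, first_seq: int = 1, size: int = SEGMENT_SIZE):
    """Split a byte stream into (seq, payload) pairs; seq numbers the first byte."""
    return [(first_seq + off, stream[off:off + size]) for off in range(0, len(stream), size)]


class Receiver:
    """Reassembles a byte stream from segments that may arrive in any order."""

    def __init__(self, first_seq: int = 1):
        self.next_expected = first_seq  # this value is sent back as the ACK
        self.pending = {}               # out-of-order segments, keyed by seq
        self.data = bytearray()

    def receive(self, seq: int, payload: bytes) -> int:
        if seq >= self.next_expected:   # ignore duplicates of delivered data
            self.pending[seq] = payload
        # Deliver everything that is now contiguous with the stream so far.
        while self.next_expected in self.pending:
            chunk = self.pending.pop(self.next_expected)
            self.data += chunk
            self.next_expected += len(chunk)
        return self.next_expected


stream = bytes(i % 256 for i in range(5812))
segments = segment(stream)
ranges = [(seq, seq + len(payload)) for seq, payload in segments]
print(ranges)

receiver = Receiver()
shuffled = segments[:]
random.Random(0).shuffle(shuffled)      # simulate reordering in the network
acks = [receiver.receive(seq, payload) for seq, payload in shuffled]
print(acks)
```

The printed ranges correspond to the seq fields in the trace (1:1419, 1419:2837 and so on), and the final ACK of 5813 means that all 5812 bytes have arrived.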

TCP connections are always bi-directional. In this example of an HTTP GET request, a small request is first sent from our local client, to which the remote server responds by sending the image data over the same connection in the reverse direction. Traditionally we call the initiator of the connection the client and the target of the connection the server.

UDP, the other common transport layer protocol of the Internet protocol suite, is not much more than a simple wrapper around IP that provides the same kind of multiplexing of application sessions using port numbers as we have seen for TCP.
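How thin this wrapper is can be seen from the UDP header, which consists of only four 16-bit fields: source port, destination port, length and checksum. The sketch below packs and unpacks such a header with Python's struct module. The port numbers and payload are arbitrary examples, and the checksum is left at zero, which in UDP over IPv4 means that no checksum was computed.

```python
import struct

UDP_HEADER_LEN = 8


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix a payload with an 8-byte UDP header (checksum left at 0)."""
    length = UDP_HEADER_LEN + len(payload)   # the length field covers header and data
    checksum = 0                             # 0 = no checksum computed (IPv4 only)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram: bytes):
    """Return (source port, destination port, payload)."""
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    return src_port, dst_port, datagram[UDP_HEADER_LEN:length]


datagram = build_udp_datagram(40000, 123, b"hello")
print(len(datagram), parse_udp_datagram(datagram))
```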

In Linux based systems, the transport layer is typically the boundary of what is provided as a service by the kernel and what is implemented by some application in user space. We can use netstat to list all the active transport layer connections and services, including which processes own them (only as root).

pi@raspberrypi ~ $ sudo netstat -nap --inet
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1279/apache2  
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1704/sshd      
tcp        0      0 0.0.0.0:631             0.0.0.0:*               LISTEN      1547/cupsd    
tcp        0    336 192.168.1.148:22        192.168.1.141:54757     ESTABLISHED 2509/sshd: pi [priv
tcp        0      0 192.168.1.148:50229     173.194.113.115:80      TIME_WAIT   -              
udp        0      0 0.0.0.0:61409           0.0.0.0:*                           1387/dhclient  
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           1510/avahi-daemon:
udp        0      0 0.0.0.0:68              0.0.0.0:*                           1387/dhclient  
udp        0      0 192.168.1.148:123       0.0.0.0:*                           1623/ntpd      
udp        0      0 127.0.0.1:123           0.0.0.0:*                           1623/ntpd      
udp        0      0 0.0.0.0:123             0.0.0.0:*                           1623/ntpd      
udp        0      0 0.0.0.0:57724           0.0.0.0:*                           1510/avahi-daemon:

We can see that there are only 2 concrete TCP connections: an ssh session and an expiring connection (in state TIME_WAIT) to the web server used in the example above. All the other entries are service endpoints, which a particular process has created in order to tell the kernel that it is ready to handle incoming TCP connections or UDP packets for a particular port number, e.g. port 22 is the standard service port for ssh and port 80 for HTTP.
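The difference between a service endpoint and a concrete connection can be tried out with Python's socket module. This sketch assumes that the loopback interface is available; it creates a listening socket on a port chosen by the kernel, connects to it from the same process, and shows that the accepted connection is identified by the client's address and port.

```python
import socket


def demo_listen_and_connect() -> dict:
    # A service endpoint: bound and listening, but not yet connected to anyone.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)
    server.bind(("127.0.0.1", 0))    # port 0 lets the kernel pick a free port
    server.listen(1)
    service_addr = server.getsockname()

    # A client connects; the kernel picks an ephemeral local port for it.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    client.connect(service_addr)
    conn, peer_addr = server.accept()  # a concrete connection with a known peer

    result = {
        "listening_on": service_addr,
        "client_local": client.getsockname(),
        "server_sees_peer": peer_addr,
    }
    for sock in (conn, client, server):
        sock.close()
    return result


info = demo_listen_and_connect()
for key, value in info.items():
    print(key, value)
```

While the script is running, the listening socket would show up in netstat as a LISTEN entry and the accepted connection as ESTABLISHED.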

During this tour through the lowest levels of the Linux networking stack, we have been dealing exclusively with numerical addresses and port numbers, while as regular users of the Internet, we are accustomed to descriptive names instead. In one of the following episodes of Linux Toolshed, we will be exploring how hosts on the network get their names and addresses.