Sunday, May 15, 2016

When you come to a fork() in the code, take it!

Linux is a multi-user, multi-tasking based system, which means that even a computer as small as the Raspberry Pi, can be used by multiple users simultaneously and there can be multiple processes executing (seemingly) all at once. For example, here are all the processes currently running for the user pi:

pi@raspberrypi ~ $ ps -fu pi
UID        PID  PPID  C STIME TTY          TIME CMD
pi        4792  4785  0 Mar11 ?        00:00:04 sshd: pi@pts/0   
pi        4793  4792  0 Mar11 pts/0    00:00:04 -bash
pi        6137  6130  0 00:30 ?        00:00:00 sshd: pi@pts/1   
pi        6138  6137  1 00:30 pts/1    00:00:01 -bash
pi        6185  4793  0 00:32 pts/0    00:00:00 tail -f /var/log/messages
pi        6186  6138  0 00:32 pts/1    00:00:00 ps -fu pi

Using a time sharing CPU scheduler and virtual memory, each process on Linux is led to believe that it has the whole computer all to itself, even if in reality the Linux operating system kernel is busy managing resources in the background to maintain this illusion.

Processes are among the most important concepts in Linux. A process is essentially a container for volatile resources like memory, network connection, open file handles etc. and is also associated with at least one thread of program execution. Much of the robustness of Linux is thanks to the containment and isolation which processes provide: when a program crashes only its process is terminated and cleaned up, and it doesn’t bring down the whole system.

Process Management

But how do we create such a process? Well, technically we don’t - we fork()it. Which means that a new process appears from an existing process making an replica of itself, using the fork() system call. After the fork, the user-space state of both processes is identical, except for the return value of fork, which indicates if a process is the original or the copy, which are called parent and child process respectively.

If we have a look at the following example program, fork.c :

#include <stdio.h>
#include <unistd.h>

int main()
{
  int x = 42;

  switch (fork()) {
  case -1:
    perror("fork failed");
    return 1;
    break;
  case 0:
    x = 123;
    printf("this is a new child process:\n");
    printf("  pid=%d, value of x=%d @ memory address 0x%lx\n\n"
, getpid(), x, &x);
    break;
  default:
    sleep(2);
    printf("this is the original parent process:\n");
    printf("  pid=%d, value of x=%d @ memory address 0x%lx\n",
getpid(), x, &x);
    break;
  }
  return 0;
}

Which we can compile with gcc -o fork fork.c and get the following execution:

 pi@raspberrypi ~ $ ./fork
this is a new child process:
  pid=6103, value of x=123 @ memory address 0xbee006d4

this is the original parent process:
  pid=6102, value of x=42 @ memory address 0xbee006d4
pi@raspberrypi ~ $ 

What we can see is that 2 different branches of the switch statement have been executed, but each in its own process. The parent process has entered the fork call, but two of them have returned from it. Based on the return code of fork(), they can self-identify themselves as either the original parent process or a new child copy of it and take different actions based on that.

We can also see that the variable x, which existed before the fork() in the parent now exists in both processes, even at exactly the same address location in memory!  But changes to the variable in one process is not reflected in the other one - even though they appear to share the same memory, they are in fact separate and isolated from each other.

The example below shows the “family tree” of all the processes for user pi at this moment:

pi@raspberrypi ~ $ ps fx
  PID TTY      STAT   TIME COMMAND
 7983 ?        S      0:00 sshd: pi@pts/1   
 7984 pts/1    Ss     0:01  \_ -bash
 8044 pts/1    R+     0:00      \_ ps fx
 7961 ?        S      0:00 sshd: pi@pts/0   
 7962 pts/0    Ss     0:01  \_ -bash
 8042 pts/0    S+     0:00      \_ ./fork
 8043 pts/0    Z+     0:00          \_ [fork] <defunct>

We can see the 2 processes from the fork example with the child having already exited and being in “zombie” state, waiting for its return code to be collected by a parent. The parent of our fork-parent is a bash shell (see previous tutorial). In fact, bash runs other programs by forking itself and then replacing the executable image of the child with the new command (using the exec() system call). Some processes are attached to a terminal for an interactive user session, still named TTY from the days, when most terminal session were teletype printer terminals. Some like the sshd processes are background processes, also called servers or daemon.

CPU Time-sharing

We can also see that only one process is ready to run right now - the ps tool itself. All others are sleeping and waiting for some sort of event, for example user input, a timeout or some system resource to become available. Many processes on Linux spend the vast majority of their time waiting for something without using any CPU resources.

pi@raspberrypi ~ $ ps fx
  PID TTY      STAT   TIME COMMAND
  7961 ?        S      0:00 sshd: pi@pts/0   
 7962 pts/0    Ss     0:02  \_ -bash
 8170 pts/0    R+     0:12      \_ yes
 8171 pts/0    R+     0:13      \_ gzip

The above is a nonsensical example of  a CPU intensive job by running yes | gzip > /dev/null . In this case, there are now 2 processes actively competing for the CPU, which means that the Linux kernel will alternately let them execute for a bit before interrupting them and allow some other active process to take a turn.

For a more dynamic view of the process state, we can also use the top command, which while running periodically queries the state of all processes and ranks them by top CPU usage or some other metric:

top - 22:29:59 up 18 days,  8:12,  2 users,  load average: 1.70, 1.46, 0.91
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s): 91.6 us,  8.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:    220592 total,   202920 used,    17672 free,    24444 buffers
KiB Swap:   102396 total,       48 used,   102348 free,    95828 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND 
 8171 pi        20   0  2244  816  404 R  52.0  0.4   6:49.68 gzip
 8170 pi        20   0  3156  496  428 S  45.9  0.2   5:58.95 yes
 8185 pi        20   0  4652 1432 1028 R   1.6  0.6   0:06.81 top
 7983 pi        20   0  9852 1636  996 S   0.6  0.7   0:02.77 sshd
    1 root      20   0  1840  668  572 S   0.0  0.3   1:04.67 init
...                                                                                                                                        
         
There are currently 5 processes more or less active: yes & gzip doing busy work, top periodically displaying the processes state and sshd sending that output data over SSH to a remote computer.

Virtual Memory

Besides time-sharing the CPU between all the processes which compete for it, the Linux operating system kernel also manages another important resources: main memory.

As we remember from the fork example, both processes seem to access the same address in main memory, but find there different values! What seems like magic is the concept of virtual memory, a crucial component of a multi-process system.

With the help of the Memory Management Unit (MMU), a special component in the CPU hardware, the operating system maps a virtual address space for each process to the real available memory and creating the illusion that each process has  4 gigabytes of memory (the full range of a 32bit address) at its disposal, when in reality the entire Raspberry Pi only has 512 megabytes of physical main memory. Given there were 77 processes in our system, how can 77 times 4 gigabytes add up to 512 megabytes? The trick is, does memory really have to be there if nobody is accessing it?

The system partitions the 4GB addressable memory space into thousands of small segments, called pages. When a process tries to access a particular address, the hardware intercepts the access and lets the OS intervene and quickly put some real memory there, if there isn’t already. This procedure is called a page fault. Depending on what is supposed to be on this page, the operating system has a few options on how to do this. If this page is supposed to be part of the executable binary stored on disk, then the OS can simple get an empty page of memory from its pool and fill it with the corresponding data from disk. If the process needs more memory for its dynamic data (e.g. for the heap or stack of the executing program), it just get an empty page. Things get more tricky when the operating system runs out of empty pages. In this case it will try to take away some rarely used ones from another process - if they were mapped from a file, it can simply throw away the data as it already exists on disk anyway, if it was dynamic data, it has to write the data to a special file, which is called the system swap-file, used for swapping data in and out of main memory.

Swapping is a last resort and often degrades the performance of a system beyond being useful, as disk is so much slower than main memory. But it prevents the system from crashing allows the administrator to somehow reduce the load.

Fortunately, most processes use a lot less memory than their 4GB address space. Each process contains the static executable code and data mapped from the program file on disk, some regions where it stores its dynamic data (e.g. that variable “x”) and some space to map in shared libraries and other resources. For the rest, the address space can be as empty as outer space.

Top or ps can be used to look at the memory state of a process. In the example output of top above, we can see that gzip is currently using in some way 2’244KB of its 4GB address space. Out of which only 816KB are currently mapped into real physical memory, plus another  404KB of memory shared with other processes, e.g. for using common shared system libraries.

We can also use ps to show many possible output fields, in particular here major and minor page-faults. Major faults require loading from disk, while for minor ones the data is either volatile or still in memory (e.g. from a previous execution of the same command).

pi@raspberrypi ~ $ ps x -o vsz,rsz,%mem,%cpu,maj_flt,min_flt,cmd
   VSZ   RSZ %MEM %CPU  MAJFL  MINFL CMD
  9852   364  0.1  0.0     32    727 sshd: pi@pts/0   
  6336  1244  0.5  0.0     54  10353 -bash
  9852   356  0.1  0.0     76   1167 sshd: pi@pts/1   
  6292  1264  0.5  0.0    156  11943 -bash
  3172   500  0.2  0.1      4    315 cat /dev/random
  2244   588  0.2  0.1      2    333 gzip
  3508   796  0.3  0.1      3    400 grep --color=auto asdf
  4092   932  0.4  0.0      0    358 ps x -o vsz,rsz,%mem,%cpu,maj_flt,min_flt,cmd

If we are interested in a summary of process performance metrics of a particular executable, we can also use time (install with sudo apt-get install time). Because it is shadowed by a built-in bash function with the same name, we need to run it with its fully qualified path:

pi@raspberrypi ~ $ /usr/bin/time -v gcc -o fork fork.c
Command being timed: "gcc -o fork fork.c"
User time (seconds): 0.60
System time (seconds): 0.20
Percent of CPU this job got: 53%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6624
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 85
Minor (reclaiming a frame) page faults: 4907
Voluntary context switches: 194
Involuntary context switches: 214
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

We can see that this command only reaches about a 50% CPU utilization due to waiting for disk I/O - partially caused by the 85 page faults requiring to read in executable code from disk. Running the same command a second time, yields a 95% CPU utilization without any page faults, as the kernel hasn’t reused the pages yet from the last time.