Since our Internet connection had recently been down a few times, when I did notice it and I was getting curious about how frequently this was happening.
Besides, it might be interesting to get a long term record on the stability and "quality" of our ISP and maybe even compare it with the results from users of competing ISPs.
What does it mean for Internet access to be working? Is it enough to check that the link between our home router and the ISPs access router is up and working (a DOCSIS cable plant in my case). Or should we include end-end application layer scenarios like the ability to get my email or my files from some place "in the cloud"?
But what exactly represents the "the cloud" or "the Internet"? In reality they are massively large distributed systems at a global scale, consisting of millions of components, points of failure and recovery.
For the purpose of these measurements, we need to choose a few relevant and representative destinations as being the stand-ins for the whole Internet. Since many people seem to think, that Google, Facebook, Amazon or other top-tier web properties ARE the Internet, we can as well use them as a reasonable proxy for the Internet. Depending on the mix of Internet or "cloud" services we use on a daily basis, it should be easy to come up with a short-list of destinations and services which we particularly care about. We also need to be very conscious of the load, these measurements are putting on the services and whether their owners would likely object. E.g. choosing very popular services which already have very high traffic loads, would help to mitigate the additional impact of the probes. Running these measurements should have less impact on any part of the Internet infrastructure than leaving a few browser tabs with AJAX apps open over night...
Without access to the low-level network and systems monitoring the most basic way to judge Internet connectivity would require to periodically probe whether some destination or service is currently reachable. There are quite a few open-source network and service monitoring tools available, but for our goal of long-term automated connectivity testing, the most natural choice might be SmokePing, an open-source tool, byt the author or MRTG and RRDtool, very popular among IP network admins.
SokePing gets its name from plotting latency, jitter (variation in latency) and packet loss in a single graph, drawing jitter as a smoky cloud around the line of median round trip time, as in the example below:
A low-cost, low-power Raspberry Pi in headless mode, which can be left in headless mode attached to the Internet gateway, would seem like an ideal platform for such monitoring & measurements. And fortunately, SmokePing already comes pre-packaged with all its dependencies (Perl, Apache etc.) for Raspbian, so installing it is as easy as:
sudo apt-get install smokeping
After that, we can edit the config files in /etc/smokeping/config.d/ to determine what kind of probing to run and how to display the results in the web interface. All the configuration settings are documented here.
The smokePing prober did not start properly in the default configuration on my system, because of a missing reference to sendmail. Removing the sendmail line from /etc/smokeping/config.d/pathnames did resolve the problem. The Raspberry Pi has more than enough compute power to run the smokePing prober for a small configuration as outlined below. Running the web front-end, which generates the graphs from the RRD timeseries database, can be a bit taxing on one's patience as rendering a each page takes about 10 to 15 seconds at 100% CPU load.
Measurement SetupThe most basic, low-level and least intrusive way to do connectivity probing in IP networks is using the ICMP echo protocol implemented in the kernel as part of the IP stack and by the ping network diagnostic utility. As targets for the ping probes, we choose the front-door address of some major Internet companies: www.google.com, www.facebook.com and www.yahoo.com. All of these are logical addresses, which map to a long list are heavily replicated and geographically dispersed physical machines, assigned by a DNS load-balancer based on availability and proximity. These 3 companies and sites represent a good part of the Internet traffic and are unlikely to go down, specially not all 3 at the same time. Should the pings to all these 3 destinations suddenly start failing, the outage would most likely be our Internet access connection of the access network of our ISP. SmokePing is configured to use 10 probes to each destination every 5 min to collect packet delay and loss information.
In order to locate the actual server to probe, ping relies of the domain name system (DNS), itself a highly replicated and distributed infrastructure at the very core of the Internet. In order to isolate IP connectivity from name service issues, we are setting up a secondary set of probes using DNS queries directly to the physical IP address of the primary name server of our ISP and to the well-know 220.127.116.11 address of the Google public DNS service (itself heavily replicated using BGP anycast routing). Since this is starting to hit more complex services in application space, we are reducing the polling rate to 5 probes each every 15 minutes.
Clearly, low latency, latency variance (jitter) and packet loss rates are an important part of network performance, but do not give the full picture. Ideally it would be nice to also measure the available bandwidth, most representative of perceived network "speed". However doing so requires expensive, heavy load probes, which try to saturate the network path to estimate its capacity limit. Doing so regularly in an automated long-term test would seem a bit frivolous and wasteful.
As as small compromise of qualifying how the network performs for common "cloud" services, we are using a http probe, which downloads a copy of the photo above from a public folder in dropbox, a cloud based file storage service and from the static user content server of the Google+ photo service.
Here is the core of the setup for these experiments in /etc/smokeping/config.d/Probes and /etc/smokeping/config.d/Targets respectively:
*** Probes *** + FPing binary = /usr/bin/fping step = 300 pings = 10 +EchoPingDNS binary = /usr/bin/echoping step = 900 pings = 5 +EchoPingHttp binary = /usr/bin/echoping step = 900 pings = 3
*** Targets *** probe = FPing menu = Top title = Raspberry Pi Internet Access Monitor remark = Latency to a few select sites and services in the Internet. + Internet menu = Internet title = Internet Access (Ping) ++ Google title = Google menu = Google host = www.google.com ++ FB title = Facebook menu = Facebook host = www.facebook.com ++ Yahoo title = Yahoo menu = Yahoo host = www.yahoo.com + DNS menu = DNS title = Name Servers ++ gdns title = Google public DNS menu = Google public DNS probe = EchoPingDNS dns_request = www.google.com host = 18.104.22.168 ++ cablecom title = Cablecom DNS menu = Cablecom probe = EchoPingDNS dns_request = www.google.com host = <Your ISPs primary NS IP address> + Cloud menu = Cloud title = Cloud Services ++ dropbox title = Dropbox menu = Dropbox probe = EchoPingHttp host = dl.dropboxusercontent.com port = 80 url = /u/12770892/benchmark/raspberrypi.jpg ++ gusercontent title = Google+ Photo menu = Google probe = EchoPingHttp host = lh4.googleusercontent.com port = 80 url = /UB5Y5yJKtj51bs2asd8kJGjOxwigev7JPQz3g9tw1C0=w614-h801-no