Sunday, May 3, 2009

Choosing the Operating System (SDI 07 part I)

For the first part of the discussion on how to set up a minimal software development infrastructure for a startup project, using only open-source software, we are looking at the lowest layer in the technology stack - hardware and operating system.

The first obvious reason for choosing an operating system for this development support server would be familiarity. If there is a particular OS or distribution the administrator is most familiar and comfortable with, this should probably be the most significant argument for choosing it.

At the time of this experiment, I had not worked closely with any particular OS for a few years, so the choice would be based on what I could most likely set up without much of a learning curve and where I could most easily get help when I ran into problems.

The most obvious choice for an open-source operating system at this point is Linux, which runs on pretty much anything with a CPU - including almost any commodity PC hardware. For server platforms, my other preferred open-source operating system has typically been FreeBSD - which doesn't try to be anything but a rock-solid server platform - but it is a lot pickier when it comes to hardware and software support.

The machine used for this experiment was going to be my mini tabletop server from AOpen, which is not a very typical server hardware platform. Linux would probably be my best bet to install and run on this type of hardware without too much trouble.

After choosing Linux, the next question is: which distribution?

An ambitious software project might have a development life-cycle on the order of 12 to 36 months, which is a very long time in the life of a typical Linux distribution. We would like to assume that key systems like version control could be set up at the beginning of the project and would not need to be touched or upgraded again during the most crucial initial development phase. If we need to do any upgrades down the line, we would most likely want them to be as minimal as possible. From past experience (admittedly mostly with RPM), the package management of most Linux distributions breaks down when trying to do point upgrades on a several-year-old system that has not been kept up to date, sometimes just because packages are not archived for that long on a distribution's website.

All of the major Linux distributions use some form of package management system for installing and upgrading optional software packages and for keeping track of the dependencies between packages. The most popular package management formats are RPM and DEB, which are both based on distributing and installing binary packages. The odd one out among the top Linux distributions is Gentoo Linux, whose package management system, portage, is based on locally compiled source packages.

I am intrigued by the Gentoo portage package management system not because of the usually claimed benefits like greater speed or better optimization, but because of its potential to reduce non-essential dependencies. Dependencies are often considered the root of all evil in software package management...

Most open-source software packages themselves are extremely portable. They often build and compile from source not only on any Linux distribution but also on most other Unix and Unix-like systems, sometimes even including MS Windows. One of the secrets behind this flexibility is, for example, the GNU autotools, which allow a package to probe and discover the existing system configuration and to configure its build to account for its current environment.
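To make the probing idea a bit more concrete, here is a minimal sketch in Python of what a configure-style check does in spirit. It is not how the autotools are implemented (those generate shell scripts), and the feature names and the tools they map to are purely hypothetical, picked for illustration only.

```python
import shutil

# Hypothetical mapping: optional feature -> external tool it would need.
# These pairings are assumptions for illustration, not taken from any real package.
OPTIONAL_FEATURES = {
    "compression": "gzip",
    "ssl": "openssl",
    "docs": "doxygen",
}


def probe_tools(names, which=shutil.which):
    """Look up each tool on the PATH; a value of None means it was not found."""
    return {name: which(name) for name in names}


def configure(features=OPTIONAL_FEATURES, which=shutil.which):
    """Enable a feature only if the tool it depends on was discovered."""
    found = probe_tools(set(features.values()), which)
    return {feature: found[tool] is not None for feature, tool in features.items()}


if __name__ == "__main__":
    # Mimic the familiar "checking for ..." lines of a configure run.
    for feature, enabled in sorted(configure().items()):
        print(f"checking for {feature}... {'yes' if enabled else 'no'}")
```

The result depends entirely on the machine it runs on, which is exactly the point: the same source adapts itself to whatever it finds.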

While most open-source software packages may have essential dependencies without which they cannot work, there are many optional dependencies which may be disabled if not needed. Once a package is built for a particular environment, much of that environment becomes an accidental or spurious dependency of the resulting binary, which needs to be satisfied if we distribute a binary package.
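The difference between essential and accidental dependencies can be sketched with a toy model. This is only an illustration under my own assumptions: the package, its features and all library names below are made up, and real package managers track far more than this.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Package:
    name: str
    essential: Set[str]                                           # always required
    optional: Dict[str, Set[str]] = field(default_factory=dict)   # feature -> extra deps


def source_deps(pkg, enabled):
    """Dependencies of a local source build with only the chosen features enabled."""
    deps = set(pkg.essential)
    for feature in enabled:
        deps |= pkg.optional[feature]
    return deps


def binary_deps(pkg, built_with):
    """A binary package inherits every dependency the packager's build enabled."""
    return source_deps(pkg, built_with)


def spurious_deps(pkg, built_with, needed):
    """Dependencies a binary install drags in beyond what this site actually needs."""
    return binary_deps(pkg, built_with) - source_deps(pkg, needed)


# Hypothetical package; none of these names refer to real software.
tool = Package(
    name="vcs-tool",
    essential={"libcore", "libdb"},
    optional={"web": {"libhttp", "libssl"}, "scripting": {"libscript"}},
)

if __name__ == "__main__":
    # The packager enabled everything; this server only needs the "web" feature.
    print(sorted(spurious_deps(tool, built_with={"web", "scripting"}, needed={"web"})))
```

In this model the binary package requires `libscript` even though nothing on the server uses the scripting feature, while a local build with only `web` enabled never asks for it.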

This is just a hunch, but I would think that a source-based package manager like portage should be able to get away with far fewer dependencies among packages than even the best ones based on binary packages. A non-scientific sample comparison for Subversion between the gentoo-portage repository and Debian's package system (the most popular one) seems to support that intuition:
the portage package has 4-5 mandatory direct dependencies and a few optional ones, which can be enabled or disabled during the build, while the Debian package is broken up into three different ones (subversion, libsvn1 and subversion-tools) with a dozen or more direct dependencies, not including some of the optional features from the portage package.

To further test this hypothesis, I installed and upgraded a few packages on my now roughly two-year-old, out-of-date Gentoo Linux system without much of a problem. I saw none of the typical problems, such as packages that are no longer available, that are incompatible with the system, or that cause a cascade of upgrades which might break other existing packages unless they are upgraded as well.

On the other hand, since I did not run the control experiment with a leading binary distribution for comparison, who knows whether it might have worked out just as easily there.