Monday 7 December 2009

Modular vs. monolithic build systems

I have spent some time packaging software as Debian packages. While the Debian packaging system has its faults, it has the nice property that it is modular. This post is an attempt to articulate what aspects of Debian packaging -- and other modular build systems -- are worth replicating, and why it is worth co-operating with these systems rather than ignoring them or working around them.

What is a modular build?

  1. A modular build consists of a set of modules,
  2. each of which can be built separately.
  3. Each module's build produces some output (a directory tree).
  4. A module may depend on the outputs of other modules, but it can't reach inside the others' build trees.
  5. There is a common interface that each module provides for building itself.
  6. The build tool can be replaced with another. The description of the module set is separate from the modules themselves.

What is a non-modular, monolithic build?

  1. A monolithic build consists of one big build tree.
  2. Any part of the build can reference any other part via relative filenames.
  3. It might consist of multiple checkouts from version control, but they have to be checked out to specific directory tree locations (as in the Chromium build).

Some examples of modular builds:

  • Build systems/tools:
    • Debian packages (and presumably RPMs too)
    • JHBuild
    • Zero-Install
    • Nix
  • Module interfaces:
    • GNU autotools (./configure && make && make install)
    • Python distutils (setup.py)
  • Software collections:
    • GNOME
    • Xorg (7.0 onwards)
    • Sugar
Examples of monolithic builds:
  • XFree86 (and Xorg 6.9): Before Xorg was modularised, there was a big makefile that built everything, from Xlib to the X server to example X clients.
  • Chromium web browser: This uses a tool called "gyp" to generate a big makefile which compiles individual source files from several libraries, including WebKit, V8 and the Native Client IPC library. It ignores WebKit's own build system.
  • Native Client: One SCons build builds the core code as well as the NPAPI browser plugin and example code; it needs to know how to cross-compile NaCl code as well as compile host system code. Another makefile builds the compiler toolchain from tarballs and patch files that are checked into SVN.
  • CPython: The standard library builds many Python C extensions.

Modular build systems offer a number of advantages:

  • You can download and build only the parts you need. This can be a big help if some modules are huge but seldom change while the modules you work on are small and fast to build.
  • Some systems (such as Debian packages) give you binary packages so you don't need to build the dependencies of the modules that you want to work on. JHBuild doesn't provide this but it could be achieved with a little work.
  • Dependencies are clearer.
  • External interfaces are clearer too.
  • It is possible to change one module's version independently of other modules (to the extent that differing versions are compatible).
  • They are relatively easy to use in a decentralised way. It is easy to create a new version of a module set which adds or removes modules.
  • You don't have to check huge dependencies into your version control system. Some projects check in monster tarballs or source trees, which dwarf the project's own code. If you avoid this practice you will make it easier for distributions to package your software.

The two categories can coexist: Each module may internally be a monolithic build which can be arbitrarily complex. Autotools is an example of that. This is not too bad because at least we have contained the complexity within the module. The layer on top, which connects modules together, can be relatively simple.

Despite its faults, autotools is very amenable to being part of a modular build:

  • The build tree does not need to be kept around after doing "make install".
  • Output can be directed using "--prefix=foo" and "make install DESTDIR=foo".
  • Inputs can be specified via --prefix and PATH and other environment variables.
  • The build tree can be separate from the source tree. It's easy to have multiple build trees with different build options.

The systems I listed as modular all have their own problems. The main problem with Debian packages is that they are installed system-wide, which requires root access and makes it difficult to install multiple versions of a package. It is possible to work around this problem using chroots. JHBuild, Zero-Install and Nix avoid this problem. JHBuild and Zero-Install are not so good at capturing immutable snapshots of package sets. Nix is good at capturing snapshots, but Nix makes it difficult to change a library without rebuilding everything that uses it.

Despite these problems, these systems have a nice property: they are layered. It is possible to mix and match modules and replace the build layer. Hence it is possible to build Xorg and GNOME either with JHBuild or as Debian packages. In turn, there is a choice of tools for building Debian source packages. There is even a tool for making sets of Debian packages from JHBuild module descriptions.

These systems do not interoperate perfectly, but they do work and scale.

There are some arguments for having a monolithic system. In some situations it is difficult to split pieces of software into separately-built modules. For example, Plash-glibc is currently built by symlinking the source for the Plash IPC library into the glibc source tree, so that glibc builds it with the correct compiler flags and with the glibc internal header files. Ideally the IPC library would be built as a separate module, but for now it is better not to.

Still, if you can find good module boundaries, it is a good idea to take advantage of them.

Tuesday 1 December 2009

read_file() and write_file() in Python

Here are two functions I would like to see in the Python standard library:
def read_file(filename):
  fh = open(filename, "r")
  try:
      return fh.read()
  finally:
      fh.close()

def write_file(filename, data):
  fh = open(filename, "w")
  try:
      fh.write(data)
  finally:
      fh.close()

In the first case, it is possible to do open(filename).read() instead of read_file(filename). However, this is not good practice because if you are using a Python implementation that uses a garbage collection implementation other than reference counting, there will be a delay before the Python file object, and the file descriptor it wraps, are freed. So you will temporarily leak a scarce resource.

I have sometimes argued that GC should really know about file descriptors: If the open() syscall fails with EMFILE or ENFILE, the Python standard library should run GC and retry. (However this is probably not going to work for reliably receiving FDs in messages using recvmsg().) But in the case of read_file() it is really easy to inform the system of the lifetime of the FD using close(), so there is no excuse not to do so!

In the second case, again it is possible to do open(filename, "w").write(data) instead of write_file(filename, data), but in the presence of non-refcounting GC it will not only temporarily leak an FD, it will also do the wrong thing! The danger is that the file's contents will be only partially written, because Python file objects are buffered, and the buffer is not flushed until you do close() or flush(). If you use open() this way and test only on CPython, you won't find out that your program breaks on Jython, IronPython or PyPy (all of which normally use non-refcounting GC, as far as I know).

Maybe CPython should be changed so that the file object's destructor does not flush the buffer. That would break some programs but expose the problem. Maybe it could be optional.

read_file() and write_file() calls also look nicer. I don't want the ugliness of open-coding these functions.

Don't write code like this:

fh = open(os.path.join(temp_dir, "foo"), "w")
fh.write(blah)
fh.close()
fh = open(os.path.join(temp_dir, "bar"), "w")
fh.write(stuff)
fh.close()

Instead, write it like this:

write_file(os.path.join(temp_dir, "foo"), blah)
write_file(os.path.join(temp_dir, "bar"), stuff)

Until Python includes these functions in the standard library, I will copy and paste them into every Python codebase I work on in protest! They are too trivial, in my opinion, to be worth the trouble of depending on an external library.

I'd like to see these functions as builtins, because they should be as readily available as open(), for which they are alternatives. However if they have to be imported from a module I won't complain!