Friday, 16 January 2009

Testing using golden files in Python

This is the third post in a series about automated testing in Python (earlier posts: 1, 2). This post is about testing using golden files.

A golden test is a fancy way of doing assertEquals() on a string or a directory tree, where the expected output is kept in a separate file or files -- the golden files. If the actual output does not match the expected output, the test runner can optionally run an interactive file comparison tool such as Meld to display the differences and allow you to selectively merge the differences into the golden file.

This is useful when

  • the data being checked is large - too large to embed into the Python source; or
  • the data contains relatively inconsequential details, such as boilerplate text or formatting, which might be changed frequently.
As an example, suppose you have code to format some data as HTML. You can have a test which creates some example data, formats this as HTML, and compares the result to a golden file. Something like this:
import os

import golden_test

class ExampleTest(golden_test.GoldenTestCase):

    def test_formatting_html(self):
        obj = make_some_example_object()
        temp_dir = self.make_temp_dir()
        format_as_html(obj, temp_dir)
        self.assert_golden(temp_dir, os.path.join(os.path.dirname(__file__),
                                                  "golden-files"))

if __name__ == "__main__":
    golden_test.main()
By default, the test runs non-interactively, which is what you want on a continuous integration machine, and it will print a diff if it fails. To switch on the semi-interactive mode which runs Meld, you run the test with the option --meld.

Here is a simple version of the test helper (taken from here):

import os
import subprocess
import sys
import unittest

class GoldenTestCase(unittest.TestCase):

    run_meld = False

    def assert_golden(self, dir_got, dir_expect):
        assert os.path.exists(dir_expect), dir_expect
        proc = subprocess.Popen(["diff", "--recursive", "-u", "-N",
                                 "--exclude=.*", dir_expect, dir_got],
                                stdout=subprocess.PIPE)
        stdout, _ = proc.communicate()
        if len(stdout) > 0:
            if self.run_meld:
                # Put expected output on the right because that is the
                # side we usually edit.
                subprocess.call(["meld", dir_got, dir_expect])
            raise AssertionError(
                "Differences from golden files found.\n"
                "Try running with --meld to update golden files.\n"
                "%s" % stdout)
        self.assertEquals(proc.wait(), 0)

def main():
    if sys.argv[1:2] == ["--meld"]:
        GoldenTestCase.run_meld = True
        sys.argv.pop(1)
    unittest.main()
(It's a bit cheesy to modify global state on startup to enable melding, but because unittest doesn't make it easy to pass parameters into tests this is the simplest way of doing it.)

Golden tests have the same sort of advantages that are associated with test-driven development in general.

  • Golden files are checked into version control and help to make changesets self-documenting. A changeset that affects the program's output will include patches that demonstrate how the output is affected. You can see the history of the program's output in version control. (This assumes that everyone runs the tests before committing!)
  • Sometimes you can point people at the golden files if they want to see example output. For HTML, sometimes you can contrive the CSS links to work so that the HTML looks right when viewed in a browser.
  • And of course, this can catch cases where you didn't intend to change the program's output.
My typical workflow is to add a test with some example input that I want the program to handle, and run the test with --meld until the output I want comes up on the left-hand side in Meld. I mark the output as OK by copying it over to the right-hand side. This is not quite test-first, because I am letting the test suggest what its expected output should be. But it is possible to do this in an even more test-first manner by typing the output you expect into the right-hand side.

Other times, one change will affect many locations in the golden files, and adding a new test is not necessary. It's usually not too difficult to quickly eyeball the differences with Meld.

Here are some of the things that I have used golden files to test:

  • formatting of automatically-generated e-mails
  • automatically generated configuration files
  • HTML formatting logs (build_log_test.py and its golden files)
  • pretty-printed output of X Windows messages in xjack-xcb (golden_test.py and golden_test_data). This ended up testing several components in one go:
    • the XCB protocol definitions for X11
    • the encoders and decoders that work off of the XCB protocol definitions
    • the pretty printer for the decoder
On the other hand, sometimes the overhead of having a separate file isn't worth it, and if the expected output is small (maybe 5 lines or so) I might instead paste it into the Python source as a multiline string.

It can be tempting to overuse golden tests. As with any test suite, avoid creating one big example that tries to cover all cases. (This is particularly tempting if the test is slow to run.) Try to create smaller, separate examples. The test helper above is not so good at this, because if you are not careful it can end up running Meld several times. In the past I have put several examples into (from unittest's point of view) one test case, so that it runs Meld only once.
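
For instance, with the helper above you can write each example into its own subdirectory of a single temporary directory and check the whole tree with one assert_golden() call, so there is one diff and at most one Meld run. A sketch, where write_simple_page() and write_table_page() are hypothetical stand-ins for the code under test:

import os

import golden_test

class FormattingGoldenTest(golden_test.GoldenTestCase):

    def test_formatting_examples(self):
        temp_dir = self.make_temp_dir()
        # Each example writes into its own subdirectory, but the whole
        # tree is compared in one go, so Meld is run at most once.
        write_simple_page(os.path.join(temp_dir, "simple"))
        write_table_page(os.path.join(temp_dir, "table"))
        self.assert_golden(temp_dir, os.path.join(os.path.dirname(__file__),
                                                  "golden-files"))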

Golden tests might not work so well if components can be changed independently. For example, if your XML library changes its whitespace pretty-printing, the tests' output could change. This is less of a problem if your code is deployed with tightly-controlled versions of libraries, because you can just update the golden files when you upgrade libraries.

A note on terminology: I think I got the term "golden file" from my previous workplace, where other people were using it, and the term seems to enjoy some limited use judging from Google. "Golden test", however, may be a term that I made up; no-one outside my workplace seems to use it with this meaning.

Sunday, 11 January 2009

On ABI and API compatibility

If you are going to create a new execution environment, such as a new OS, it can make things simpler if it presents the same interfaces as an existing system. It makes porting easier. So,
  • If you can, keep the ABI the same.
  • If you can't keep the ABI the same, at least keep the API the same.

Don't be tempted to say "we're changing X; we may as well take this opportunity to change Y, which has always bugged me". Only change things if there is a good reason.

For an example, let's look at the case of GNU Hurd.

  • In principle, the Hurd's glibc could present the same ABI as Linux's glibc (they share the same codebase, after all), but partly because of a difference in threading libraries, they were made incompatible. Unifying the ABIs was planned, but it appears that 10 years later it has not happened (Hurd has a libc0.3 package instead of libc6).

    Using the same ABI would have meant that the same executables would work on Linux and the Hurd. Debian would not have needed to rebuild all its packages for a separate "hurd-i386" architecture. It would have saved a lot of effort.

    I suspect that if glibc were ported to the Hurd today, it would not be hard to make the ABIs the same. The threading code has changed a lot in the intervening time. I think it is cleaner now.

  • The Hurd's glibc also changed the API: they decided not to define PATH_MAX. The idea was that if there was a program that used fixed-length buffers for storing filenames, you'd be forced to fix it. Well, that wasn't a good idea. It just created unnecessary work. Hurd developers and users had enough on their plates without having to fix unrelated implementation quality issues in programs they wanted to use.
Similarly, it would help if glibc for NaCl (Native Client) could have the same ABI as glibc for Linux. Programs have to be recompiled for NaCl with nacl-gcc to meet the requirements of NaCl's validator, but could the resulting code still run directly on Linux, linked with the regular i386 Linux glibc and other libraries? The problem here is that code compiled for NaCl will expect all code addresses to be 32-byte-aligned. If the NaCl code is given a function address or a return address which is not 32-byte-aligned, things will go badly wrong when it tries to jump to it.

Some background: in normal x86 code, to return from a function you write this:
        ret
This pops an address off the stack and jumps to it. In NaCl, this becomes:
        popl %ecx
        and $0xffffffe0, %ecx
        jmp *%ecx
This pops an address off the stack, rounds it down to the nearest 32 byte boundary and jumps to it. If the calling function's call instruction was not placed at the end of a 32 byte block (which NaCl's assembler will arrange), the return address will not be aligned and this code will jump to the wrong location.

However, there is a way around this. We can get the NaCl assembler and linker to keep a list of all the places where a forcible alignment instruction (the and $0xffffffe0, %ecx above) was inserted, and put this list into the executable or library in a special section or segment. Then when we want to run the executable or library directly on Linux, we can rewrite all these locations so that the sequence above becomes

        popl %ecx
        nop
        nop
        nop
        jmp *%ecx
or maybe even just
        ret
        nop
        nop
        nop
        nop
        nop
We can reuse the relocations mechanism to store these rewrites. The crafty old linker already does something similar for thread-local variable accesses. When it knows that a thread-local variable is being accessed from the library where it is defined, it can rewrite the general-purpose-but-slow instruction sequence for TLS variable access into a faster instruction sequence. The general purpose instruction sequence even contains nops to allow for rewriting to the slightly-longer fast sequence.
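
As a rough illustration, here is what the rewriting step might look like in Python. This is a minimal sketch under two assumptions: the special section stores plain byte offsets of the inserted alignment instructions, and each one was assembled to the 3-byte form of and $0xffffffe0, %ecx (83 e1 e0), so three 1-byte nops keep all other code at the same offsets. The function name and section handling are hypothetical.

ALIGN_INSN = b"\x83\xe1\xe0"  # and $0xffffffe0, %ecx (3-byte encoding)
NOPS = b"\x90\x90\x90"        # three 1-byte nops: same total length

def unpatch_alignment(code, rewrite_offsets):
    # Overwrite each forcible-alignment instruction with nops, turning
    # the popl/and/jmp sequence into a plain indirect return.  Nothing
    # changes size, so no other code moves.
    code = bytearray(code)
    for offset in rewrite_offsets:
        assert code[offset:offset + 3] == ALIGN_INSN, hex(offset)
        code[offset:offset + 3] = NOPS
    return bytes(code)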

This arrangement for running NaCl-compiled code could significantly simplify the process of building and testing code when porting it to NaCl. It can help us avoid the difficulties associated with cross-compiling.

Sunday, 4 January 2009

What does NaCl mean for Plash?

Google's Native Client (NaCl), announced last month, is an ingenious hack to get around the problem that existing OSes don't provide adequate security mechanisms for sandboxing native code.

You can look at NaCl as an interesting combination of OS-based and language-based security mechanisms:

  • NaCl uses a code verifier to prevent use of unsafe instructions such as those that perform system calls. This is not a million miles away from programming language subsets like Cajita and Joe-E, except that it operates at the level of x86 instructions rather than source code.

    Since x86 instructions are variable-length and unaligned, NaCl has to stop you from jumping into an unsafe instruction hidden in the middle of a safe instruction. It does that by requiring that all indirect jumps are jumps to the start of 32-byte-aligned blocks; instructions are not allowed to straddle these blocks.

  • It uses the x86 architecture's little-used segmentation feature to limit memory accesses to a range of address space. So the processor is doing bounds checking for free.

    Actually, segmentation has been used before - in L4 and EROS's "small spaces" facility for switching between processes with small address spaces without flushing the TLB. NaCl gets the same benefit: switching between trusted and untrusted code should be fast; faster than trapping system calls with ptrace(), for example.

Plash is also a hack to get sandboxing, specifically on Linux, but it has some limitations:
  1. it doesn't block network access;
  2. it doesn't limit CPU and memory resource usage, so sandboxed programs can still cause denial of service;
  3. it requires a custom glibc, which can be a pain to build;
  4. it changes the API/ABI that sandboxed programs see in some small but significant ways:
    • some syscalls are effectively disabled; programs must go through libc for these calls, which stops statically linked programs from working;
    • /proc/self doesn't work, and Plash's architecture makes it hard to emulate /proc.
These interface changes mean some programs require patching, e.g. to not depend on /proc. The problems could be addressed with work, but Plash hasn't really caught on, so the manpower isn't there.

NaCl also breaks the ABI - it breaks it totally. Code must be recompiled. However, NaCl provides bigger benefits in return. It allows programs to be deployed in new contexts: on Windows; in a web browser. It is more secure than Plash, because it can block network access and limit the amount of memory a process can allocate. Also, because NaCl mediates access more completely, it would be easier to emulate interfaces like /proc.

NaCl isn't only useful as a browser plugin. We could use it as a general purpose OS security mechanism. We could have GNU/Linux programs running on Windows (without the Linux bit).

Currently NaCl does not support all the features you'd need in a modern OS. In particular, dynamic linking. NaCl doesn't yet support loading code beyond an initial statically linked ELF executable. But we can add this. I am making a start at porting glibc, along with its dynamic linker. After all, I have ported glibc once before!

Wednesday, 17 December 2008

Helper for monkey patching in tests

Following on from my post on TempDirTestCase, here is another Python test case helper which I introduced at work. Our codebase has quite a few test cases which use monkey patching. That is, they temporarily modify a module or a class to replace a function or method for the duration of the test.

For example, you might want to monkey patch time.time so that it returns repeatable timestamps during the test. We have quite a lot of test cases that do something like this:

import time
import unittest

class TestFoo(unittest.TestCase):

    def setUp(self):
        self._old_time = time.time
        def monkey_time():
            return 0
        time.time = monkey_time

    def tearDown(self):
        time.time = self._old_time

    def test_foo(self):
        # body of test case
Having to save and restore the old values gets tedious, particularly if you have to monkey patch several objects (and, unfortunately, there are a few tests that monkey patch a lot). So I introduced a monkey_patch() method so that the code above can be simplified to:
class TestFoo(TestCase):

    def test_foo(self):
        self.monkey_patch(time, "time", lambda: 0)
        # body of test case
(OK, I'm cheating by using a lambda the second time around to make the code look shorter!)

Now, monkey patching is not ideal, and I would prefer not to have to use it. When I write new code I try to make sure that it can be tested without resorting to monkey patching. So, for example, I would parameterize the software under test to take time.time as an argument instead of getting it directly from the time module. (here's an example).
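
For illustration, here is a hedged sketch of that style. Stopwatch is made up for this example; the point is that the clock is injected rather than patched:

import time
import unittest

class Stopwatch(object):

    # Take the clock function as a parameter, defaulting to time.time,
    # so that tests can pass in a fake clock instead of monkey patching.
    def __init__(self, time_func=time.time):
        self._time_func = time_func
        self._start_time = time_func()

    def elapsed_time(self):
        return self._time_func() - self._start_time

class TestStopwatch(unittest.TestCase):

    def test_elapsed_time(self):
        fake_time = [1000.0]
        watch = Stopwatch(time_func=lambda: fake_time[0])
        fake_time[0] += 5
        self.assertEquals(watch.elapsed_time(), 5)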

But sometimes you have to work with a codebase where most of the code is not covered by tests and is structured in such a way that adding tests is difficult. You could refactor the code to be more testable, but that risks changing its behaviour and breaking it. In that situation, monkey patching can be very useful. Once you have some tests, refactoring can become easier and less risky. It is then easier to refactor to remove the need for monkey patching -- although in practice it can be hard to justify doing that, because it is relatively invasive and might not be a big improvement, and so the monkey patching stays in.

Here's the code, an extended version of the base class from the earlier post:

import os
import shutil
import tempfile
import unittest

class TestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def monkey_patch(self, obj, attr, new_value):
        old_value = getattr(obj, attr)
        def tear_down():
            setattr(obj, attr, old_value)
        self._on_teardown.append(tear_down)
        setattr(obj, attr, new_value)

    def monkey_patch_environ(self, key, value):
        old_value = os.environ.get(key)
        def tear_down():
            if old_value is None:
                del os.environ[key]
            else:
                os.environ[key] = old_value
        self._on_teardown.append(tear_down)
        os.environ[key] = value

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()
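
The environment variable helper is used the same way. A hypothetical example, where get_config_dir() stands in for code that reads the environment:

class TestConfig(TestCase):

    def test_config_dir_can_be_overridden(self):
        # os.environ is restored automatically in tearDown().
        self.monkey_patch_environ("APP_CONFIG_DIR", "/nonexistent")
        self.assertEquals(get_config_dir(), "/nonexistent")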

Wednesday, 26 November 2008

Shell features

Here are two features I would like to see in a Unix shell:
  • Timing: The shell should record how long each command takes. It should be able to show the start and stop times and durations of commands I have run in the past.
  • Finish notifications: When a long-running command finishes, the task bar icon for the shell's terminal window should flash, just as instant messaging programs flash their task bar icon when you receive a message. If the terminal window is tabbed, the tab should be highlighted too.
You can achieve the first with the time command, sure, but sometimes I start a command without knowing in advance that it will be long-running. It's hard to add the timer in afterwards. Also, Bash's time builtin stops working if you suspend a job with Ctrl-Z. It would be simpler if the shell collected this information by default.

The second feature requires some integration between the shell and the terminal. This could be done via some new terminal escape sequence or perhaps using the WINDOWID environment variable that gnome-terminal appears to pass to its subprocesses. But actually, I would prefer if the shell provided its own terminal window. There would be more scope for combining GUI and CLI features that way, such as displaying filename completions or (more usefully) command history in a pop-up window.

I have seen a couple of attempts to do that. Hotwire is one, but it is too different from Bash for my tastes. I would like a GUI shell that initially looks and can be used just like gnome-terminal + Bash. Gsh is closer to what I have in mind, but it is quite old, written in Tcl/Tk and C, and not complete.

Saturday, 22 November 2008

TempDirTestCase, a Python unittest helper

I have seen a lot of unittest-based test cases written in Python that create temporary directories in setUp() and delete them in tearDown().

Creating temporary directories is such a common thing to do that I have a base class that provides a make_temp_dir() helper method. As a result, you often don't have to define setUp() and tearDown() in your test cases at all.

I have ended up copying this into different codebases because it's easier than making the codebases depend on an external library for such a trivial piece of code. This seems to be quite common: lots of Python projects provide their own test runners based on unittest.

Here's the code:

import shutil
import tempfile
import unittest

class TempDirTestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()
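
Usage looks something like this; write_output() is a hypothetical stand-in for the code under test:

import os

class ExampleTest(TempDirTestCase):

    def test_creates_output_file(self):
        temp_dir = self.make_temp_dir()
        # The directory is deleted automatically in tearDown().
        output_path = os.path.join(temp_dir, "output.txt")
        write_output(output_path)
        self.assertTrue(os.path.exists(output_path))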

Monday, 27 October 2008

Making relocatable packages with JHBuild

I have been revisiting an experiment I began back in March with building GNOME with JHBuild. I wanted to see if it would be practical to use JHBuild to package all of GNOME with Zero-Install. The main issue with packaging anything with Zero-Install is relocatability.

If you build and install an autotools-based package with

./configure --prefix=/FOO && make install
the resulting files installed under /FOO will often have the pathname /FOO embedded in them, sometimes in text files, other times compiled into libraries and executables. This is a problem for Zero-Install because it runs under a normal user account and wants to install files into a user's home directory under ~/.cache. Currently if a program is to be packaged with Zero-Install it must be relocatable via environment variables. Compiling pathnames in is no good (at least without an environment variable override) because you don't know in advance where the program will be installed.

I found a few cases where pathnames get compiled in:

  • text: pkg-config .pc files
  • text: libtool .la files
  • text: shell scripts generated from .in files, such as gtkdocize, intltoolize and gettextize
  • binary: rpaths added by libtool

It is possible to handle these individual cases. Zero-Install's make-headers tool will fix up pkg-config .pc files. libtool .la files can apparently just be removed on Linux without any adverse effects. libtool could be modified to not use rpaths (unfortunately --disable-rpath doesn't seem to work), which are overridden by LD_LIBRARY_PATH anyway. gtkdocize et al could be modified. But that sounds like a lot of work. I'd like to get something working first.

In revisiting this I hoped that the only cases that would matter would be text files. It would be easy to do a search and replace inside text files to relocate packages. The idea would be to build with

./configure --prefix=/FAKEPREFIX
make install DESTDIR=/tempdest
and then rewrite /FAKEPREFIX to (say) /home/fred/realprefix. In a text file, changing the length of a pathname and changing the size of the file usually doesn't matter, but doing this to an ELF executable would completely screw the executable up. This search-and-replace trick would be a hack, but it would be worth trying.
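
Here is a minimal sketch of that search-and-replace pass (Python 2 string semantics; the crude NUL-byte test stands in for real binary detection):

import os

def rewrite_text_files(root_dir, old_prefix, new_prefix):
    # Walk the installed tree, replacing the fake prefix in text files.
    # Changing a file's length is fine here because these are text
    # formats (.pc files, shell scripts and so on).
    for dir_path, dir_names, file_names in os.walk(root_dir):
        for name in file_names:
            path = os.path.join(dir_path, name)
            data = open(path, "rb").read()
            if "\0" in data:
                continue  # looks binary: leave it alone
            if old_prefix in data:
                open(path, "wb").write(data.replace(old_prefix, new_prefix))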

It turned out that Evince (which I was using as a test case) embeds the pathname /FAKEPREFIX/share/applications/evince.desktop in its own executable, and if this file doesn't exist, it segfaults on startup.

Then it occurred to me that I could rewrite filenames inside binary files without changing the length of the filename: just pad the filenames out to a fixed length at the start.

So the idea now is to build with something like:

./configure --prefix=/home/bob/builddir/gtk-XXXXXXXXXXXXXXXXXXX
make install
and, when installing the files on another machine, rewrite
/home/bob/builddir/gtk-XXXXXXXXXXXXXXXXXXX
to
/home/fred/.cache/0install.net/gtk-XXXXXXX

Just make sure you start off with enough padding to allow the package to be relocated to any path a user is likely to use in practice.

This is even hackier than rewriting filenames inside text files, but it's very simple!
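
A sketch of that rewriting step, assuming the real path is no longer than the padded build prefix. Padding the new prefix with trailing slashes keeps every embedded string the same length, and the kernel treats repeated slashes in a pathname the same as a single one:

def rewrite_binary_file(path, old_prefix, new_prefix):
    # Same-length replacement: pad the new prefix with '/' up to the
    # length of the padded build prefix, so the file's size and layout
    # (and any offsets into it) are unchanged.
    assert len(new_prefix) <= len(old_prefix), "not enough padding"
    padded = new_prefix + "/" * (len(old_prefix) - len(new_prefix))
    data = open(path, "rb").read()
    open(path, "wb").write(data.replace(old_prefix, padded))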

This is partly inspired by Nix, which does something similar, but with a bit more complexity. Nix will install a package under (something like) /nix/store/<hash>, where <hash> is (roughly) a cryptographic hash of the binary package's contents. But packages like Evince contain embedded filenames referring to their own contents, so Nix will build it with:

./configure --prefix=/nix/store/<made-up-hash>
make install
where <made-up-hash> is chosen randomly. Afterwards, the output is rewritten to replace <made-up-hash> with the real hash of the output, but there is some cleverness to stop <made-up-hash> from affecting the real hash.

(Earlier versions of Nix used the hash of the build input to identify packages rather than the hash of the build output. This avoided the need to do rewriting but didn't allow a package's contents to be verified based on its hash name.)

The fact that Nix uses this scheme successfully indicates that filename rewriting in binaries works, and filenames are not being checksummed or compressed or encoded in weird ways, which is good.

My plan now is:

  • Extend JHBuild to build packages into fixed-length prefixes and produce Zero-Install feeds. My work so far is on this Bazaar branch.
  • Extend Zero-Install to do filename rewriting inside files in order to relocate packages.