
Monday, 31 August 2009

How to do model checking of Python code

I read about eXplode in early 2007. The paper describes a neat technique, inspired by model checking, for exhaustively exploring all execution paths of a program. The difference from model checking is that it doesn't require a model of the program to be written in a special language; it can operate on a program written in a normal language. The authors of the paper use the technique to check the correctness of Linux filesystems, and version control systems layered on top of them, in the face of crashes.

The basic idea is to have a special function, choose(), that performs a non-deterministic choice, e.g.

   choose([1, 2, 3])

Conceptually, this forks execution into three branches, returning 1, 2 or 3 in each branch.

The program can call choose() multiple times, so you end up with a choice tree.

The job of the eXplode check runner is to explore that choice tree, and to try to find sequences of choices that make the program fail.

eXplode does not try to snapshot the state of the program when choose() is called. (That's what other model checkers have tried to do, and it's what would happen if choose() were implemented using the Unix fork() call.)

Instead, eXplode restores the state of the program by running it from the start and replaying a previously-recorded sequence of choices by returning them one-by-one from choose(). A node in the choice tree is represented by the sequence of choices required to get to it, and so to explore that node's subtree the check runner will repeatedly run the program from the start, replaying the choices for that subtree. The check runner does a breadth-first traversal of the choice tree, so it has a queue of nodes to visit, and each visit will add zero or more subnodes to the queue. (See the code below for a concrete implementation.)

This replay technique means that the program must be deterministic, otherwise the choice tree won't get explored properly. All non-determinism must occur through choose().
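
For instance, here is a minimal, hypothetical sketch (the function name and the delays are made up) of routing what would otherwise be a call into the random module through choose():

def pick_retry_delay(chooser):
    # Instead of random.choice([0, 1, 2]), ask the chooser, so that the
    # check runner can enumerate and replay every branch.
    return chooser.choose([0, 1, 2])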

There seem to be two situations in which you might call choose():

  1. You can call choose() from your test code to generate input.

    In eXplode, the test program writes files and tries to read them back, and it uses choose() to pick what kind of writes it does.

    So far I have used choose() in tests to generate input in two different ways:

    In one case, I started with valid input and mangled it in different ways, to check the error handling of the software-under-test. The test code used choose() to pick a file to delete or line to remove. The idea was that the software-under-test should give a nicely-formatted error message, specifying the faulty file, rather than a Python traceback. There was only a bounded number of faults to introduce so the choice tree was finite.

    In another case, I built up XML trees to check that the software under test could process them. In this case the choice tree is infinite, so I decided to prune it arbitrarily at a certain depth so that the test could run as part of a reasonably fast test suite. (A small sketch of this kind of input generation appears after this list.)

    In both these cases, using the eXplode technique wasn't strictly necessary. It would have been possible to write the tests without choose(), perhaps by copying the inputs before modifying them. But I think choose() makes the test more concise, and not all objects have interfaces for making copies.

  2. You can also call choose() from the software-under-test to introduce deliberate faults, or from a fake object that the software-under-test is instantiated with.

    In eXplode, the filesystem is tested with a fake block device which tries to simulate all the possible failure modes of a real hard disc (note that Flash drives have different failure modes!). The fake device uses choose() to decide when to simulate a crash and restart, and which blocks should be successfully written to the virtual device when the crash happens. (A small sketch of this fault-injection pattern follows the check-runner implementation below.)

    You can also wrap malloc() so that it non-deterministically returns NULL in order to test that software handles out-of-memory conditions gracefully. Testing error paths is difficult to do and this looks like a good way of doing it.

    In these cases, the calls to choose() can be buried deeply in the call stack, and the eXplode check runner is performing an inversion of control that would be hard to do in a conventional test case.
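
Here is a hypothetical sketch of the second kind of input generation mentioned above (the element names and the pruning depth are made up); the check function would be run by the check runner shown below:

import xml.etree.ElementTree as ET


def build_tree(chooser, depth=0):
    element = ET.Element(chooser.choose(["a", "b"]))
    # Prune the otherwise-infinite choice tree at a fixed depth.
    if depth < 2 and chooser.choose([True, False]):
        element.append(build_tree(chooser, depth + 1))
    return element


def check_xml_handling(chooser):
    tree = build_tree(chooser)
    # The real test would feed the tree to the software under test; as a
    # stand-in, just check that it serialises without raising.
    ET.tostring(tree)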

Here's an example of how to implement the eXplode technique in Python:


import unittest


# If anyone catches and handles this, it will break the checking model.
class ModelCheckEscape(Exception):

    pass


class Chooser(object):

    def __init__(self, chosen, queue):
        self._so_far = chosen
        self._index = 0
        self._queue = queue

    def choose(self, choices):
        if self._index < len(self._so_far):
            choice = self._so_far[self._index]
            if choice not in choices:
                raise Exception("Program is not deterministic")
            self._index += 1
            return choice
        else:
            for choice in choices:
                self._queue.append(self._so_far + [choice])
            raise ModelCheckEscape()


def check(func):
    queue = [[]]
    while len(queue) > 0:
        chosen = queue.pop(0)
        try:
            func(Chooser(chosen, queue))
        except ModelCheckEscape:
            pass
        # Can catch other exceptions here and report the failure
        # - along with the choices that caused it - and then carry on.


class ModelCheckTest(unittest.TestCase):

    def test(self):
        got = []

        def func(chooser):
            # Example of how to generate a Cartesian product using choose():
            v1 = chooser.choose(["a", "b", "c"])
            v2 = chooser.choose(["x", "y", "z"])
            # We don't normally expect the function being tested to
            # capture state - in this case the "got" list - but we
            # do it here for testing purposes.
            got.append(v1 + v2)

        check(func)
        self.assertEquals(sorted(got),
                          sorted(["ax", "ay", "az",
                                  "bx", "by", "bz",
                                  "cx", "cy", "cz"]))


if __name__ == "__main__":
    unittest.main()
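
And here is a hypothetical sketch of the second situation described above, where a fake object injects faults via choose(). FlakySocket, Greeter and SendFailed are made-up names; the point is that the call to choose() is buried inside the fake object rather than in the test code itself:

class SendFailed(Exception):

    pass


class FlakySocket(object):

    def __init__(self, chooser):
        self._chooser = chooser
        self.sent = []

    def send(self, data):
        # Non-deterministically simulate a network failure.
        if self._chooser.choose(["ok", "error"]) == "error":
            raise IOError("simulated failure")
        self.sent.append(data)


class Greeter(object):

    # The (toy) software under test: it must wrap low-level errors in
    # its own exception type rather than letting them escape.

    def __init__(self, socket):
        self._socket = socket

    def greet(self):
        try:
            self._socket.send("hello")
        except IOError:
            raise SendFailed()


def check_greeter(chooser):
    try:
        Greeter(FlakySocket(chooser)).greet()
    except SendFailed:
        pass
    # An unwrapped IOError would propagate out of check_greeter(); the
    # check() function above could catch it and report the sequence of
    # choices that triggered it.


check(check_greeter)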

Friday, 16 January 2009

Testing using golden files in Python

This is the third post in a series about automated testing in Python (earlier posts: 1, 2). This post is about testing using golden files.

A golden test is a fancy way of doing assertEquals() on a string or a directory tree, where the expected output is kept in a separate file or files -- the golden files. If the actual output does not match the expected output, the test runner can optionally run an interactive file comparison tool such as Meld to display the differences and let you selectively merge them into the golden files.

This is useful when

  • the data being checked is large - too large to embed into the Python source; or
  • the data contains relatively inconsequential details, such as boilerplate text or formatting, which might be changed frequently.

As an example, suppose you have code to format some data as HTML. You can have a test which creates some example data, formats this as HTML, and compares the result to a golden file. Something like this:

class ExampleTest(golden_test.GoldenTestCase):

    def test_formatting_html(self):
        obj = make_some_example_object()
        temp_dir = self.make_temp_dir()
        format_as_html(obj, temp_dir)
        self.assert_golden(temp_dir, os.path.join(os.path.dirname(__file__),
                                                  "golden-files"))

if __name__ == "__main__":
    golden_test.main()

By default, the test runs non-interactively, which is what you want on a continuous integration machine, and it will print a diff if it fails. To switch on the semi-interactive mode which runs Meld, you run the test with the option --meld.

Here is a simple version of the test helper (taken from here):

import os
import subprocess
import sys
import unittest

class GoldenTestCase(unittest.TestCase):

    run_meld = False

    def assert_golden(self, dir_got, dir_expect):
        assert os.path.exists(dir_expect), dir_expect
        proc = subprocess.Popen(["diff", "--recursive", "-u", "-N",
                                 "--exclude=.*", dir_expect, dir_got],
                                stdout=subprocess.PIPE)
        stdout, stderr = proc.communicate()
        if len(stdout) > 0:
            if self.run_meld:
                # Put expected output on the right because that is the
                # side we usually edit.
                subprocess.call(["meld", dir_got, dir_expect])
            raise AssertionError(
                "Differences from golden files found.\n"
                "Try running with --meld to update golden files.\n"
                "%s" % stdout)
        self.assertEquals(proc.wait(), 0)

def main():
    if sys.argv[1:2] == ["--meld"]:
        GoldenTestCase.run_meld = True
        sys.argv.pop(1)
    unittest.main()

(It's a bit cheesy to modify global state on startup to enable melding, but because unittest doesn't make it easy to pass parameters into tests this is the simplest way of doing it.)

Golden tests have the same sort of advantages that are associated with test-driven development in general.

  • Golden files are checked into version control and help to make changesets self-documenting. A changeset that affects the program's output will include patches that demonstrate how the output is affected. You can see the history of the program's output in version control. (This assumes that everyone runs the tests before committing!)
  • Sometimes you can point people at the golden files if they want to see example output. For HTML, sometimes you can contrive the CSS links to work so that the HTML looks right when viewed in a browser.
  • And of course, this can catch cases where you didn't intend to change the program's output.

My typical workflow is to add a test with some example input that I want the program to handle, and run the test with --meld until the output I want comes up on the left-hand side in Meld. I mark the output as OK by copying it over to the right-hand side. This is not quite test-first, because I am letting the test suggest what its expected output should be. But it is possible to do this in an even more test-first manner by typing the output you expect into the right-hand side.

Other times, one change will affect many locations in the golden files, and adding a new test is not necessary. It's usually not too difficult to quickly eyeball the differences with Meld.

Here are some of the things that I have used golden files to test:

  • formatting of automatically-generated e-mails
  • automatically generated configuration files
  • HTML formatting logs (build_log_test.py and its golden files)
  • pretty-printed output of X Windows messages in xjack-xcb (golden_test.py and golden_test_data). This ended up testing several components in one go:
    • the XCB protocol definitions for X11
    • the encoders and decoders that work off of the XCB protocol definitions
    • the pretty printer for the decoder

On the other hand, sometimes the overhead of having a separate file isn't worth it, and if the expected output is small (maybe 5 lines or so) I might instead paste it into the Python source as a multiline string.
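
For example, a small inline check might look like this (format_report() and make_example() are hypothetical names):

    def test_small_report(self):
        # The expected output lives inline as a multiline string rather
        # than in a golden file.
        expected = """\
header
  item 1
  item 2
"""
        self.assertEquals(format_report(make_example()), expected)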

It can be tempting to overuse golden tests. As with any test suite, avoid creating one big example that tries to cover all cases. (This is particularly tempting if the test is slow to run.) Try to create smaller, separate examples. The test helper above is not so good at this, because if you are not careful it can end up running Meld several times. In the past I have put several examples into (from unittest's point of view) one test case, so that it runs Meld only once.
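
Here is a sketch of that approach (make_simple_example(), make_nested_example() and format_as_html() are hypothetical, and it assumes GoldenTestCase provides make_temp_dir() as in the earlier example):

class FormattingGoldenTest(golden_test.GoldenTestCase):

    def test_formatting_examples(self):
        temp_dir = self.make_temp_dir()
        # Write each example into its own subdirectory so that a single
        # assert_golden() call - and so at most one Meld run - covers
        # them all.
        for name, obj in [("simple", make_simple_example()),
                          ("nested", make_nested_example())]:
            subdir = os.path.join(temp_dir, name)
            os.mkdir(subdir)
            format_as_html(obj, subdir)
        self.assert_golden(temp_dir, os.path.join(os.path.dirname(__file__),
                                                  "golden-files"))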

Golden tests might not work so well if components can be changed independently. For example, if your XML library changes its whitespace pretty-printing, the tests' output could change. This is less of a problem if your code is deployed with tightly-controlled versions of libraries, because you can just update the golden files when you upgrade libraries.

A note on terminology: I think I got the term "golden file" from my previous workplace, where other people were using it, and the term seems to enjoy some limited use judging from Google. "Golden test", however, may be a term that I made up and that no-one else outside my workplace uses for this meaning.

Wednesday, 17 December 2008

Helper for monkey patching in tests

Following on from my post on TempDirTestCase, here is another Python test case helper which I introduced at work. Our codebase has quite a few test cases which use monkey patching. That is, they temporarily modify a module or a class to replace a function or method for the duration of the test.

For example, you might want to monkey patch time.time so that it returns repeatable timestamps during the test. We have quite a lot of test cases that do something like this:

class TestFoo(unittest.TestCase):

    def setUp(self):
        self._old_time = time.time
        def monkey_time():
            return 0
        time.time = monkey_time

    def tearDown(self):
        time.time = self._old_time

    def test_foo(self):
        # body of test case

Having to save and restore the old values gets tedious, particularly if you have to monkey patch several objects (and, unfortunately, there are a few tests that monkey patch a lot). So I introduced a monkey_patch() method so that the code above can be simplified to:

class TestFoo(TestCase):

    def test_foo(self):
        self.monkey_patch(time, "time", lambda: 0)
        # body of test case

(OK, I'm cheating by using a lambda the second time around to make the code look shorter!)

Now, monkey patching is not ideal, and I would prefer not to have to use it. When I write new code I try to make sure that it can be tested without resorting to monkey patching. So, for example, I would parameterize the software under test to take time.time as an argument instead of getting it directly from the time module (here's an example).
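
A minimal sketch of that style (RequestLogger is a made-up class for illustration):

import time
import unittest


class RequestLogger(object):

    def __init__(self, time_func=time.time):
        self._time_func = time_func

    def format_entry(self, message):
        return "%d: %s" % (self._time_func(), message)


class RequestLoggerTest(unittest.TestCase):

    def test_formatting_uses_injected_clock(self):
        # No monkey patching needed: just pass in a fake clock.
        logger = RequestLogger(time_func=lambda: 0)
        self.assertEquals(logger.format_entry("hello"), "0: hello")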

But sometimes you have to work with a codebase where most of the code is not covered by tests and is structured in such a way that adding tests is difficult. You could refactor the code to be more testable, but that risks changing its behaviour and breaking it. In that situation, monkey patching can be very useful. Once you have some tests, refactoring can become easier and less risky. It is then easier to refactor to remove the need for monkey patching -- although in practice it can be hard to justify doing that, because it is relatively invasive and might not be a big improvement, and so the monkey patching stays in.

Here's the code, an extended version of the base class from the earlier post:

import os
import shutil
import tempfile
import unittest

class TestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def monkey_patch(self, obj, attr, new_value):
        old_value = getattr(obj, attr)
        def tear_down():
            setattr(obj, attr, old_value)
        self._on_teardown.append(tear_down)
        setattr(obj, attr, new_value)

    def monkey_patch_environ(self, key, value):
        old_value = os.environ.get(key)
        def tear_down():
            if old_value is None:
                del os.environ[key]
            else:
                os.environ[key] = old_value
        self._on_teardown.append(tear_down)
        os.environ[key] = value

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()

Saturday, 22 November 2008

TempDirTestCase, a Python unittest helper

I have seen a lot of unittest-based test cases written in Python that create temporary directories in setUp() and delete them in tearDown().

Creating temporary directories is such a common thing to do that I have a base class that provides a make_temp_dir() helper method. As a result, you often don't have to define setUp() and tearDown() in your test cases at all.

I have ended up copying this into different codebases because it's easier than making the codebases depend on an external library for such a trivial piece of code. This seems to be quite common: lots of Python projects provide their own test runners based on unittest.

Here's the code:

import shutil
import tempfile
import unittest

class TempDirTestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()
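
A hypothetical usage sketch: a test case subclasses TempDirTestCase and calls make_temp_dir() without defining setUp() or tearDown() of its own:

import os


class WritesFilesTest(TempDirTestCase):

    def test_writing_a_file(self):
        temp_dir = self.make_temp_dir()
        path = os.path.join(temp_dir, "output.txt")
        fh = open(path, "w")
        fh.write("hello")
        fh.close()
        self.assertEquals(open(path).read(), "hello")
        # The temporary directory is removed automatically in tearDown().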