
Tuesday, 1 December 2009

read_file() and write_file() in Python

Here are two functions I would like to see in the Python standard library:
def read_file(filename):
    fh = open(filename, "r")
    try:
        return fh.read()
    finally:
        fh.close()

def write_file(filename, data):
    fh = open(filename, "w")
    try:
        fh.write(data)
    finally:
        fh.close()

In the first case, it is possible to do open(filename).read() instead of read_file(filename). However, this is not good practice because if you are using a Python implementation that uses a garbage collection implementation other than reference counting, there will be a delay before the Python file object, and the file descriptor it wraps, are freed. So you will temporarily leak a scarce resource.

I have sometimes argued that GC should really know about file descriptors: If the open() syscall fails with EMFILE or ENFILE, the Python standard library should run GC and retry. (However this is probably not going to work for reliably receiving FDs in messages using recvmsg().) But in the case of read_file() it is really easy to inform the system of the lifetime of the FD using close(), so there is no excuse not to do so!
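To illustrate that idea, here is a minimal sketch of what such a retrying open() wrapper could look like; the function name and the single-retry policy are my own inventions, not anything the standard library does:

import errno
import gc

def gc_retrying_open(filename, mode="r"):
    # Hypothetical sketch: on FD exhaustion, run the garbage
    # collector to free any unreferenced file objects, then retry.
    try:
        return open(filename, mode)
    except (IOError, OSError), exc:
        if exc.errno in (errno.EMFILE, errno.ENFILE):
            gc.collect()
            return open(filename, mode)
        raise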

In the second case, again it is possible to do open(filename, "w").write(data) instead of write_file(filename, data), but in the presence of non-refcounting GC it will not only temporarily leak an FD, it will also do the wrong thing! The danger is that the file's contents will be only partially written, because Python file objects are buffered, and the buffer is not flushed until you do close() or flush(). If you use open() this way and test only on CPython, you won't find out that your program breaks on Jython, IronPython or PyPy (all of which normally use non-refcounting GC, as far as I know).

Maybe CPython should be changed so that the file object's destructor does not flush the buffer. That would break some programs but expose the problem. Maybe it could be optional.

read_file() and write_file() calls also look nicer. I don't want the ugliness of open-coding these functions.

Don't write code like this:

fh = open(os.path.join(temp_dir, "foo"), "w")
fh.write(blah)
fh.close()
fh = open(os.path.join(temp_dir, "bar"), "w")
fh.write(stuff)
fh.close()

Instead, write it like this:

write_file(os.path.join(temp_dir, "foo"), blah)
write_file(os.path.join(temp_dir, "bar"), stuff)

Until Python includes these functions in the standard library, I will copy and paste them into every Python codebase I work on in protest! They are too trivial, in my opinion, to be worth the trouble of depending on an external library.

I'd like to see these functions as builtins, because they should be as readily available as open(), for which they are alternatives. However if they have to be imported from a module I won't complain!

Wednesday, 18 November 2009

CapPython revisited

[I've had a draft of this post floating around since June, and it's about time I posted it.]

Guido van Rossum's criticisms of CapPython from earlier in the year are largely right, though I would put the emphasis differently. A lot of existing Python code will not pass CapPython's static verifier. But I was not expecting large amounts of code to pass the verifier.

CapPython was an experiment to see if Python could be tamed using only a static verifier - similar to Joe-E, a subset of Java - without resorting to automated rewriting of code. I had a hunch that it was possible. I also wanted a concrete system to show why I preferred to avoid some of the Python constructs that CapPython prohibits. Some of Joe-E's restrictions are also quite severe, such as prohibiting try...finally in order to prevent non-deterministic execution (something, incidentally, that CapPython does not try to do). It turned out that it was better to abandon the goal of being static-only when it came to loading Python modules.

My goal was that CapPython code should look like fairly idiomatic Python code (and be able to run directly under Python), not that all or most idiomatic Python code would be valid CapPython code. As a consequence, I didn't want CapPython to require special declarations, such as decorators on classes, for making objects encapsulated. Instead I wanted objects to be encapsulated by default. Maybe that was an arbitrary goal.

CapPython managed to meet that goal, but only partially, and only by depending on Python 2.x's unbound methods feature, which was removed in Python 3.0.

The reason my goal was met only partially is that in some circumstances it is necessary to wrap class objects (using a decorator such as @sealed in my earlier post) so that the class's constructor is exposed but inheritance is blocked. Whether this is necessary depends on whether the class object is authorityless. A class object is authorityless if it does not reference any authority-carrying objects (directly, or indirectly through objects captured by method function closures). If a class object is authorityless, it should be safe to grant other, untrusted modules the ability to create derived classes. Otherwise, the class object ought to be wrapped. The problem is that it is not trivial to make this determination. Furthermore, permitting inheritance is the default, and this is not a safe default.

If we have to go around putting @sealed decorators on all class definitions because the non-@sealed default is not-quite-encapsulated, we may as well consider alternatives where the default is not-at-all-encapsulated.

The main alternative is to resurrect CPython's restricted mode and add a wrapper type for making encapsulated objects (i.e. a non-broken version of Bastion along the lines of my earlier post).

If modifying CPython is considered acceptable, and one's goal is to add object-capabilities to Python in the most practical way (rather than to investigate a specific approach), resurrecting restricted mode is probably a better way to do it than CapPython.

Friday, 18 September 2009

Logging with stack traces

I find that postmortem debugging is useful (e.g. examining process state in gdb or pdb after a crash or traceback), but trying to debug using breakpoints is an exercise in tedium. I prefer debugging using logging. However, logs obtained just by adding printf statements tend to lack context information, and if you add lots of printfs the logs require more effort to read.

You can get context information in Python by doing traceback.print_stack() after every message, but the result tends to be unreadable. Below is a tool which prints stack context in a more readable tree format. I found it very useful a couple of months ago when investigating a bug in which a recursive invocation of a callback function caused failures later on. Had I not seen the stack context, I might not have realised that the callback was being called recursively.

Here is some example output. It comes from running the Tahoe client, while monkey-patching some socket functions to show which sockets the process is binding and connecting to.

0:/usr/bin/twistd:21:<module>
 1:/usr/lib/python2.5/site-packages/twisted/scripts/twistd.py:27:run
  2:/usr/lib/python2.5/site-packages/twisted/application/app.py:379:run
   3:/usr/lib/python2.5/site-packages/twisted/scripts/twistd.py:23:runApp
    4:/usr/lib/python2.5/site-packages/twisted/application/app.py:158:run
     5:/usr/lib/python2.5/site-packages/twisted/scripts/_twistd_unix.py:213:postApplication
      6:/usr/lib/python2.5/site-packages/twisted/scripts/_twistd_unix.py:174:startApplication
       7:/usr/lib/python2.5/site-packages/twisted/application/service.py:228:privilegedStartService
        8:/usr/lib/python2.5/site-packages/twisted/application/service.py:228:privilegedStartService
         9:/usr/lib/python2.5/site-packages/twisted/application/service.py:228:privilegedStartService
          10:/usr/lib/python2.5/site-packages/twisted/application/internet.py:68:privilegedStartService
           11:/usr/lib/python2.5/site-packages/twisted/application/internet.py:86:_getPort
            12:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:467:listenTCP
             13:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:730:startListening
              14:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:718:createInternetSocket
               15:/usr/lib/python2.5/site-packages/twisted/internet/base.py:724:createInternetSocket
                16:tahoe-client.tac:55:__init__
                 create socket <Socket object at 0x8eb580c> (2, 1)
              14:tahoe-client.tac:64:bind
               bind <Socket object at 0x8eb580c> (('', 56530),)
foolscap.pb.Listener starting on 56530
         9:/usr/lib/python2.5/site-packages/twisted/application/service.py:228:privilegedStartService
          10:/usr/lib/python2.5/site-packages/twisted/application/internet.py:68:privilegedStartService
           11:/usr/lib/python2.5/site-packages/twisted/application/internet.py:86:_getPort
            12:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:467:listenTCP
             13:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:730:startListening
              14:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:718:createInternetSocket
               15:/usr/lib/python2.5/site-packages/twisted/internet/base.py:724:createInternetSocket
                16:tahoe-client.tac:55:__init__
                 create socket <Socket object at 0x8eb5844> (2, 1)
              14:tahoe-client.tac:64:bind
               bind <Socket object at 0x8eb5844> (('0.0.0.0', 3456),)
nevow.appserver.NevowSite starting on 3456
Starting factory <nevow.appserver.NevowSite instance at 0x8eb07ac>
My pid: 27115
      6:/usr/lib/python2.5/site-packages/twisted/application/app.py:113:runReactorWithLogging
       7:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:220:run
        8:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:228:mainLoop
         9:/usr/lib/python2.5/site-packages/twisted/internet/base.py:561:runUntilCurrent
          10:/usr/lib/python2.5/site-packages/foolscap/eventual.py:26:_turn
           11:/usr/lib/python2.5/site-packages/allmydata/node.py:259:_startService
            12:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:191:addCallback
             13:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:182:addCallbacks
              14:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
               15:/usr/lib/python2.5/site-packages/allmydata/node.py:259:<lambda>
                16:/usr/lib/python2.5/site-packages/allmydata/util/iputil.py:93:get_local_addresses_async
                 17:/usr/lib/python2.5/site-packages/allmydata/util/iputil.py:137:get_local_ip_for
                  18:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:388:listenUDP
                   19:/usr/lib/python2.5/site-packages/twisted/internet/udp.py:84:startListening
                    20:/usr/lib/python2.5/site-packages/twisted/internet/udp.py:89:_bindSocket
                     21:/usr/lib/python2.5/site-packages/twisted/internet/base.py:724:createInternetSocket
                      22:tahoe-client.tac:55:__init__
                       create socket <Socket object at 0x8eb58ec> (2, 2)
                     21:tahoe-client.tac:64:bind
                      bind <Socket object at 0x8eb58ec> (('', 0),)
twisted.internet.protocol.DatagramProtocol starting on 53017
Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8eb0dcc>
                  18:/usr/lib/python2.5/site-packages/twisted/internet/udp.py:181:connect
                   19:tahoe-client.tac:58:connect
                    connect <Socket object at 0x8eb58ec> (('198.41.0.4', 7),)
(Port 53017 Closed)
Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8eb0dcc>
         9:/usr/lib/python2.5/site-packages/twisted/internet/base.py:561:runUntilCurrent
          10:/usr/lib/python2.5/site-packages/foolscap/eventual.py:26:_turn
           11:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
            12:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
             13:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
              14:/usr/lib/python2.5/site-packages/allmydata/client.py:136:_start_introducer_client
               15:/usr/lib/python2.5/site-packages/twisted/application/service.py:148:setServiceParent
                16:/usr/lib/python2.5/site-packages/twisted/application/service.py:260:addService
                 17:/usr/lib/python2.5/site-packages/allmydata/introducer/client.py:58:startService
                  18:/usr/lib/python2.5/site-packages/foolscap/pb.py:901:connectTo
                   19:/usr/lib/python2.5/site-packages/foolscap/reconnector.py:60:startConnecting
                    20:/usr/lib/python2.5/site-packages/foolscap/reconnector.py:87:_connect
                     21:/usr/lib/python2.5/site-packages/foolscap/pb.py:830:getReference
                      22:/usr/lib/python2.5/site-packages/twisted/internet/defer.py:107:maybeDeferred
                       23:/usr/lib/python2.5/site-packages/foolscap/pb.py:856:_getReference
                        24:/usr/lib/python2.5/site-packages/foolscap/pb.py:947:getBrokerForTubRef
                         25:/usr/lib/python2.5/site-packages/foolscap/negotiate.py:1332:connect
                          26:/usr/lib/python2.5/site-packages/foolscap/negotiate.py:1359:connectToAll
                           27:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:474:connectTCP
                            28:/usr/lib/python2.5/site-packages/twisted/internet/base.py:665:connect
                             29:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:885:_makeTransport
                              30:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:593:__init__
                               31:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:497:createInternetSocket
                                32:tahoe-client.tac:55:__init__
                                 create socket <Socket object at 0x8eb5a74> (2, 1)
                           27:/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py:474:connectTCP
                            28:/usr/lib/python2.5/site-packages/twisted/internet/base.py:665:connect
                             29:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:885:_makeTransport
                              30:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:593:__init__
                               31:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:497:createInternetSocket
                                32:tahoe-client.tac:55:__init__
                                 create socket <Socket object at 0x8c59e64> (2, 1)
         9:/usr/lib/python2.5/site-packages/twisted/internet/base.py:561:runUntilCurrent
          10:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:506:resolveAddress
           11:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:513:_setRealAddress
            12:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:539:doConnect
             13:tahoe-client.tac:61:connect_ex
              connect_ex <Socket object at 0x8eb5a74> (('127.0.0.1', 45651),)
          10:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:506:resolveAddress
           11:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:513:_setRealAddress
            12:/usr/lib/python2.5/site-packages/twisted/internet/tcp.py:539:doConnect
             13:tahoe-client.tac:61:connect_ex
              connect_ex <Socket object at 0x8c59e64> (('192.168.20.244', 45651),)
Here is the code for the logging tool:
import sys

laststack = []

def tlog(*msgs):
    """Outputs a log message with stack trace context."""
    global laststack
    # Raise and catch an exception to get hold of the caller's
    # stack frame (on CPython, sys._getframe(1) would also work).
    try:
        raise Exception()
    except Exception:
        frame = sys.exc_info()[2].tb_frame.f_back
    stack = extract(frame)
    for i, fr in enumerate(stack):
        if not (i < len(laststack) and laststack[i] is fr):
            lineno = fr.f_lineno
            co = fr.f_code
            filename = co.co_filename
            name = co.co_name
            print " " * i + ("%i:%s:%s:%s" % (i, filename, lineno, name))
    print " " * len(stack) + " ".join(map(str, msgs))
    laststack = stack

def extract(frame):
    got = []
    while frame is not None:
        got.append(frame)
        frame = frame.f_back
    got.reverse()
    return got

This makes use of the fact that Python's stack frames are heap-allocated, so they have identity, and it is possible to hold on to a stack frame object after the frame has exited. This means it is possible to distinguish between two different invocations of the same function, even when the tracebacks are textually identical. This would not be possible in C -- gdb would have to approximate.
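A small demonstration of that identity property (CPython; sys._getframe is CPython-specific):

import sys

def grab_frame():
    # Return the frame object of this very invocation.
    return sys._getframe()

f1 = grab_frame()
f2 = grab_frame()
print f1.f_code is f2.f_code  # True: the same function...
print f1 is f2                # False: ...but two distinct invocations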

For completeness, here is the monkey-patching code that I used:

import socket

class Socket(socket.socket):
    def __init__(self, *args, **kwargs):
        tlog("create socket", self, args)
        super(Socket, self).__init__(*args, **kwargs)
    def connect(self, *args, **kwargs):
        tlog("connect", self, args)
        super(Socket, self).connect(*args, **kwargs)
    def connect_ex(self, *args, **kwargs):
        tlog("connect_ex", self, args)
        # connect_ex returns an error code, so propagate the result;
        # dropping it would break callers that check the return value.
        return super(Socket, self).connect_ex(*args, **kwargs)
    def bind(self, *args, **kwargs):
        tlog("bind", self, args)
        super(Socket, self).bind(*args, **kwargs)

socket.socket = Socket

Monday, 31 August 2009

How to do model checking of Python code

I read about eXplode in early 2007. The paper describes a neat technique, inspired by model checking, for exhaustively exploring all execution paths of a program. The difference from model checking is that it doesn't require a model of the program to be written in a special language; it can operate on a program written in a normal language. The authors of the paper use the technique to check the correctness of Linux filesystems, and version control systems layered on top of them, in the face of crashes.

The basic idea is to have a special function, choose(), that performs a non-deterministic choice, e.g.

   choose([1, 2, 3])

Conceptually, this forks execution into three branches, returning 1, 2 or 3 in each branch.

The program can call choose() multiple times, so you end up with a choice tree.

The job of the eXplode check runner is to explore that choice tree, and to try to find sequences of choices that make the program fail.

eXplode does not try to snapshot the state of the program when choose() is called. (That's what other model checkers have tried to do, and it's what would happen if choose() were implemented using the Unix fork() call.)

Instead, eXplode restores the state of the program by running it from the start and replaying a previously-recorded sequence of choices by returning them one-by-one from choose(). A node in the choice tree is represented by the sequence of choices required to get to it, and so to explore that node's subtree the check runner will repeatedly run the program from the start, replaying the choices for that subtree. The check runner does a breadth-first traversal of the choice tree, so it has a queue of nodes to visit, and each visit will add zero or more subnodes to the queue. (See the code below for a concrete implementation.)

This replay technique means that the program must be deterministic, otherwise the choice tree won't get explored properly. All non-determinism must occur through choose().

There seem to be two situations in which you might call choose():

  1. You can call choose() from your test code to generate input.

    In eXplode, the test program writes files and tries to read them back, and it uses choose() to pick what kind of writes it does.

    So far I have used choose() in tests to generate input in two different ways:

    In one case, I started with valid input and mangled it in different ways, to check the error handling of the software-under-test. The test code used choose() to pick a file to delete or line to remove. The idea was that the software-under-test should give a nicely-formatted error message, specifying the faulty file, rather than a Python traceback. There was only a bounded number of faults to introduce so the choice tree was finite.

    In another case, I built up XML trees to check that the software under test could process them. In this case the choice tree is infinite. I decided to prune it arbitrarily at a certain depth so that the test could run as part of a fairly quickly-running test suite.

    In both these cases, using the eXplode technique wasn't strictly necessary. It would have been possible to write the tests without choose(), perhaps by copying the inputs before modifying them. But I think choose() makes the test more concise, and not all objects have interfaces for making copies.

  2. You can also call choose() from the software-under-test to introduce deliberate faults, or from a fake object that the software-under-test is instantiated with.

    In eXplode, the filesystem is tested with a fake block device which tries to simulate all the possible failure modes of a real hard disc (note that Flash drives have different failure modes!). The fake device uses choose() to decide when to simulate a crash and restart, and which blocks should be successfully written to the virtual device when the crash happens.

    You can also wrap malloc() so that it non-deterministically returns NULL in order to test that software handles out-of-memory conditions gracefully. Testing error paths is difficult to do and this looks like a good way of doing it.

    In these cases, the calls to choose() can be buried deeply in the call stack, and the eXplode check runner is performing an inversion of control that would be hard to do in a conventional test case.

Here's an example of how to implement the eXplode technique in Python:


import unittest


# If anyone catches and handles this, it will break the checking model.
class ModelCheckEscape(Exception):

    pass


class Chooser(object):

    def __init__(self, chosen, queue):
        self._so_far = chosen
        self._index = 0
        self._queue = queue

    def choose(self, choices):
        if self._index < len(self._so_far):
            choice = self._so_far[self._index]
            if choice not in choices:
                raise Exception("Program is not deterministic")
            self._index += 1
            return choice
        else:
            for choice in choices:
                self._queue.append(self._so_far + [choice])
            raise ModelCheckEscape()


def check(func):
    queue = [[]]
    while len(queue) > 0:
        chosen = queue.pop(0)
        try:
            func(Chooser(chosen, queue))
        except ModelCheckEscape:
            pass
        # Can catch other exceptions here and report the failure
        # - along with the choices that caused it - and then carry on.


class ModelCheckTest(unittest.TestCase):

    def test(self):
        got = []

        def func(chooser):
            # Example of how to generate a Cartesian product using choose():
            v1 = chooser.choose(["a", "b", "c"])
            v2 = chooser.choose(["x", "y", "z"])
            # We don't normally expect the function being tested to
            # capture state - in this case the "got" list - but we
            # do it here for testing purposes.
            got.append(v1 + v2)

        check(func)
        self.assertEquals(sorted(got),
                          sorted(["ax", "ay", "az",
                                  "bx", "by", "bz",
                                  "cx", "cy", "cz"]))


if __name__ == "__main__":
    unittest.main()
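To connect this back to the fault-injection use case above, here is a hypothetical sketch of a fake block device in the spirit of eXplode's, driven by the same check() function; FakeDisk and run_disk_test are invented for illustration:

class FakeDisk(object):
    # Hypothetical fake device: each write either persists or is
    # silently dropped, as if the machine crashed at that moment.
    def __init__(self, chooser):
        self._chooser = chooser
        self.blocks = {}

    def write_block(self, index, data):
        if self._chooser.choose([True, False]):
            self.blocks[index] = data

def run_disk_test(chooser):
    disk = FakeDisk(chooser)
    disk.write_block(0, "a")
    disk.write_block(1, "b")
    # In each branch, assert whatever invariant the software under
    # test should maintain given that only a subset of writes survived.

check(run_disk_test)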

Sunday, 28 June 2009

Encapsulation in Python: two approaches

Time for another post on how object-capability security might be done in Python.

Suppose function closures in Python were encapsulated (i.e. the attributes func_closure, func_globals etc. were removed) and were the only route for getting encapsulation for non-builtin objects, as is the case in Python's old restricted mode (a.k.a. rexec). There would be two ways of defining new encapsulated objects. Let's call them the Cajita approach and the Bastion approach.

The Cajita-style approach (named after the Cajita subset of Javascript) is to define an object using a record of function closures, one closure per method. Here's an example, from a 2003 post by Ka-Ping Yee:

def Counter():
    self = Namespace()
    self.i = 0

    def next():
        self.i += 1
        return self.i

    return ImmutableNamespace(next)
(There are some more examples - a read-only FileReader object, and Mint and Purse objects based on the example from E - in Tav's post on object-capabilities in Python.)

This is instead of the far more idiomatic, but unencapsulated class definition:

class Counter(object):

    def __init__(self):
        self._i = 0

    def next(self):
        self._i += 1
        return self._i
(Ignoring that the more idiomatic way to do this is to use a generator or itertools.count().)

I am calling this the Cajita approach because it is analogous to how objects are defined in the Cajita subset of Javascript (part of the Caja project), where this code would be written as:

function Counter() {
    var i = 0;
    return Obj.freeze({
        next: function () {
            i += 1;
            return i;
        },
    });
}
The Python version of the Cajita style is a lot more awkward because Python is syntactically stricter than Javascript:
  • Expressions cannot contain statements, so function definitions cannot be embedded in an object creation expression. As a result, method names have to be specified twice: once in the method definitions and again when passed to ImmutableNamespace. (It would be three times if ImmutableNamespace did not use __name__.)
  • A function can only assign to a variable in an outer scope by using the nonlocal declaration (recently introduced), and that is awkward. The example works around this by doing "self = Namespace()" and assigning to the attribute self.i.
The Cajita style also has a memory usage cost. The size of each object will be O(n) in the number of methods, because each method is created as a separate function closure with its own pointer to the object's state. This is not trivial to optimise away because doing so would change the object's garbage collection behaviour.

For these reasons it would be better not to use the Cajita style in Python.

The alternative is the approach taken by the Bastion module, which the Python standard library used to provide for use with Python's restricted mode. (Restricted mode and Bastion.py are actually still present in Python 2.6, but restricted mode has holes and is unsupported, and Bastion.py is disabled.)

Bastion provided wrapper objects that exposed only the public attributes of the object being wrapped (where non-public attribute names are those that begin with an underscore). This approach means we can define objects using class definitions as usual, and wrap the resulting objects. To make a class encapsulated and uninheritable, we just have to add a decorator:

@sealed
class Counter(object):

    def __init__(self):
        self._i = 0

    def next(self):
        self._i += 1
        return self._i
where "sealed" can be defined as follows:
# Minimal version of Bastion.BastionClass.
# Converts a function closure to an object.
class BastionClass:

    def __init__(self, get):
        self._get = get

    def __getattr__(self, attr):
        return self._get(attr)

# Minimal version of Bastion.Bastion.
def Bastion(object):
    # Create a function closure wrapping the object.
    def get(attr):
        if type(attr) is str and not attr.startswith("_"):
            return getattr(object, attr)
        raise AttributeError(attr)
    return BastionClass(get)

def sealed(klass):
    def constructor(*args, **kwargs):
        return Bastion(klass(*args, **kwargs))
    return constructor
>>> c = Counter()
>>> c.next()
1
>>> c.next()
2
>>> c._i
AttributeError: _i

Mutability problem

One problem with the Bastion approach is that Bastion's wrapper objects are mutable. Although you can't use the Bastion object to change the wrapped object's attributes, you can change the Bastion object's attributes. This means that these wrapper objects cannot be safely shared between mutually distrusting components. Multiple holders of the object could use it as a communications channel or violate each other's expectations.
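For example, with the @sealed Counter from above (a hypothetical session):

c = Counter()
# The wrapped object's state is protected: c._i raises AttributeError.
# But the wrapper itself happily accepts new attributes, which two
# holders of the same wrapper could use as a communication channel:
c.spam = "covert channel"
print c.spam  # prints "covert channel"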

There isn't an obvious way to fix this in pure Python. Overriding __setattr__ isn't enough. I expect the simplest way to deal with this is to implement Bastion in a C extension module instead of in Python. The same goes for ImmutableNamespace (referred to in the first example) - the Cajita-style approach faces the same issue.

Wrapping hazard

There is a potential hazard in defining encapsulated objects via wrappers that is not present in CapPython. In methods, "self" will be bound to the private view of the object. There is a risk of accidentally writing code that passes the unwrapped self object to untrusted code.

An example:

class LoggerMixin(object):

    def sub_log(self, prefix):
        return PrefixLog(self, prefix)

@sealed
class PrefixLog(LoggerMixin):

    def __init__(self, log, prefix):
        self._log = log
        self._prefix = prefix

    def log(self, message):
        self._log.log("%s: %s" % (self._prefix, message))
(This isn't a great example because of the circular relationship between the two classes.)

This hazard is something we could lint for. It would be easy to warn about cases where self is used outside of an attribute access.

The method could be changed to:

    def sub_log(self, prefix):
        return PrefixLog(seal_object(self), prefix)
This creates a new wrapper object each time which is not always desirable. To avoid this we could store the wrapper object in an attribute of the wrapped object. Note that that would create a reference cycle, and while CPython's cycle collector is usually fine for collecting cycles, creating a cycle for every object might not be a good idea. An alternative would be to memoize seal_object() using a weak dictionary.
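A sketch of that last idea, assuming the Bastion() function from above (the caching scheme is my own illustration, not tested code): both the key and the value are held weakly, so the cache neither keeps wrapped objects alive nor creates uncollectable cycles.

import weakref

_wrappers = weakref.WeakKeyDictionary()

def seal_object(obj):
    ref = _wrappers.get(obj)
    wrapper = ref() if ref is not None else None
    if wrapper is None:
        wrapper = Bastion(obj)
        # Hold the wrapper weakly too: storing it strongly would keep
        # obj alive via the wrapper's closure for as long as the cache
        # entry exists.
        _wrappers[obj] = weakref.ref(wrapper)
    return wrapper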

Sunday, 14 June 2009

Python standard library in Native Client

The Python standard library now works under Native Client in the web browser, including the Sqlite extension module.

By that I mean that it is possible to import modules from the standard library, but a lot of system calls won't be available. Sqlite works for in-memory use.

Here's a screenshot of the browser-based Python REPL:

The changes needed to make this work were quite minor:

  • Implementing stat() in glibc so that it does an RPC to the Javascript running on the web page. Python uses stat() when searching for modules in sys.path.
  • Changing fstat() in NaCl so that it doesn't return a fixed inode number. Python uses st_ino to determine whether an extension module is the same as a previously-loaded module, which doesn't work if st_ino is the same for different files.
  • Some tweaks to nacl-glibc-gcc, a wrapper script for nacl-gcc.
  • Ensuring that NaCl-glibc's libpthread.so works.
I didn't have to modify Python or Sqlite at all. I didn't even have to work out what arguments to pass to Sqlite's configure script - the Debian package for Sqlite builds without modification. It's just a matter of setting PATH to override gcc to run nacl-glibc-gcc.

Saturday, 18 April 2009

Python variable binding semantics, part 2

Time for me to follow up my previous blog post and explain the code snippets.

Snippet 1

funcs = []
for x in (1, 2, 3):
    funcs.append(lambda: x)
for f in funcs:
    print f()
Although you might expect this Python code snippet to print "1 2 3", it actually prints "3 3 3". The reason for this is that the same mutable variable binding for x is used across all iterations of the loop. The for loop does not create new variable bindings.

The function closures that are created inside the loop do not capture the value of x; they capture the cell object that x is bound to under the hood (or the globals dictionary if this code snippet is run at the top level), and each iteration mutates the same cell object.
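You can see the shared cell directly (Python 2, where the attribute is called func_closure; the snippet is wrapped in a function because at the top level the lambdas use the globals dictionary instead):

def make_funcs():
    funcs = []
    for x in (1, 2, 3):
        funcs.append(lambda: x)
    return funcs

funcs = make_funcs()
cells = [f.func_closure[0] for f in funcs]
print cells[0] is cells[1] is cells[2]  # True: one shared cell object
print [c.cell_contents for c in cells]  # [3, 3, 3]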

Interestingly, Python does not restrict the target of the for loop to be a variable or tuple of variables. It can be any lvalue expression that can appear on the left hand side of an assignment, including indexing expressions. For example:

x = [10, 20]
for x[0] in (1,2,3):
    pass
print x # Prints [3, 20]

Snippet 2

This problem with Python's variable binding semantics is not specific to lambdas; it also occurs if you use def.
funcs = []
for x in (1, 2, 3):
    def myfunc():
        return x
    funcs.append(myfunc)
for f in funcs:
    print f()
This also prints "3 3 3".

Snippet 3

The remaining snippets are examples of ways to work around this problem. They print the desired "1 2 3".

One way is to use default arguments:

funcs = []
for x in (1, 2, 3):
    funcs.append(lambda x=x: x)
for f in funcs:
    print f()
This is actually the trick that was used to get the effect of lexical scoping before lexical scoping was added to Python.

The default argument captures the value of x at the point where the function closure is created, and this value gets rebound to x when the function is called.

Although it is concise, I'm not too keen on this workaround. It is an abuse of default arguments. There is a risk that you accidentally pass too many arguments to the function and thereby override the captured value of x.

Snippet 4

funcs = []
for x in (1, 2, 3):
    def g(x):
        funcs.append(lambda: x)
    g(x)
for f in funcs:
    print f()
This is quite an ugly workaround, but it's perhaps the most general one. The loop body is wrapped in a function. The x inside function g and the x outside are statically different variable bindings. A new mutable binding of x is created on every call to g, but this binding is never modified.

Snippets 5 and 6

The remaining two snippets are attempts to make the code less ugly, by giving names to the intermediate functions that are introduced to rebind x and thereby give the desired behaviour.
funcs = []
def append_a_func(x):
    funcs.append(lambda: x)
for x in (1, 2, 3):
    append_a_func(x)
for f in funcs:
    print f()

def make_func(x):
    return lambda: x
funcs = []
for x in (1, 2, 3):
    funcs.append(make_func(x))
for f in funcs:
    print f()
I suppose there is a downside to de-uglifying the code. If the code looks too normal, there's a risk that someone will refactor it to the incorrect version without testing that it works properly when looping over lists containing multiple items.

Can the language be fixed?

The root cause is that Python does away with explicit variable binding declarations. If a variable is assigned within a function -- including being assigned as the target of a for loop -- it becomes local to the function, but variables are never local to smaller statement blocks such as a loop body. Python doesn't have block scoping.

There are a couple of naive ways in which you might try to fix the language so that the code does what you would expect, both of which have problems. We could change the semantics of function closure creation so that closures take a snapshot of variables' values. But this would break mutually recursive function definitions, and cases where functions are referred to before they are defined, such as:
def f(): # This "def" would raise an UnboundLocalError or a NameError
    g()
def g():
    pass
We could change the semantics of for loops so that the target variable is given a variable binding that is local to the loop body. But this wouldn't change the status of variable bindings created by assignment, so this code would still have the unwanted behaviour:
for x in (1, 2, 3):
    y = x
    funcs.append(lambda: y)

Linting

It should be possible to create a lint tool to detect this hazard. It could look for closures that capture variables that are assigned by or in loops. I wonder how many false positives it would give on a typical codebase.
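Here is a rough sketch of such a check using the ast module; it is illustrative only and will certainly over-report (for example, it does not notice when the loop variable is rebound as a function parameter or default argument):

import ast

def loop_closure_warnings(source):
    warnings = []
    for loop in ast.walk(ast.parse(source)):
        if not isinstance(loop, ast.For):
            continue
        # Names bound by the loop target (handles tuple targets too).
        targets = set(node.id for node in ast.walk(loop.target)
                      if isinstance(node, ast.Name))
        for node in ast.walk(loop):
            if isinstance(node, (ast.Lambda, ast.FunctionDef)):
                used = set(n.id for n in ast.walk(node)
                           if isinstance(n, ast.Name))
                for name in sorted(targets & used):
                    warnings.append(
                        "line %i: closure may capture loop variable %r"
                        % (node.lineno, name))
    return warnings

print loop_closure_warnings("funcs = []\n"
                            "for x in (1, 2, 3):\n"
                            "    funcs.append(lambda: x)\n")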

Javascript

Javascript suffers from the same problem:
funcs = [];
for(var x = 0; x < 3; x++) {
    funcs.push(function() { return x; })
}
for(var i = 0; i < funcs.length; i++) {
    print(funcs[i]());
}
This code prints "3 3 3" when run using rhino.

In Javascript's case the affliction is entirely gratuitous: var declarations are implicitly hoisted up to the top of the function for no apparent reason, and the behaviour is not the result of any trade-off.

The good news, though, is that the problem is recognised and a fix to Javascript is planned. I think the plan is for Ecmascript 3.1 to introduce a let declaration as an alternative to var without the hoisting behaviour.

List comprehensions

Back to Python. The same problem also applies to list comprehensions and generator expressions.
funcs = [(lambda: x) for x in (1, 2, 3)]
for f in funcs:
    print f()

funcs = list((lambda: x) for x in (1, 2, 3))
for f in funcs:
    print f()

These both print "3 3 3".

(Note that I added brackets around the lambdas for clarity but the syntax does not require them.)

This is forgivable for list comprehensions, at least in Python 2.x, because the assignment to x escapes into the surrounding function. (See my earlier post.)

But for generators (and for list comprehensions in Python 3.0), the scope of x is limited to the comprehension. Semantically, it would be easy to limit the scope of x to within a loop iteration, so that each iteration introduces a new variable binding.

Tuesday, 31 March 2009

Python variable binding semantics

What do the six following chunks of code do, and what would you like them to do? Which do you prefer?
funcs = []
for x in (1, 2, 3):
    funcs.append(lambda: x)
for f in funcs:
    print f()

funcs = []
for x in (1, 2, 3):
    def myfunc():
        return x
    funcs.append(myfunc)
for f in funcs:
    print f()

funcs = []
for x in (1, 2, 3):
    funcs.append(lambda x=x: x)
for f in funcs:
    print f()

funcs = []
for x in (1, 2, 3):
    def g(x):
        funcs.append(lambda: x)
    g(x)
for f in funcs:
    print f()

funcs = []
def append_a_func(x):
    funcs.append(lambda: x)
for x in (1, 2, 3):
    append_a_func(x)
for f in funcs:
    print f()

def make_func(x):
    return lambda: x
funcs = []
for x in (1, 2, 3):
    funcs.append(make_func(x))
for f in funcs:
    print f()

Friday, 16 January 2009

Testing using golden files in Python

This is the third post in a series about automated testing in Python (earlier posts: 1, 2). This post is about testing using golden files.

A golden test is a fancy way of doing assertEquals() on a string or a directory tree, where the expected output is kept in a separate file or files -- the golden files. If the actual output does not match the expected output, the test runner can optionally run an interactive file comparison tool such as Meld to display the differences and allow you to selectively merge the differences into the golden file.

This is useful when

  • the data being checked is large - too large to embed into the Python source; or
  • the data contains relatively inconsequential details, such as boilerplate text or formatting, which might be changed frequently.
As an example, suppose you have code to format some data as HTML. You can have a test which creates some example data, formats this as HTML, and compares the result to a golden file. Something like this:
class ExampleTest(golden_test.GoldenTestCase):

    def test_formatting_html(self):
        obj = make_some_example_object()
        temp_dir = self.make_temp_dir()
        format_as_html(obj, temp_dir)
        self.assert_golden(temp_dir, os.path.join(os.path.dirname(__file__),
                                                  "golden-files"))

if __name__ == "__main__":
    golden_test.main()
By default, the test runs non-interactively, which is what you want on a continuous integration machine, and it will print a diff if it fails. To switch on the semi-interactive mode which runs Meld, you run the test with the option --meld.

Here is a simple version of the test helper (taken from here):

import os
import subprocess
import sys
import unittest

class GoldenTestCase(unittest.TestCase):

    run_meld = False

    def assert_golden(self, dir_got, dir_expect):
        assert os.path.exists(dir_expect), dir_expect
        proc = subprocess.Popen(["diff", "--recursive", "-u", "-N",
                                 "--exclude=.*", dir_expect, dir_got],
                                stdout=subprocess.PIPE)
        stdout, stderr = proc.communicate()
        if len(stdout) > 0:
            if self.run_meld:
                # Put expected output on the right because that is the
                # side we usually edit.
                subprocess.call(["meld", dir_got, dir_expect])
            raise AssertionError(
                "Differences from golden files found.\n"
                "Try running with --meld to update golden files.\n"
                "%s" % stdout)
        self.assertEquals(proc.wait(), 0)

def main():
    if sys.argv[1:2] == ["--meld"]:
        GoldenTestCase.run_meld = True
        sys.argv.pop(1)
    unittest.main()
(It's a bit cheesy to modify global state on startup to enable melding, but because unittest doesn't make it easy to pass parameters into tests this is the simplest way of doing it.)

Golden tests have the same sort of advantages that are associated with test-driven development in general.

  • Golden files are checked into version control and help to make changesets self-documenting. A changeset that affects the program's output will include patches that demonstrate how the output is affected. You can see the history of the program's output in version control. (This assumes that everyone runs the tests before committing!)
  • Sometimes you can point people at the golden files if they want to see example output. For HTML, sometimes you can contrive the CSS links to work so that the HTML looks right when viewed in a browser.
  • And of course, this can catch cases where you didn't intend to change the program's output.
My typical workflow is to add a test with some example input that I want the program to handle, and run the test with --meld until the output I want comes up on the left-hand side in Meld. I mark the output as OK by copying it over to the right-hand side. This is not quite test-first, because I am letting the test suggest what its expected output should be. But it is possible to do this in an even more test-first manner by typing the output you expect into the right-hand side.

Other times, one change will affect many locations in the golden files, and adding a new test is not necessary. It's usually not too difficult to quickly eyeball the differences with Meld.

Here are some of the things that I have used golden files to test:

  • formatting of automatically-generated e-mails
  • automatically generated configuration files
  • HTML formatting logs (build_log_test.py and its golden files)
  • pretty-printed output of X Windows messages in xjack-xcb (golden_test.py and golden_test_data). This ended up testing several components in one go:
    • the XCB protocol definitions for X11
    • the encoders and decoders that work off of the XCB protocol definitions
    • the pretty printer for the decoder
On the other hand, sometimes the overhead of having a separate file isn't worth it, and if the expected output is small (maybe 5 lines or so) I might instead paste it into the Python source as a multiline string.

It can be tempting to overuse golden tests. As with any test suite, avoid creating one big example that tries to cover all cases. (This is particularly tempting if the test is slow to run.) Try to create smaller, separate examples. The test helper above is not so good at this, because if you are not careful it can end up running Meld several times. In the past I have put several examples into (from unittest's point of view) one test case, so that it runs Meld only once.

Golden tests might not work so well if components can be changed independently. For example, if your XML library changes its whitespace pretty-printing, the tests' output could change. This is less of a problem if your code is deployed with tightly-controlled versions of libraries, because you can just update the golden files when you upgrade libraries.

A note on terminology: I think I got the term "golden file" from my previous workplace, where other people were using them, and the term seems to enjoy some limited use judging from Google. "Golden test", however, may have been a term that I have made up and that no-one else outside my workplace is using for this meaning.

Wednesday, 17 December 2008

Helper for monkey patching in tests

Following on from my post on TempDirTestCase, here is another Python test case helper which I introduced at work. Our codebase has quite a few test cases which use monkey patching. That is, they temporarily modify a module or a class to replace a function or method for the duration of the test.

For example, you might want to monkey patch time.time so that it returns repeatable timestamps during the test. We have quite a lot of test cases that do something like this:

class TestFoo(unittest.TestCase):

    def setUp(self):
        self._old_time = time.time
        def monkey_time():
            return 0
        time.time = monkey_time

    def tearDown(self):
        time.time = self._old_time

    def test_foo(self):
        # body of test case
Having to save and restore the old values gets tedious, particularly if you have to monkey patch several objects (and, unfortunately, there are a few tests that monkey patch a lot). So I introduced a monkey_patch() method so that the code above can be simplified to:
class TestFoo(TestCase):

    def test_foo(self):
        self.monkey_patch(time, "time", lambda: 0)
        # body of test case
(OK, I'm cheating by using a lambda the second time around to make the code look shorter!)

Now, monkey patching is not ideal, and I would prefer not to have to use it. When I write new code I try to make sure that it can be tested without resorting to monkey patching. So, for example, I would parameterize the software under test to take time.time as an argument instead of getting it directly from the time module. (here's an example).
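That looks something like the following; the Alarm class is made up for illustration:

import time

class Alarm(object):
    # Hypothetical example: take the clock as a constructor argument
    # instead of reaching for the time module directly.
    def __init__(self, deadline, time_func=time.time):
        self._deadline = deadline
        self._time_func = time_func

    def has_fired(self):
        return self._time_func() >= self._deadline

# In tests, inject a deterministic clock - no monkey patching needed:
alarm = Alarm(deadline=100, time_func=lambda: 99)
assert not alarm.has_fired()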

But sometimes you have to work with a codebase where most of the code is not covered by tests and is structured in such a way that adding tests is difficult. You could refactor the code to be more testable, but that risks changing its behaviour and breaking it. In that situation, monkey patching can be very useful. Once you have some tests, refactoring can become easier and less risky. It is then easier to refactor to remove the need for monkey patching -- although in practice it can be hard to justify doing that, because it is relatively invasive and might not be a big improvement, and so the monkey patching stays in.

Here's the code, an extended version of the base class from the earlier post:

import os
import shutil
import tempfile
import unittest

class TestCase(unittest.TestCase):

    def setUp(self):
        self._on_teardown = []

    def make_temp_dir(self):
        temp_dir = tempfile.mkdtemp(prefix="tmp-%s-" % self.__class__.__name__)
        def tear_down():
            shutil.rmtree(temp_dir)
        self._on_teardown.append(tear_down)
        return temp_dir

    def monkey_patch(self, obj, attr, new_value):
        old_value = getattr(obj, attr)
        def tear_down():
            setattr(obj, attr, old_value)
        self._on_teardown.append(tear_down)
        setattr(obj, attr, new_value)

    def monkey_patch_environ(self, key, value):
        old_value = os.environ.get(key)
        def tear_down():
            if old_value is None:
                del os.environ[key]
            else:
                os.environ[key] = old_value
        self._on_teardown.append(tear_down)
        os.environ[key] = value

    def tearDown(self):
        for func in reversed(self._on_teardown):
            func()

Thursday, 18 September 2008

Attribute access in format strings in Python 3.0

Here is another problem with securing Python 3.0: PEP 3101 has extended format strings so that they contain an attribute access syntax. This makes the format() method on strings too powerful. It exposes unrestricted use of getattr, so it can be used to read private attributes.

For example, the expression "{0._foo}".format(x) is equivalent to str(x._foo).

CapPython could work around this, but it would not be simple. It could block the "format" attribute (that is, treat it as a private attribute), although that is jarring because this word does not even look special. Lots of code will want to use format(), so we would have to rewrite this to a function call that interprets format strings safely. Having to do rewriting increases complexity. And if our safe format string interpreter is written in Python, it will be slower than the built-in version.
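For illustration, here is a rough sketch of what a safe format interpreter might look like, built on string.Formatter; the underscore check is my own crude approximation of CapPython's private-attribute rule:

import string

class SafeFormatter(string.Formatter):

    def get_field(self, field_name, args, kwargs):
        # Reject any field that dots or indexes its way into an
        # underscore-prefixed (private) name, e.g. "0._foo".
        for part in field_name.replace("[", ".").split("."):
            if part.startswith("_"):
                raise ValueError("private attribute in format string: %r"
                                 % (field_name,))
        return string.Formatter.get_field(self, field_name, args, kwargs)

safe_format = SafeFormatter().format
print safe_format("{0}", "hello")  # works as usual
# safe_format("{0._foo}", x)       # raises ValueError instead of
                                   # leaking x._foo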

My recommendation would be to take out the getattr-in-format-strings feature before Python 3.0 is released. Once the release has happened it would be much easier to add such a feature than to take it out.

It is a real shame that Python 3 has not adopted E's quasi-literals feature, which has been around for a while. Not only are quasi-literals more secure than PEP 3101's format strings, they are more general (because they allow any kind of subexpression) and could be faster (because more can be done at compile time).

Sunday, 14 September 2008

CapPython, unbound methods and Python 3.0

CapPython needs to block access to method functions. Luckily, one case is already covered by Python's unbound methods, but they are going away in Python 3.0.

Consider the following piece of Python code:

class C(object):

    def f(self):
        return self._field

x = C()

In Python, methods are built out of functions. "def" always defines a function. In this case, the function f is defined in a class scope, so the function gets wrapped up inside a class object, making it available as a method.

There are three ways in which we might use function f:

  • Via instances of class C as a normal method, e.g. x.f(). This is the common case. The expression x.f returns a bound method, which wraps the instance x and the function f.
  • Via class C, e.g. C.f(x). The expression C.f returns an unbound method. If you call this unbound method with C.f(y), it first checks that y is an instance of C. If that is the case, it calls f(x). Otherwise, it raises a TypeError.
  • Directly, as a function, assuming you can get hold of the unwrapped function. There are several ways to get hold of the function:
    • x.f.im_func or C.f.im_func. Bound and unbound methods make the function they wrap available via an attribute called "im_func".
    • In class scope, "f" is visible directly as a variable.
    • C.__dict__["f"]

CapPython allows the first two but aims to block direct use of method functions.

In CapPython, attribute access is restricted so that you can only access private attributes (those starting with an underscore) via a "self" variable inside a method function. For this to work, access to method functions must be restricted. Function f should only ever be used on instances of C and its subclasses.

Suppose that constraint was violated. If you could get hold of the unwrapped function f, you could apply it to an object y of any type, and f(y) would return the value of the private attribute, y._field. That would violate encapsulation.
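To make the danger concrete (a Python 2.x session, using class C from above):

class Other(object):
    def __init__(self):
        self._field = "secret"

y = Other()
# The unbound method checks the type of its first argument:
#   C.f(y)  raises TypeError
# but the raw function it wraps does not:
print C.f.im_func(y)  # prints "secret" - encapsulation broken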

To enforce encapsulation, CapPython blocks the paths for getting hold of f that are listed above, as well as some others:

  • "im_func" is treated as a private attribute, even though it doesn't start with an underscore.
  • In class scope, reading the variable f is forbidden. Or, to be more precise, if variable f is read, f is no longer treated as a method function, and its access to self._field is forbidden.
  • __dict__ is a private attribute, so the expression C.__dict__ is rejected.
  • Use of __metaclass__ is blocked, because it provides another way of getting hold of a class's __dict__.
  • Use of decorators is restricted.

Bound methods and unbound methods both wrap up function f so that it can be used safely.

However, this is changing in Python 3.0. Unbound methods are being removed. This means that C.f simply returns function f. If CapPython is going to work on Python 3.0, I am afraid it will have to become a lot more complicated. CapPython would have to apply rewriting to class definitions so that class objects do not expose unwrapped method functions. Pre-3.0 CapPython has been able to get away without doing source rewriting.

Pre-3.0 CapPython has a very nice property: It is possible for non-CapPython-verified code to pass classes and objects into verified CapPython code without allowing the latter to break encapsulation. The non-verified code has to be careful not to grant the CapPython code unsafe objects such as "type" or "globals" or "getattr", but the chances of doing that are fairly low, and this is something we could easily lint for. However, if almost every class in Python 3.0 provides access to objects that break CapPython's encapsulation (that is, method functions), so that the non-CapPython code must wrap every class, the risks of combining code in this way are significantly increased.

Ideally, I'd like to see this change in Python 3.0 reverted. Unbound methods were scheduled for removal in a one-liner in PEP 3100. This originated in a comment on Guido van Rossum's blog and a follow-on thread. The motivation seems to be to simplify the language, which is often good, but not in this case. However, I'm about 3 years too late, and Python 3.0 is scheduled to be released in the next few weeks.