Thursday, 7 August 2008

Introducing CapPython

Python is not a language that provides encapsulation. That is, it does not enforce any distinction between the private and public parts of an object. All attributes of an object are public from the language's point of view. Even functions are not encapsulated: you can reach into a function's internals through attributes such as func_closure and func_globals.

However, Python has a convention for private attributes of objects which is widely used. It's written down in PEP 0008 (from 2001). Attributes that start with an underscore are private. (Actually PEP 0008 uses the term "non-public" but let's put that aside for now.)
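To make it concrete that the underscore is advisory only, here is a small illustration (Counter is a made-up class, not from any library):

```python
class Counter(object):
    def __init__(self):
        self._count = 0  # "private" by convention, not by enforcement

c = Counter()
c._count = 42  # nothing stops outside code from reading or writing it
print(c._count)
```

The language itself raises no objection; only the convention says this is wrong.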

CapPython proposes to enforce this convention by defining a subset of Python in which it holds. The hope is that this subset could be an object-capability language, and that it can be done in such a way that you get encapsulation by default while still writing fairly idiomatic Python code.

The core idea is that private attributes may only be accessed through "self" variables. (We have to expand the definition of "private attribute" to include attributes starting with "func_" and some other prefixes that are used for Python built-in objects.)
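As a sketch of how this check might be done statically (a toy illustration only, not the actual CapPython verifier, which must also confirm that "self" really is the first argument of a method function), one could walk the syntax tree and flag private-attribute access on anything other than a variable named self:

```python
import ast

def check_private_access(source):
    """Toy checker: reject obj._attr unless obj is the variable 'self'.
    A real checker would also need to treat func_* and similar prefixes
    as private, and to distinguish genuine self variables."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            if not (isinstance(node.value, ast.Name)
                    and node.value.id == "self"):
                errors.append("private attribute %r accessed outside self"
                              % node.attr)
    return errors

print(check_private_access("d._dict"))     # flagged
print(check_private_access("self._dict"))  # allowed
```

Because this is purely syntactic, it needs no type information and no run-time machinery.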

As an example, suppose we want to implement a read-only wrapper around dictionary objects:

class FrozenDict(object):
    def __init__(self, dictionary):
        self._dict = dictionary
    def get(self, key):
        return self._dict.get(key)
    # This is incomplete: there are other methods in the dict interface.
You can do this:
>>> d = FrozenDict({"a": 1})
>>> d.get("a")
1
>>> d.set("a", 2)
AttributeError: 'FrozenDict' object has no attribute 'set'
but the following code is statically rejected:
>>> d._dict
because _dict is a private attribute and d is not a "self" variable.

A self variable is a variable that is the first argument of a method function. A method function is a function defined on a class (with some restrictions to prevent method functions from escaping and being used in ways that would break encapsulation).
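To see why those restrictions are needed, here is a toy example (Safe and Fake are made-up names) of the kind of escape that must be forbidden: if the raw method function can be obtained, its "self" parameter can be bound to any object at all, defeating the private-attribute check:

```python
class Safe(object):
    def __init__(self):
        self._secret = "s3cret"
    def hint(self):
        return len(self._secret)

# Pulling the plain function out of the class dictionary is exactly the
# kind of escape CapPython's restrictions must rule out:
leak = Safe.__dict__["hint"]

class Fake(object):
    pass

fake = Fake()
fake._secret = "xx"
print(leak(fake))  # "self" is now an arbitrary object
```

With the function in hand, the attacker can make any object play the role of "self", so the syntactic check on self variables would no longer mean anything.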

We also have to disallow all assignments to attributes (both public and private) except through "self". This is a harsher restriction. Otherwise a recipient of a FrozenDict could modify the object:

def my_function(key):
    return "Not the dictionary item you expected"
d.get = my_function
and the FrozenDict instance would no longer be frozen.

This scheme has some nice properties. As with lambda-style object definitions in E, encapsulation is enforced statically. No type checking is required; it's just a syntactic check. No run-time checks need to be added.

Furthermore, instance objects do not need to take any special steps to defend themselves; they are encapsulated by default. We don't need to wrap all objects to hide their private attributes (which is the approach that some attempts at a safer Python have taken). Class definitions do not need to inherit from some special base class. This means that TCB objects can be written in normal Python and passed into CapPython safely; they are defended by default from CapPython code.

However, class objects are not encapsulated by default. A class object has at least two roles: it acts as a constructor function, and it can be used to derive new classes. The new classes can access their instance objects' private attributes (which are really "protected" attributes in Java terminology - one reason why PEP 0008 does not use the word "private"). So you might want to make a class "final", as in not inheritable. One way to do that is to wrap the class so that the constructor is available, but the class itself is not:

class FrozenDict(object):
    ...
def make_frozen_dict(*args):
    return FrozenDict(*args)
The function make_frozen_dict is what you would export to other modules, while FrozenDict would be closely held.
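For illustration, the kind of subclassing this wrapping prevents might look like the following (LeakyDict is a hypothetical attacker class):

```python
class FrozenDict(object):
    def __init__(self, dictionary):
        self._dict = dictionary
    def get(self, key):
        return self._dict.get(key)

class LeakyDict(FrozenDict):
    def reveal(self):
        # Inside a subclass method, self._dict is a legal access, so
        # deriving from FrozenDict recovers the mutable dictionary.
        return self._dict

d = LeakyDict({"a": 1})
d.reveal()["a"] = 2  # the "frozen" dict has been mutated
```

Hand out only make_frozen_dict and this route is closed: the recipient can construct instances but has no class object to derive from.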

Maybe this wrapping should be done by default so that the class is encapsulated by default, but it's not yet clear how best to do so, or how the default would be overridden.
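One possibility (just a sketch; "sealed" is a hypothetical helper, not part of CapPython) is a higher-order function that takes a class and returns only its constructor:

```python
def sealed(cls):
    """Return a plain function that constructs instances of cls, so
    callers can make objects but cannot subclass or inspect the class."""
    def constructor(*args, **kwargs):
        return cls(*args, **kwargs)
    return constructor

class _Point(object):
    def __init__(self, x, y):
        self._x, self._y = x, y

Point = sealed(_Point)  # export Point; keep _Point closely held

p = Point(1, 2)  # construction still works
# class Sub(Point): ... would now fail, since Point is a function
```

How such sealing would interact with a default-on policy, and how a module would opt out of it, is exactly the open question.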

I have started writing a static verifier for CapPython. The code is on Launchpad. It is not yet complete. It does not yet block access to Python's builtin functions such as open, and it does not yet deal with Python's module system.

Tuesday, 5 August 2008

Four Python variable binding oddities

Python has some strange variable binding semantics. Here are some examples.

Oddity 1: If Python were a normal lambda language, you would expect the expression x to be equivalent to (lambda: x)(). I mean x to be a variable name here, but you would expect the equivalence to hold if x were any expression. However, there is one context in which the two are not equivalent: class scope.

x = 1
class C:
    x = 2
    print x
    print (lambda: x)()
Expected output:
2
2
Actual output:
2
1
There is a fairly good reason for this: a function body never looks names up in an enclosing class scope, only in enclosing function scopes and then the module scope.
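Methods defined in the class body behave the same way as the lambda, since they too introduce a function scope (C and m here are illustrative names):

```python
x = 1
class C(object):
    x = 2
    def m(self):
        return x  # resolves to the module-level x, not the class attribute

print(C().m())
```

This prints 1: if methods captured class-scope names, every method would quietly close over class attributes instead of looking them up on self.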

Oddity 2: This is also about class scope. If you're familiar with Python's list comprehensions and generator expressions, you might expect a list comprehension to be just a special case of a generator expression that evaluates the sequence up front.

x = 1
class C:
    x = 2
    print [x for y in (1,2)]
    print list(x for y in (1,2))
Expected output:
[2, 2]
[2, 2]
Actual output:
[2, 2]
[1, 1]
This happens for a mixture of good reasons and bad reasons. List comprehensions and generator expressions have different variable binding rules. Class scopes are somewhat odd, but at least they are consistent in their oddness. If list comprehensions were brought into line with generator expressions, you would actually expect this output:
[1, 1]
[1, 1]
Anything else would make class scopes behave less consistently.
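A related way to see that a generator expression's body runs in its own function-like scope: the body's free variables are looked up when the generator is consumed, not when it is written (f is an illustrative function):

```python
def f():
    x = "outer"
    gen = (x for y in (1, 2))  # body not executed yet
    x = "changed"
    return list(gen)           # body runs now and sees the later binding

print(f())
```

This returns ['changed', 'changed']: only the outermost iterable, (1, 2), is evaluated eagerly; everything else waits inside the generator's own scope.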

Oddity 3:

x = "top"
print (lambda: (["a" for x in (1,2)], x))()
print (lambda: (list("a" for x in (1,2)), x))()
Expected output might be:
(['a', 'a'], 'top')
(['a', 'a'], 'top')
Or if you're aware of list comprehension oddness, you might expect it to be:
(['a', 'a'], 2)
(['a', 'a'], 2)
(assuming this particular ordering of the "print" statements). But it's actually:
(['a', 'a'], 2)
(['a', 'a'], 'top')
If you thought that you can't assign to a variable in an expression in Python, you'd be wrong. This expression:
[1 for x in [100]]
is equivalent to this statement:
x = 100
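By contrast, the generator-expression form of the same loop keeps its variable to itself; a quick check:

```python
x = "top"
list(1 for x in [100])  # the loop variable x lives in the generator's scope
print(x)                # the enclosing binding is untouched
```

This prints "top": only the list comprehension performs the surprising assignment.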
Oddity 4: Back to class scopes again.
x = "xtop"
y = "ytop"
def func():
    x = "xlocal"
    y = "ylocal"
    class C:
        print x
        print y
        y = 1
func()
Naively you might expect it to print this:
xlocal
ylocal
If you know a bit more you might expect it to print something like this:
xlocal
Traceback ... UnboundLocalError: local variable 'y' referenced before assignment
(or a NameError instead of an UnboundLocalError)
Actually it prints this:
xlocal
ytop
I think this is the worst oddity, because I can't see a good use for it. For comparison, if you replace "class C" with a function scope, as follows:
x = "xtop"
y = "ytop"
def func():
    x = "xlocal"
    y = "ylocal"
    def g():
        print x
        print y
        y = 1
    g()
func()
then you get:
xlocal
Traceback ... UnboundLocalError: local variable 'y' referenced before assignment
I find that more reasonable.

Why bother? These issues become important if you want to write a verifier for an object-capability subset of Python. Consider an expression like this:

(lambda: ([open for open in (1,2)], open))()
It could be completely harmless, or it might be so dangerous that it could give the program that contains it the ability to read or write any of your files. You'd like to be able to tell. This particular expression is harmless. Or at least it is harmless until a new release of Python changes the semantics of list comprehensions...