Thursday, 18 September 2008

Attribute access in format strings in Python 3.0

Here is another problem with securing Python 3.0: PEP 3101 has extended format strings so that they contain an attribute access syntax. This makes the format() method on strings too powerful. It exposes unrestricted use of getattr, so it can be used to read private attributes.

For example, the expression "{0._foo}".format(x) is equivalent to str(x._foo).

CapPython could work around this, but it would not be simple. It could block the "format" attribute (that is, treat it as a private attribute), although that is jarring because this word does not even look special. Lots of code will want to use format(), so we would have to rewrite this to a function call that interprets format strings safely. Having to do rewriting increases complexity. And if our safe format string interpreter is written in Python, it will be slower than the built-in version.

My recommendation would be to take out the getattr-in-format-strings feature before Python 3.0 is released. Once the release has happened it would be much easier to add such a feature than to take it out.

It is a real shame that Python 3 has not adopted E's quasi-literals feature, which has been around for a while. Not only are quasi-literals more secure than PEP 3101's format strings, they are more general (because they allow any kind of subexpression) and could be faster (because more can be done at compile time).

Sunday, 14 September 2008

CapPython, unbound methods and Python 3.0

CapPython needs to block access to method functions. Luckily, one case is already covered by Python's unbound methods, but they are going away in Python 3.0.

Consider the following piece of Python code:

class C(object):

    def f(self):
        return self._field

x = C()

In Python, methods are built out of functions. "def" always defines a function. In this case, the function f is defined in a class scope, so the function gets wrapped up inside a class object, making it available as a method.

There are three ways in which we might use function f:

  • Via instances of class C as a normal method, e.g. x.f(). This is the common case. The expression x.f returns a bound method, which wraps the instance x and the function f.
  • Via class C, e.g. C.f(x). The expression C.f returns an unbound method. If you call this unbound method with C.f(y), it first checks that y is an instance of C. If that is the case, it calls f(x). Otherwise, it raises a TypeError.
  • Directly, as a function, assuming you can get hold of the unwrapped function. There are several ways to get hold of the function:
    • x.f.im_func or C.f.im_func. Bound and unbound methods make the function they wrap available via an attribute called "im_func".
    • In class scope, "f" is visible directly as a variable.
    • C.__dict__["f"]

CapPython allows the first two but aims to block direct use of method functions.

In CapPython, attribute access is restricted so that you can only access private attributes (those starting with an underscore) via a "self" variable inside a method function. For this to work, access to methods functions must be restricted. Function f should only ever be used on instances of C and its subclasses.

Suppose that constraint was violated. If you could get hold of the unwrapped function f, you could apply it to an object y of any type, and f(y) would return the value of the private attribute, y._field. That would violate encapsulation.

To enforce encapsulation, CapPython blocks the paths for getting hold of f that are listed above, as well as some others:

  • "im_func" is treated as a private attribute, even though it doesn't start with an underscore.
  • In class scope, reading the variable f is forbidden. Or, to be more precise, if variable f is read, f is no longer treated as a method function, and its access to self._field is forbidden.
  • __dict__ is a private attribute, so the expression C.__dict__ is rejected.
  • Use of __metaclass__ is blocked, because it provides another way of getting hold of a class's __dict__.
  • Use of decorators is restricted.

Bound methods and unbound methods both wrap up function f so that it can be used safely.

However, this is changing in Python 3.0. Unbound methods are being removed. This means that C.f simply returns function f. If CapPython is going to work on Python 3.0, I am afraid it will have to become a lot more complicated. CapPython would have to apply rewriting to class definitions so that class objects do not expose unwrapped method functions. Pre-3.0 CapPython has been able to get away without doing source rewriting.

Pre-3.0 CapPython has a very nice property: It is possible for non-CapPython-verified code to pass classes and objects into verified CapPython code without allowing the latter to break encapsulation. The non-verified code has to be careful not to grant the CapPython code unsafe objects such as "type" or "globals" or "getattr", but the chances of doing that are fairly low, and this is something we could easily lint for. However, if almost every class in Python 3.0 provides access to objects that break CapPython's encapsulation (that is, method functions), so that the non-CapPython code must wrap every class, the risks of combining code in this way are significantly increased.

Ideally, I'd like to see this change in Python 3.0 reverted. Unbound methods were scheduled for removal in a one-liner in PEP 3100. This originated in a comment on Guido van Rossum's blog and a follow-on thread. The motivation seems to be to simplify the language, which is often good, but not in this case. However, I'm about 3 years too late, and Python 3.0 is scheduled to be released in the next few weeks.

Dealing with modules and builtins in CapPython

In my previous post, Introducing CapPython, I wrote that CapPython "does not yet block access to Python's builtin functions such as open, and it does not yet deal with Python's module system".

Dealing with modules and builtins has turned out to be easier than I thought.

At the time, I had in mind that CapPython could work like the Joe-E verifier. Joe-E is a static verifier for an object-capability subset of Java. If your Java code passes Joe-E, you can compile it, put it in your CLASSPATH, and load it with the normal Java class loader. If your Joe-E code uses any external classes, Joe-E statically checks that these classes (and the methods used on them) have been whitelisted as safe.

I had envisaged that CapPython would work in a similar way. Module imports would be checked against a whitelisted set. Uses of builtin functions would be checked: len would be allowed, but open would be rejected. CapPython code would be loaded through Python's normal module loader; code would be installed by adding it to PYTHONPATH as usual. (One difference from Joe-E is that CapPython would not be able to statically block the use of a method from a particular class or interface, because Python is not statically typed. CapPython can block methods only by their name.)

But there are some problems with this approach:

  • This provides a way to block builtins such as getattr and open, but not to replace them. We would want to change getattr to reject (at runtime) attribute names starting with "_". We could introduce a safe_getattr function, but we'd have to change code to import and use a differently named function.
  • It makes it hard to modify or replace modules.
  • In turn, that makes it hard to subset modules. Suppose you want to allow os.path.join, but not the rest of os. Doing this via a static check is awkward; it would have to block use of os as a first class value.
  • It makes it harder to apply rewriting to code, if it turns out that CapPython needs to be a rewriter and not just a verifier.
  • It would require reasoning about Python's module loader.
  • It's not clear who does the checking and who initiates the loading, so there could be a risk that the two are not operating on the same code.
  • It relies on global state like PYTHONPATH, sys.modules and the filesystem.
  • It doesn't let us instantiate modules multiple times. Sometimes you want to instantiate a module multiple times because it contains mutable state, or because you want the instantiations to use different imports. Some Python libraries have global state that would be awkward to remove, such as Twisted's default reactor (an event dispatcher). Joe-E rejects global state by rejecting static variables, but this is much harder to do in Python.

There is a simpler, more flexible approach, which is closer to what E and Caja/Cajita do: implement a custom module loader, and load all CapPython code through that. All imports that a CapPython module does would also go through the custom module loader.

It just so happens that Python provides enough machinery for that to work. Python's exec interface makes it possible to execute source code in a custom top-level scope, and that scope does not have to include Python's builtins if you set the special __builtins__ variable in the scope. Behind the scenes, Python's "import" statement is implemented via a function called __import__ which the interpreter looks up from a module's top-level scope as __builtins__["__import__"], so it can be replaced on a per-module basis.

Rather than trying to statically verify that Python's default global scope (which includes its default unsafe module importer and unsafe builtins) is used safely, we just substitute a safe scope - "scope substitution", as Mark Miller referred to it.

This is now implemented in CapPython. It has a "safe eval" function and a module loader. The aim is that the module loader will be able to run as untrusted CapPython code. Currently the module loader uses open to read files, so it has the run of the filesystem, but eventually that will be tamed.

I am surprised that custom module loaders are not used more often in Python.

I imagine Joe-E could work the same way, because Java allows you to create new class loaders.