Dealing with modules and builtins has turned out to be easier than I thought.
At the time, I had in mind that CapPython could work like the Joe-E verifier. Joe-E is a static verifier for an object-capability subset of Java. If your Java code passes Joe-E, you can compile it, put it in your CLASSPATH, and load it with the normal Java class loader. If your Joe-E code uses any external classes, Joe-E statically checks that these classes (and the methods used on them) have been whitelisted as safe.
I had envisaged that CapPython would work in a similar way. Module imports would be checked against a whitelisted set. Uses of builtin functions would be checked: len would be allowed, but open would be rejected. CapPython code would be loaded through Python's normal module loader; code would be installed by adding it to PYTHONPATH as usual. (One difference from Joe-E is that CapPython would not be able to statically block the use of a method from a particular class or interface, because Python is not statically typed. CapPython can block methods only by their name.)
But there are some problems with this approach:
- This provides a way to block builtins such as getattr and open, but not to replace them. We would want to change getattr to reject (at runtime) attribute names starting with "_". We could introduce a safe_getattr function, but we'd have to change code to import and use a differently named function.
- It makes it hard to modify or replace modules.
- In turn, that makes it hard to subset modules. Suppose you want to allow os.path.join, but not the rest of os. Doing this via a static check is awkward; it would have to block use of os as a first class value.
- It makes it harder to apply rewriting to code, if it turns out that CapPython needs to be a rewriter and not just a verifier.
- It would require reasoning about Python's module loader.
- It's not clear who does the checking and who initiates the loading, so there could be a risk that the two are not operating on the same code.
- It relies on global state like PYTHONPATH, sys.modules and the filesystem.
- It doesn't let us instantiate modules multiple times. Sometimes you want to instantiate a module multiple times because it contains mutable state, or because you want the instantiations to use different imports. Some Python libraries have global state that would be awkward to remove, such as Twisted's default reactor (an event dispatcher). Joe-E rejects global state by rejecting static variables, but this is much harder to do in Python.
There is a simpler, more flexible approach, which is closer to what E and Caja/Cajita do: implement a custom module loader, and load all CapPython code through that. All imports that a CapPython module does would also go through the custom module loader.
It just so happens that Python provides enough machinery for that to work. Python's exec interface makes it possible to execute source code in a custom top-level scope, and that scope does not have to include Python's builtins if you set the special __builtins__ variable in the scope. Behind the scenes, Python's "import" statement is implemented via a function called __import__ which the interpreter looks up from a module's top-level scope as __builtins__["__import__"], so it can be replaced on a per-module basis.
Rather than trying to statically verify that Python's default global scope (which includes its default unsafe module importer and unsafe builtins) is used safely, we just substitute a safe scope - "scope substitution", as Mark Miller referred to it.
This is now implemented in CapPython. It has a "safe eval" function and a module loader. The aim is that the module loader will be able to run as untrusted CapPython code. Currently the module loader uses open to read files, so it has the run of the filesystem, but eventually that will be tamed.
I am surprised that custom module loaders are not used more often in Python.
I imagine Joe-E could work the same way, because Java allows you to create new class loaders.