Saturday 23 October 2010

Process descriptors in FreeBSD-Capsicum

Capsicum is a set of new features for FreeBSD that adds better support for sandboxing, adding a capability mode in which the capabilities are Unix file descriptors (FDs). The features Capsicum adds are orthogonal, which is nice. One of the new features is process descriptors.

Capsicum adds a replacement for fork() called pdfork(), which returns a process descriptor (a new type of FD) rather than a PID. Similarly, there are replacements for wait() and kill() -- pdwait() and pdkill() -- which take FDs as arguments instead of PIDs.

The reason for the new interface is that kill() is not safe to allow in Capsicum's sandbox, because it provides ambient authority: it looks up its PID argument in a global namespace.

But even if you ignore sandboxing issues, this new interface is a significant improvement on POSIX process management:

  1. It allows the ability to wait on a process to be delegated to another process. In contrast, with wait()/waitpid(), a process's exit status can only be read by the process's parent.
  2. Process descriptors can be used with poll(). This avoids the awkwardness of having to use SIGCHLD, which doesn't work well if multiple libraries within the same process want to wait() for child processes.
  3. It gets rid of the race condition associated with kill(). Sending a signal to a PID is dodgy because the original process with this PID could have exited, and the kernel could have recycled the PID for an unrelated process, especially on a system where processes are spawned and exit frequently.

    kill() is only really safe when used by a parent process on its child, and only when the parent makes sure to use it before wait() has returned the child's exit status. pdkill() gets rid of this problem.

  4. In future, process descriptors can be extended to provide access to the process's internal state for debugging purposes, e.g. for reading registers and memory, or modifying memory mappings or the FD table. This would be an improvement on Linux's ptrace() interface.

However, there is one aspect of Capsicum's process descriptors that I think was a mistake: Dropping the process descriptor for a process causes the kernel to kill it. (By this I mean that if there are no more references to the process descriptor, because they have been close()'d or because the processes holding them have exited, the kernel will terminate the process.)

The usual principle of garbage collection in programming languages is that GC should not affect the observable behaviour of the program (except for resource usage). Capsicum's kill-on-close() behaviour violates that principle.

kill-on-close() is based on the assumption that the launcher of a subprocess always wants to check the subprocess's exit status. But this is not always the case. Exit status is just one communications channel, but not one we always care about. It's common to launch a process to communicate with it via IPC without wanting to have to wait() for it. (Double fork()ing is a way to do this in POSIX without leaving a zombie process.) Many processes are fire-and-forget.

The situation is analogous for threads. pthreads provides pthread_detach() and PTHREAD_CREATE_DETACHED as a way to launch a thread without having to pthread_join() it.

In Python, when you create a thread, you get a Thread object back, but if the Thread object is GC'd the thread won't be killed. In fact, finalisation of a thread object is implemented using pthread_detach() on Unix.

I know of one language implementation that (as I recall) will GC threads, Mozart/Oz, but this only happens when a thread can have no further observable effects. In Oz, if a thread blocks waiting for the resolution of a logic variable that can never be resolved (because no other thread holds a reference to it), then the thread can be GC'd. So deadlocked threads can be GC'd just fine. Similarly, if a thread holds no communications channels to the outside world, it can be GC'd safely.

Admittedly, Unix already violates this GC principle by the finalisation of pipe FDs and socket FDs, because dropping one endpoint of the socket/pipe pair is visible as an EOF condition on the other endpoint. However, this is usually used to free up resources or to unblock a process that would otherwise fail to make progress. Dropping a process descriptor would halt progress -- the opposite. Socket/pipe EOF can be used to do this too, but it is less common, and I'm not sure we should encourage this.

However, aside from this complaint, process descriptors are a good idea. It would be good to see them implemented in Linux.