Monday 25 January 2010

Why system calls should be message-passing

Message-passing syscalls on Linux include read(), write(), sendmsg() and recvmsg(). These are message-passing because:
  1. They take a file descriptor as an explicit argument. This specifies the object to send a message to or receive a message from.
  2. The message to send (or receive) consists of an array of bytes, and maybe an array of file descriptors too (via SCM_RIGHTS). The syscall interacts with the process's address space (or file descriptor table) in a well-defined, uniform way. The caller specifies which locations are read or written. The syscall acts as if it takes a copy of the message.

Linux has a lot of syscalls that are not message-passing because the object they operate on is not specified explicitly through a reference that authorises use of the object (such as a file descriptor). Instead they operate using the process's ambient authority. Examples:

  • open(), stat(), etc.: These operate on the file namespace (a combination of the process's root directory, current directory, and mount table; and for /proc the contents of the file namespace is also influenced by the process's identity).
  • kill(), ptrace(): These operate on the process ID namespace. (Unlike file descriptors, process IDs are not strong references. The mapping from process IDs to processes is ambiently available.)
  • mmap(), mprotect(): These operate on the process's address space, which is not a first class object.
Here are some advantages of implementing syscalls on top of a message-passing construct:
  1. It allows syscalls to be intercepted.

    Suppose that open() were just a library call implemented using sendmsg()/recvmsg() (as in Plash). It would send a message to a file namespace object (named via a file descriptor). This object can be replaced in order to tame the huge amount of authority that open() usually provides.

  2. It allows syscalls to be disabled.

    open() could be disabled by providing a file namespace object that doesn't implement an open() method, or by not providing a file namespace object.

  3. It can avoid race conditions in filtering syscalls.

    In the past people have attempted to use ptrace() to sandbox processes and give them limited access to the filesystem, by checking syscalls such as open() and allowing them through selectively (Subterfugue is an example). This is difficult or impossible to do securely because of a TOCTTOU race condition. open() doesn't take a filename; it takes an address, in the current process's address space, of a filename. It is not enough to catch the start of the open() syscall, check the filename, and allow the syscall through. Another thread might change the filename in the mean time. (This is aside from the race conditions involved in interpreting symlinks.)

    Systrace went to some trouble to copy filenames in the kernel to allow a tracing process to see and provide a consistent snapshot. This would have been less ad-hoc if the kernel had a uniform message-passing system.

    See "Exploiting Concurrency Vulnerabilities in System Call Wrappers".

  4. It aids logging of syscalls.

    On Linux, strace needs to have logic for interpreting every single syscall, because each syscall passes arguments in different ways, including how it reads and writes memory and the file descriptor table.

    If all syscalls went through a common message-passing interface, strace would only need one piece of logic for recording what was read or written. Furthermore, logging could be separated from decoding and formatting (such as turning syscall numbers into names).

  5. It allows consistency of code paths in the kernel, avoiding bugs.

    Mac OS X had a vulnerability in the TIOCGWINSZ ioctl(), which reads the width and height of a terminal window. The bug was that it would write directly to the address provided by the process, without checking whether the address was valid. This allowed any process to take over the kernel by writing to kernel memory.

    This wouldn't happen if ioctl() were message-passing, because all writing to the process's address space would be done in one place, in the syscall's return path. Forgetting the check would be much less likely.

    This bug demonstrates why ioctl() is dangerous. ioctl() should really be considered as a (huge) family of syscalls, not a single syscall, because each ioctl number (such as TIOCGWINSZ) can read or write address space, and sometimes the file descriptor table, in a different way.

  6. It enables implementations of interfaces to be moved between the kernel and userland.

    If the mechanism used to talk to the kernel is the same as the mechanism used to talk to other userland processes, processes should be agnostic as to whether the interfaces they use are implemented by the kernel.

    For example, NLTSS allowed the filesystem to be in-kernel (faster) or in a userland process (more robust and secure). So it was possible to flip a switch to trade off speed and robustness.

  7. It allows implementations of interfaces to be in-process too.

    This allows further performace tradeoffs. The pathname lookup logic of open() can be moved between the calling process and a separate process. For speed, pathname lookup can be placed in the process that implements the filesystem (as in Plash currently) in order to avoid doing a cross-process call for each pathname element. Alternatively, pathname lookup can be done in libc (as in the Hurd).

  8. It can help with versioning of system interfaces.

    Stable interfaces are nice, but the ability to evolve interfaces is nice too.

    Using object-based message-passing interfaces instead of raw syscalls can help with that. You can introduce new objects, or add new methods to existing objects. Old, obsolete interfaces can be defined in terms of new interfaces, and transparently implemented outside the kernel. New interfaces can be exposed selectively rather than system-wide.

  9. It does not have to hurt performance.

    Objects can still be implemented in the kernel. For example, in EROS (and KeyKOS/CapROS/Coyotos), various object types are implemented by the kernel, but are invoked through the same capability invocation mechanism as userland-implemented objects.

    Object invocations can be synchronous (call-return). They do not have to go via an asynchronous message queue. The kernel can provide a send-message-and-wait-for-reply syscall that is equivalent to a sendmsg()+recvmsg() combo but faster. L4 and EROS provide syscalls like this.