Lacking Rhoticity: 2010

Tuesday, 21 December 2010

A common misconception about the Chrome sandbox

A common misconception about the Chrome web browser is that its sandbox protects one web site from another.

For example, suppose you are logged into your e-mail account on mail.com in one tab, and have evil.com open in another tab. Suppose evil.com finds an exploit in the renderer process, such as a memory safety bug, that lets it run arbitrary code there. Can evil.com get hold of your HTTP cookies for mail.com, and thereby access your e-mail account?

Unfortunately, the answer is yes.

The reason is that mail.com and evil.com can be assigned to the same renderer process. The browser does not only do this to save memory. evil.com can cause this to happen by opening an iframe on mail.com. With mail.com's code running in the same exploited renderer process, evil.com can take it over and read the cookies for your mail.com account and use them for its own ends.

There are a couple of reasons why the browser puts a framed site in the same renderer process as the parent site. Firstly, if the sites were handled by separate processes, the browser would have to do costly compositing across renderer processes to make the child frame appear inside the parent frame. Secondly, in some cases the DOM allows Javascript objects in one frame to obtain references to DOM objects in other frames, even across origins, and it is easier for this to be managed within one renderer process.

I don't say this to pick on Chrome, of course. It is better to have the sandbox than not to have it.

Chrome has never claimed that the sandbox protects one site against another. In the tech report "The Security Architecture of the Chromium Browser" (Barth, Jackson, Reis and the Chrome Team; 2008), "Origin isolation" is specifically listed under "Out-of-scope goals". They state that "an attacker who compromises the rendering engine can act on behalf of any web site".

There are a couple of ways that web sites and users can mitigate this problem, which I'll discuss in another post. However, in the absence of those defences, what Chrome's multi-process architecture actually gives you is the following:

Robustness if a renderer crashes. Having multiple renderer processes means that a crash of one takes down only a limited number of tabs, and the browser and the other renderers will survive. It also helps memory management.
But we can get this without sandboxing the renderers.
Protection of the rest of the user's system from vulnerabilities in the renderer process. For example, the sandboxed renderer cannot read any of the user's files, except for those the user has granted through a "File Upload" file chooser.
But we can get this by sandboxing the whole browser (including any subprocesses), without needing to have the browser separated from the renderer.
For example, since 2007 I have been running Firefox under Plash (a sandbox), on Linux.
In principle, such a sandbox should be more effective at protecting applications and files outside the browser than the Chrome sandbox, because the sandbox covers all of the browser, including its network stack and the so-called browser "chrome" (this means the parts of the GUI outside of the DOM).
In practice, Plash is not complete as a sandbox for GUI apps because it does not limit access to the X Window System, so apps can do things that X allows such as screen scraping other apps and sending them input.

The main reason Chrome was developed to sandbox its renderer processes but not the whole browser is that this is easier to implement with sandboxing technologies that are easily deployable today. Ideally, though, the whole browser would be sandboxed. One of the only components that would stay unsandboxed, and have access to all the user's files, would be the "File Open" dialog box for choosing files to upload.

Saturday, 18 December 2010

When printf debugging is a luxury

Inserting printf() calls is often considered to be a primitive fallback when other debugging tools are not available, such as stack backtraces with source line numbers.

But there are some situations in low-level programming where most libc calls don't work and so even printf() and assert() are unavailable luxuries. This can happen:

when libc is not properly initialised yet;
when we writing code that is called by libc and cannot re-enter libc code;
when we are in a signal handler;
when only limited stack space is available;
when we cannot allocate memory for some reason; or
when we are not even linked to libc.

Here's a fragment of code that has come in handy in these situations. It provides a simple assert() implementation:

#include <string.h>
#include <unistd.h>

static void debug(const char *msg) {
  write(2, msg, strlen(msg));
}

static void die(const char *msg) {
  debug(msg);
  _exit(1);
}

#define TO_STRING_1(x) #x
#define TO_STRING(x) TO_STRING_1(x)

#define assert(expr) {                                                        \
  if (!(expr)) die("assertion failed at " __FILE__ ":" TO_STRING(__LINE__)    \
                   ": " #expr "\n"); }

By using preprocessor trickery to construct the assertion failure string at compile time, it avoids having to format the string at runtime. So it does not need to allocate memory, and it doesn't need to do multiple write() calls (which can become interleaved with other output in the multi-threaded case).

Sometimes even libc's write() is a luxury. In some builds of GNU libc on Linux, glibc's syscall wrappers use the TLS register (%gs on i386) to fetch the address of a routine for making syscalls.

However, if %gs is not set up properly for some reason, this will fail. For example, for Native Client's i386 sandbox, %gs is set to a different value whenever sandboxed code is running, and %gs stays in this state if sandboxed code faults and triggers a signal handler. In Chromium's seccomp-sandbox, %gs is set to zero in the trusted thread.

In those situations we have to bypass libc and do the system calls ourselves. The following snippet comes from reference_trusted_thread.cc. The sys_*() functions are defined by linux_syscall_support.h, which provides wrappers for many Linux syscalls:

#include "linux_syscall_support.h"

void die(const char *msg) {
  sys_write(2, msg, strlen(msg));
  sys_exit_group(1);
}

Thursday, 4 November 2010

An introduction to FreeBSD-Capsicum

In my last blog post, I described one of the features in FreeBSD-Capsicum: process descriptors. Now it's time for an overview of Capsicum.

Capsicum is a set of new features for FreeBSD that adds better support for sandboxing, using a capability model in which the capabilities are Unix file descriptors (FDs).

Capsicum takes a fairly conservative approach, in that it does not make operations on file descriptors virtualisable. This approach has some limitations -- we do not get the advantages of having purely message-passing syscalls. However, it does mean that the new features are orthogonal.

The main new features are:

A per-process "capability mode", which is turned on via a new cap_enter() syscall.
This mode disables any system call that provides ambient authority. So it disables system calls that use global namespaces, including the file namespace (e.g. open()), the PID namespace (e.g. kill()) and the network address namespace (e.g. connect()).
This is not just a syscall filter, though. Some system calls optionally use a global namespace. For example, sendmsg() and sendto() optionally take a socket address. For openat(), the directory FD can be omitted. Capability mode disables those cases.
Furthermore, capability mode disallows the use of ".." (parent directory) in filenames for openat() and the other *at() calls. This changes directory FDs to be limited-authority objects that convey access to a specific directory and not the whole filesystem. (It is interesting that this appears to be a property of the process, via capability mode, rather than of the directory FD itself.)
Capability mode is inherited across fork and exec.
Finer-grained permissions for file descriptors. Each FD gets a large set of permission bits. A less-permissive copy of an FD can be created with cap_new(). For example, you can have read-only directory FDs, or non-seekable FDs for files.
Process descriptors. Capsicum doesn't allow kill() inside the sandbox because kill() uses a global namespace (the PID namespace). So Capsicum introduces process descriptors (a new FD type) as a replacement for process IDs, and adds pdfork(), pdwait() and pdkill() as replacements for fork(), wait() and kill().

Plus there are a couple of smaller features:

Message-based sockets. The Capsicum guys implemented Linux's SOCK_SEQPACKET interface for FreeBSD.
An fexecve() system call which takes a file descriptor for an executable. This replaces execve(), which is disabled in capability mode because execve() takes a filename.
Capsicum's fexecve() ignores the implicit filename that is embedded in the executable's PT_INTERP field, so it is only good for loading the dynamic linker directly or for loading other statically linked executables.

Currently, the only programs that run under Capsicum are those that have been ported specially:

The Capsicum guys ported Chromium, and it works much the same way as on Linux. On both systems, Chromium's renderer process runs sandboxed, but the browser process does not. On both systems, Chromium needs to be able to turn on sandboxing after the process has started up, because it relies on legacy libraries that use open() during startup.
Some Unix utilities, including gzip and dhclient, have been extended to use sandboxing internally (privilege separation). Like Chromium, gzip can open files and then switch to capability mode.

However, it should be possible to run legacy Unix programs under Capsicum by porting Plash.

At first glance, it looks like Plash would have to do the same tricks under FreeBSD-Capsicum as it does under Linux to run legacy programs. Under Linux, Plash uses a modified version of glibc in order to intercept its system calls and convert them to system calls that work in the sandbox. That's because the Linux kernel doesn't provide any help with intercepting the system calls. The situation is similar under FreeBSD -- Capsicum does not add any extensions for bouncing syscalls back to a user space handler.

However, there are two aspects of FreeBSD that should make Plash easier to implement there than on Linux:

FreeBSD's libc is friendlier towards overriding its functions. On both systems, it is possible to override (for example) open() via an LD_PRELOAD library that defines its own "open" symbol. But with glibc on Linux, this doesn't work for libc's internal calls to open(), such as from fopen(). For a small gain in efficiency, these calls don't go through PLT entries and so cannot be intercepted.
FreeBSD's libc doesn't use this optimisation and so it allows the internal calls to be intercepted too.
FreeBSD's dynamic linker and libc are not tightly coupled, so it is possible to change the dynamic linker to open its libraries via IPC calls without having to rebuild libc in lockstep.
In contrast, Linux glibc's ld.so and libc.so are built together, share some data structures (such as TLS), and cannot be replaced independently.

Saturday, 23 October 2010

Process descriptors in FreeBSD-Capsicum

Capsicum is a set of new features for FreeBSD that adds better support for sandboxing, adding a capability mode in which the capabilities are Unix file descriptors (FDs). The features Capsicum adds are orthogonal, which is nice. One of the new features is process descriptors.

Capsicum adds a replacement for fork() called pdfork(), which returns a process descriptor (a new type of FD) rather than a PID. Similarly, there are replacements for wait() and kill() -- pdwait() and pdkill() -- which take FDs as arguments instead of PIDs.

The reason for the new interface is that kill() is not safe to allow in Capsicum's sandbox, because it provides ambient authority: it looks up its PID argument in a global namespace.

But even if you ignore sandboxing issues, this new interface is a significant improvement on POSIX process management:

It allows the ability to wait on a process to be delegated to another process. In contrast, with wait()/waitpid(), a process's exit status can only be read by the process's parent.
Process descriptors can be used with poll(). This avoids the awkwardness of having to use SIGCHLD, which doesn't work well if multiple libraries within the same process want to wait() for child processes.
It gets rid of the race condition associated with kill(). Sending a signal to a PID is dodgy because the original process with this PID could have exited, and the kernel could have recycled the PID for an unrelated process, especially on a system where processes are spawned and exit frequently.
kill() is only really safe when used by a parent process on its child, and only when the parent makes sure to use it before wait() has returned the child's exit status. pdkill() gets rid of this problem.
In future, process descriptors can be extended to provide access to the process's internal state for debugging purposes, e.g. for reading registers and memory, or modifying memory mappings or the FD table. This would be an improvement on Linux's ptrace() interface.

However, there is one aspect of Capsicum's process descriptors that I think was a mistake: Dropping the process descriptor for a process causes the kernel to kill it. (By this I mean that if there are no more references to the process descriptor, because they have been close()'d or because the processes holding them have exited, the kernel will terminate the process.)

The usual principle of garbage collection in programming languages is that GC should not affect the observable behaviour of the program (except for resource usage). Capsicum's kill-on-close() behaviour violates that principle.

kill-on-close() is based on the assumption that the launcher of a subprocess always wants to check the subprocess's exit status. But this is not always the case. Exit status is just one communications channel, but not one we always care about. It's common to launch a process to communicate with it via IPC without wanting to have to wait() for it. (Double fork()ing is a way to do this in POSIX without leaving a zombie process.) Many processes are fire-and-forget.

The situation is analogous for threads. pthreads provides pthread_detach() and PTHREAD_CREATE_DETACHED as a way to launch a thread without having to pthread_join() it.

In Python, when you create a thread, you get a Thread object back, but if the Thread object is GC'd the thread won't be killed. In fact, finalisation of a thread object is implemented using pthread_detach() on Unix.

I know of one language implementation that (as I recall) will GC threads, Mozart/Oz, but this only happens when a thread can have no further observable effects. In Oz, if a thread blocks waiting for the resolution of a logic variable that can never be resolved (because no other thread holds a reference to it), then the thread can be GC'd. So deadlocked threads can be GC'd just fine. Similarly, if a thread holds no communications channels to the outside world, it can be GC'd safely.

Admittedly, Unix already violates this GC principle by the finalisation of pipe FDs and socket FDs, because dropping one endpoint of the socket/pipe pair is visible as an EOF condition on the other endpoint. However, this is usually used to free up resources or to unblock a process that would otherwise fail to make progress. Dropping a process descriptor would halt progress -- the opposite. Socket/pipe EOF can be used to do this too, but it is less common, and I'm not sure we should encourage this.

However, aside from this complaint, process descriptors are a good idea. It would be good to see them implemented in Linux.

Wednesday, 11 August 2010

My workflow with git-cl + Rietveld

Git's model of changes (which is shared by Mercurial, Bazaar and Monotone) makes it awkward to revise earlier patches. This can make things difficult when you are sending out multiple, dependent changes for code review.

Suppose I create changes A and B. B depends functionally on A, i.e. tests will not pass for B without A also being applied. There might or might not be a textual dependency (B might or might not modify lines of code modified by A).

Because code review is slow (high latency), I need to be able to send out changes A and B for review and still be able to continue working on further changes. But I also need to be able to revisit A to make changes to it based on review feedback, and then make sure B works with the revised A.

What I do is create separate branches for A and B, where B branches off of A. To revise change A, I "git checkout" its branch and add further commits. Later I can update B by checking it out and rebasing it onto the current tip of A. Uploading A or B to the review system or committing A or B upstream (to SVN) involves squashing their branch's commits into one commit. (This squashing means the branches contain micro-history that reviewers don't see and which is not kept after changes are pushed upstream.)

The review system in question is Rietveld, the code review web app used for Chromium and Native Client development. Rietveld does not have any special support for patch series -- it is only designed to handle one patch at a time, so it does not know about dependencies between changes. The tool for uploading changes from Git to Rietveld and later committing them to SVN is "git-cl" (part of depot_tools).

git-cl is intended to be used with one branch per change-under-review. However, it does not have much support for handling changes which depend on each other.

This workflow has a lot of problems:

When using git-cl on its own, I have to manually keep track that B is to be rebased on to A. When uploading B to Rietveld, I must do "git cl upload A". When updating B, I must first do "git rebase A". When diffing B, I have to do "git diff A". (I have written a tool to do this. It's not very good, but it's better than doing it manually.)
Rebasing B often produces conflicts if A has been squash-committed to SVN. That's because if branch A contained multiple patches, Git doesn't know how to skip over patches from A that are in branch B.
Rebasing loses history. Undoing a rebase is not easy.
In the case where B doesn't depend on A, rebasing branch B so that it doesn't include the contents of branch A is a pain. (Sometimes I will stack B on top of A even when it doesn't depend on A, so that I can test the changes together. An alternative is to create a temporary branch and "git merge" A and B into it, but creating further branches adds to the complexity.)
If there is a conflict, I don't find out about it until I check out and update the affected branch.
This gets even more painful if I want to maintain changes that are not yet ready for committing or posting for review, and apply them alongside changes that are ready for review.

There are all reasons why I would not recommend this workflow to someone who is not already very familiar with Git.

The social solution to this problem would be for code reviews to happen faster, which would reduce the need to stack up changes. If all code reviews reached a conclusion within 24 hours, that would be an improvement. But I don't think that is going to happen.

The technical solution would be better patch management tools. I am increasingly thinking that Darcs' set-of-patches model would work better for this than Git's DAG-of-commits model. If I could set individual patches to be temporarily applied or unapplied to the working copy, and reorder and group patches, I think it would be easier to revisit changes that I have posted for review.

Friday, 6 August 2010

CVS's problems resurface in Git

Although modern version control systems have improved a lot on CVS, I get the feeling that there is a fundamental version control problem that the modern VCSes (Git, Mercurial, Bazaar, and I'll include Subversion too!) haven't solved. The curious thing is that CVS had sort of made some steps towards addressing it.

In CVS, history is stored per file. If you commit a change that crosses multiple files, CVS updates each file's history separately. This causes a bunch of problems:

CVS does not represent changesets or snapshots as first class objects. As a result, many operations involve visiting every file's history.
Reconstructing a changeset involves searching all files' histories to match up the individual file changes. (This was just about possible, though I hear there are tricky corner cases. Later CVS added a commit ID field that presumably helped with this.)
Creating a tag at the latest revision involves adding a tag to every file's history. Reconstructing a tag, or a time-based snapshot, involves visiting every file's history again.
CVS does not represent file renamings, so the standard history tools like "cvs log" and "cvs annotate" are not able to follow a file's history from before it was renamed.

In the DAG-based decentralised VCSes (Git, Mercurial, Monotone, Bazaar), history is stored per repository. The fundamental data structure for history is a Directed Acyclic Graph of commit objects. Each commit points to a snapshot of the entire file tree plus zero or more parent commits. This addresses CVS's problems:

Extracting changesets is easy because they are the same thing as commit objects.
Creating a tag is cheap and easy. Recording any change creates a commit object (a snapshot-with-history), so creating a tag is as simple as pointing to an already-existing commit object.

However, often it is not practical to put all the code that you're interested in into a single Git repository! (I pick on Git here because, of the DAG-based systems, it is the one I am most familar with.) While it can be practical to do this with Subversion or CVS, it is less practical with the DAG-based decentralised VCSes:

In the DAG-based systems, branching is done at the level of a repository. You cannot branch and merge subdirectories of a repository independently: you cannot create a commit that only partially merges two parent commits.
Checking out a Git repository involves downloading not only the entire current revision, but the entire history. So this creates pressure against putting two partially-related projects together in the same repository, especially if one of the projects is huge.
Existing projects might already use separate repositories. It is usually not practical to combine those repositories into a single repository, because that would create a repo that is incompatible with the original repos. That would make it difficult to merge upstream changes. Patch sharing would become awkward because the filenames in patches would need fixing.

This all means that when you start projects, you have to decide how to split your code among repositories. Changing these decisions later is not at all straightforward.

The result of this is that CVS's problems have not really been solved: they have just been pushed up a level. The problems that occurred at the level of individual files now occur at the level of repositories:

The DAG-based systems don't represent changesets that cross repositories. They don't have a type of object for representing a snapshot across repositories.
Creating a tag across repositories would involve visiting every repository to add a tag to it.
There is no support for moving files between repositories while tracking the history of the file.

The funny thing is that since CVS hit this problem all the time, the CVS tools were better at dealing with multiple histories than Git.

To compare the two, imagine that instead of putting your project in a single Git repository, you put each one of the project's files in a separate Git repository. This would result in a history representation that is roughly equivalent to CVS's history representation. i.e. Every file has its own separate history graph.

To check in changes to multiple files, you have to "cd" to each file's repository directory, and "git commit" and "git push" the file change.
To update to a new upstream version, or to switch branch, you have to "cd" to each file's repository directory again to do "git pull/fetch/rebase/checkout" or whatever.
Correlating history across files must be done manually. You could run "git log" or "gitk" on two repositories and match up the timelines or commit messages by hand. I don't know of any tools for doing this.

In contrast, for CVS, "cvs commit" works across multiple files and (if I remember rightly) even across multiple working directories. "cvs update" works across multiple files.

While "cvs log" doesn't work across multiple files, there is a tool called "CVS Monitor" which reconstructs history and changesets across files.

Experience with CVS suggests that Git could be changed to handle the multiple-repository case better. "git commit", "git checkout" etc. could be changed to operate across multiple Git working copies. Maybe "git log" and "gitk" could gain options to interleave histories by timestamp.

Of course, that would lead to cross-repo support that is only as good as CVS's cross-file support. We might be able to apply a textual tag name across multiple Git repos with a single command just as a tag name can be applied across files with "cvs tag". But that doesn't give us an immutable tag object than spans repos.

My point is that the fundamental data structure used in the DAG-based systems doesn't solve CVS's problem, it just postpones it to a larger level of granularity. Some possible solutions to the problem are DEPS files (as used by Chromium), Git submodules, or Darcs-style set-of-patches repos. These all introduce new data structures. Do any of these solve the original problem? I am undecided -- this question will have to wait for another post. :-)

Wednesday, 5 May 2010

The trouble with Buildbot

The trouble with Buildbot is that it encourages you to put rules into a Buildbot-specific build configuration that is separate from the normal configuration files that you might use to build a project (configure scripts, makefiles, etc.).

This is not a big problem if your Buildbot configuration is simple and just consists of, say, "svn up", "./configure", "make", "make test", and never changes.

But it is a problem if your Buildbot configuration becomes non-trivial and ever has to be updated, because the Buildbot configuration cannot be tested outside of Buildbot.

The last time I had to maintain a Buildbot setup, it was necessary to try out configuration changes directly on the Buildbot master. This doesn't work out well if multiple people are responsible for maintaining the setup! Whoever makes a change has to remember to check it in to version control after they've got it working, which of course doesn't always happen. It's a bit ironic that Buildbot is supposed to support automated testing but doesn't follow best practices for testing itself.

There is a simple way around this though: Instead of putting those separate steps -- "./configure", "make", "make test" -- into the Buildbot config, put them into a script, check the script into version control, and have the Buildbot config run that script. Then the Buildbot config just consists of doing "svn up" and running the script. It is then possible to test changes to the script before checking it in. I've written scripts like this that go as far as debootstrapping a fresh Ubuntu chroot to run tests in, which ensures your package dependency list is up to date.

Unfortunately, Buildbot's logging facilities don't encourage having a minimal Buildbot config.

If you use a complicated Buildbot configuration with many Buildbot steps, Buildbot can display each step separately in its HTML-formatted logs. This means:

you can see progress;
you can see which steps have failed;
you'd be able to see how long the steps take if Buildbot actually displayed that.

Whereas if you have one big build script in a single Buildbot step, all the output goes into one big, flat, plain text log file.

I think the solution is to decouple the structured-logging functionality from the glorified-cron functionality that Buildbot provides. My implementation of this is build_log.py (used in Plash's build scripts), which I'll write more about later.

Saturday, 1 May 2010

Breakpoints in gdb using int3

Here is a useful trick I discovered recently while debugging some changes to the seccomp sandbox. To trigger a breakpoint on x86, just do:

__asm__("int3");

Then it is possible to inspect registers, memory, the call stack, etc. in gdb, without having to get gdb to set a breakpoint. The instruction triggers a SIGTRAP, so if the process is not running under gdb and has no signal handler for SIGTRAP, the process will die.

This technique seems to be fairly well known, although it's not mentioned in the gdb documentation. int3 is the instruction that gdb uses internally for setting breakpoints.

Sometimes it's easier to insert an int3 and rebuild than get gdb to set a breakpoint. For example, setting a gdb breakpoint on a line won't work in the middle of a chunk of inline assembly.

My expectations of gdb are pretty low these days. When I try to use it to debug something low level, it often doesn't work, which is why I have been motivated to hack together my own debugging tools in the past. For example, if I run gdb on the glibc dynamic linker (ld.so) on Ubuntu Hardy or Karmic, it gives:

$ gdb /lib/ld-linux-x86-64.so.2 
...
(gdb) run
Starting program: /lib/ld-linux-x86-64.so.2 
Cannot access memory at address 0x21ec88
(gdb)

So it's nice to find a case where I can get some useful information out of gdb.

Monday, 1 February 2010

How to build adb, the Android debugger

adb is the Android debugger (officially the "Android debug bridge" I think). It is a tool for getting shell access to an Android phone across a USB connection. It can also be used to copy files to and from the Android device and do port-forwarding. In short, it is similar to ssh, but is not ssh. (Why couldn't they have just used ssh?)

I have not been able to find any Debian/Ubuntu packages for adb. The reason why it has not been packaged becomes apparent if you try to build the thing. Android has a monolithic build system which wants to download a long list of Git repositories and build everything. If you follow the instructions, it will download 1.1Gb from Git and leave you with a source+build directory of 6Gb. It isn't really designed for building subsets of components, unlike, say, JHBuild. It's basically a huge makefile. It doesn't know about dependencies between components. However, it does have some idea about dependencies between output files.

Based on a build-of-all-Android, I figured out how to build a much smaller subset containing adb. This downloads a more manageable 11Mb and finishes with a source+build directory containing 40Mb. This is also preferable to downloading the pre-built Android SDK, which has a non-free licence.

Instructions:

$ sudo apt-get install build-essential libncurses5-dev
$ git clone git://android.git.kernel.org/platform/system/core.git system/core
$ git clone git://android.git.kernel.org/platform/build.git build
$ git clone git://android.git.kernel.org/platform/external/zlib.git external/zlib
$ git clone git://android.git.kernel.org/platform/bionic.git bionic
$ echo "include build/core/main.mk" >Makefile

Now edit build/core/main.mk and comment out the parts labelled

 # Check for the correct version of java

and

 # Check for the correct version of javac

Since adb doesn't need Java, these checks are unnecessary.

Also edit build/target/product/sdk.mk and comment out the "include" lines after

 # include available languages for TTS in the system image

I don't know exactly what this is about but it avoids having to download language files that aren't needed for adb. Then building the adb target should work:

make out/host/linux-x86/bin/adb

If you try running "adb shell" you might get this:

ubuntu$ ./out/host/linux-x86/bin/adb shell
* daemon not running. starting it now *
* daemon started successfully *
error: insufficient permissions for device

So you probably need to do "adb start-server" as root first:

ubuntu$ sudo ./out/host/linux-x86/bin/adb kill-server
ubuntu$ sudo ./out/host/linux-x86/bin/adb start-server
* daemon not running. starting it now *
* daemon started successfully *
ubuntu$ ./out/host/linux-x86/bin/adb shell
$

For the record, here are the errors I got that motivated each step:

make: *** No rule to make target `external/svox/pico/lang/PicoLangItItInSystem.mk'.  Stop.

- hence commenting out the picolang includes.

system/core/libzipfile/zipfile.c:6:18: error: zlib.h: No such file or directory

- I'm guessing adb needs libzipfile which needs zlib.

```
system/core/libcutils/mspace.c:59:50: error: ../../../bionic/libc/bionic/dlmalloc.c: No such file or directory
```
- This is why we need to download bionic (the C library used on Android), even though we aren't building any code to run on an Android device. This is the ugliest part and it illustrates why this is not a modular build system. The code does
```
#include "../../../bionic/libc/bionic/dlmalloc.c"
```
to #include a file from another module. It seems any part of the build can refer to any other part, via relative pathnames, so the modules cannot be build separately. I don't know whether this is an isolated case, but it makes it difficult to put adb into a Debian package.
```
host Executable: adb (out/host/linux-x86/obj/EXECUTABLES/adb_intermediates/adb)
/usr/bin/ld: cannot find -lncurses
```
- hence the ncurses-dev dependency above. However, this error is a little odd because if adb really depended on ncurses, it would fail when it tried to #include a header file. Linking with "-lncurses" is probably superfluous.

The instructions above will probably stop working as new versions are pushed to the public Git branches. (However, this happens infrequently because Android development is not done in the open.) For reproducibility, here are the Git commit IDs:

$ find -name "*.git" -exec sh -c 'echo "`git --git-dir={} rev-parse HEAD` {}"' ';'
91a54c11cbfbe3adc1df2f523c75ad76affb0ae9 ./system/core/.git
95604529ec25fe7923ba88312c590f38aa5e3d9e ./bionic/.git
890bf930c34d855a6fbb4d09463c1541e90381d0 ./external/zlib/.git
b7c844e7cf05b4cea629178bfa793321391d21de ./build/.git

It looks like the current version is Android 1.6 (Donut):

$ find -name "*.git" -exec sh -c 'echo "`git --git-dir={} describe` {}"' ';'
android-1.6_r1-80-g91a54c1 ./system/core/.git
android-1.6_r1-43-g9560452 ./bionic/.git
android-1.6_r1-7-g890bf93 ./external/zlib/.git
android-sdk-1.6-docs_r1-65-gb7c844e ./build/.git

Monday, 25 January 2010

Why system calls should be message-passing

Message-passing syscalls on Linux include read(), write(), sendmsg() and recvmsg(). These are message-passing because:

They take a file descriptor as an explicit argument. This specifies the object to send a message to or receive a message from.
The message to send (or receive) consists of an array of bytes, and maybe an array of file descriptors too (via SCM_RIGHTS). The syscall interacts with the process's address space (or file descriptor table) in a well-defined, uniform way. The caller specifies which locations are read or written. The syscall acts as if it takes a copy of the message.

Linux has a lot of syscalls that are not message-passing because the object they operate on is not specified explicitly through a reference that authorises use of the object (such as a file descriptor). Instead they operate using the process's ambient authority. Examples:

open(), stat(), etc.: These operate on the file namespace (a combination of the process's root directory, current directory, and mount table; and for /proc the contents of the file namespace is also influenced by the process's identity).
kill(), ptrace(): These operate on the process ID namespace. (Unlike file descriptors, process IDs are not strong references. The mapping from process IDs to processes is ambiently available.)
mmap(), mprotect(): These operate on the process's address space, which is not a first class object.

Here are some advantages of implementing syscalls on top of a message-passing construct:

It allows syscalls to be intercepted.
Suppose that open() were just a library call implemented using sendmsg()/recvmsg() (as in Plash). It would send a message to a file namespace object (named via a file descriptor). This object can be replaced in order to tame the huge amount of authority that open() usually provides.
It allows syscalls to be disabled.
open() could be disabled by providing a file namespace object that doesn't implement an open() method, or by not providing a file namespace object.
It can avoid race conditions in filtering syscalls.
In the past people have attempted to use ptrace() to sandbox processes and give them limited access to the filesystem, by checking syscalls such as open() and allowing them through selectively (Subterfugue is an example). This is difficult or impossible to do securely because of a TOCTTOU race condition. open() doesn't take a filename; it takes an address, in the current process's address space, of a filename. It is not enough to catch the start of the open() syscall, check the filename, and allow the syscall through. Another thread might change the filename in the mean time. (This is aside from the race conditions involved in interpreting symlinks.)
Systrace went to some trouble to copy filenames in the kernel to allow a tracing process to see and provide a consistent snapshot. This would have been less ad-hoc if the kernel had a uniform message-passing system.
See "Exploiting Concurrency Vulnerabilities in System Call Wrappers".
It aids logging of syscalls.
On Linux, strace needs to have logic for interpreting every single syscall, because each syscall passes arguments in different ways, including how it reads and writes memory and the file descriptor table.
If all syscalls went through a common message-passing interface, strace would only need one piece of logic for recording what was read or written. Furthermore, logging could be separated from decoding and formatting (such as turning syscall numbers into names).
It allows consistency of code paths in the kernel, avoiding bugs.
Mac OS X had a vulnerability in the TIOCGWINSZ ioctl(), which reads the width and height of a terminal window. The bug was that it would write directly to the address provided by the process, without checking whether the address was valid. This allowed any process to take over the kernel by writing to kernel memory.
This wouldn't happen if ioctl() were message-passing, because all writing to the process's address space would be done in one place, in the syscall's return path. Forgetting the check would be much less likely.
This bug demonstrates why ioctl() is dangerous. ioctl() should really be considered as a (huge) family of syscalls, not a single syscall, because each ioctl number (such as TIOCGWINSZ) can read or write address space, and sometimes the file descriptor table, in a different way.
It enables implementations of interfaces to be moved between the kernel and userland.
If the mechanism used to talk to the kernel is the same as the mechanism used to talk to other userland processes, processes should be agnostic as to whether the interfaces they use are implemented by the kernel.
For example, NLTSS allowed the filesystem to be in-kernel (faster) or in a userland process (more robust and secure). So it was possible to flip a switch to trade off speed and robustness.
It allows implementations of interfaces to be in-process too.
This allows further performace tradeoffs. The pathname lookup logic of open() can be moved between the calling process and a separate process. For speed, pathname lookup can be placed in the process that implements the filesystem (as in Plash currently) in order to avoid doing a cross-process call for each pathname element. Alternatively, pathname lookup can be done in libc (as in the Hurd).
It can help with versioning of system interfaces.
Stable interfaces are nice, but the ability to evolve interfaces is nice too.
Using object-based message-passing interfaces instead of raw syscalls can help with that. You can introduce new objects, or add new methods to existing objects. Old, obsolete interfaces can be defined in terms of new interfaces, and transparently implemented outside the kernel. New interfaces can be exposed selectively rather than system-wide.
It does not have to hurt performance.
Objects can still be implemented in the kernel. For example, in EROS (and KeyKOS/CapROS/Coyotos), various object types are implemented by the kernel, but are invoked through the same capability invocation mechanism as userland-implemented objects.
Object invocations can be synchronous (call-return). They do not have to go via an asynchronous message queue. The kernel can provide a send-message-and-wait-for-reply syscall that is equivalent to a sendmsg()+recvmsg() combo but faster. L4 and EROS provide syscalls like this.

Lacking Rhoticity