Saturday, 19 November 2011

Stack unwinding risks on 64-bit Windows

Recently, I've been looking at how x86-64 Windows does stack unwinding in 64-bit processes, and I've found some odd behaviour. If the stack unwinder finds a return address on the stack that does not have associated unwind info, it applies a fallback unwind rule that does not make much sense.

I've been wondering if this could make some x86-64 programs more easily exploitable if they corrupt the stack or if they use x86-64 assembly code that does not have unwind info.

In pseudocode, the current unwind logic looks something like this:

void unwind_stack(struct register_state regs) {
  while (???) {
    unwind_info = get_unwind_info(regs.rip);
    if (unwind_info) {
      if (has_exception_handler(unwind_info)) {
        // Run exception handler...
      }
      regs = unwind_stack_frame(unwind_info, regs);
    }
    else {
      // Fallback case for leaf functions:
      regs.rip = *(uint64_t *) regs.rsp;
      regs.rsp += 8;
    }
  }
}

The issue is that the fallback case only makes sense for the first iteration of the loop. The fact that it is applied to later iterations is probably just sloppy programming.

The fallback case is intended to handle "leaf functions". This means functions that:

  1. do not adjust %rsp, and
  2. do not call other functions.

These two properties are related: if a function calls other functions, it must adjust %rsp first, otherwise it does not conform to the x86-64 ABI.

Since the fallback case is applied repeatedly, the unwinder will happily interpret the stack as a series of return addresses with no gaps between them:

     ...
     8 bytes   return address
     8 bytes   return address
     8 bytes   return address
     ...

However, those are not valid stack frames in the x86-64 Windows ABI. A valid stack frame for a non-leaf function Foo() (i.e. a function that calls other functions) looks like this:

  ------------ 16-byte aligned
    32 bytes   "shadow space" (scratch space for Foo())
     8 bytes   return address (points into Foo()'s caller)
     8 bytes   scratch space for Foo()
  16*n bytes   scratch space for Foo() (for some n >= 0)
    32 bytes   "shadow space" (scratch space for Foo()'s callees)
  ------------ 16-byte aligned

This layout comes from two requirements in the Windows x86-64 ABI:

  1. %rsp must be congruent to 8 modulo 16 on entry to a function. (Equivalently, %rsp must be 16-byte aligned immediately before a CALL instruction.) This requirement is common to Windows and Linux.
  2. The caller must also reserve 32 bytes of "shadow space" on the stack above the return address, which the callee can use as scratch space. (The x86-64 ABI on Linux does not have this requirement.)

This means that a function that does this:

  bar:
    call qux
    ret
cannot be valid, because:
  • it does not align the stack correctly on entry to qux();
  • more seriously, it does not allocate shadow space, so the return address that bar()'s RET instruction jumps to could have been overwritten by qux().

Yet the Windows unwinder treats the resulting stack frame as valid.
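The alignment half of that argument is just arithmetic on %rsp, which can be sketched in C. This is only an illustration: the function names and the example stack addresses are made up, and the only facts it relies on are the 8-mod-16 rule above and that CALL pushes an 8-byte return address.

```c
#include <stdbool.h>
#include <stdint.h>

/* The ABI rule from the text: %rsp must be 8 mod 16 on entry to a function. */
bool entry_rsp_ok(uint64_t rsp) {
  return rsp % 16 == 8;
}

/* A CALL instruction pushes an 8-byte return address, so %rsp drops by 8. */
uint64_t rsp_after_call(uint64_t rsp) {
  return rsp - 8;
}

/* %rsp on entry to qux() when bar() calls it, given the number of bytes
   that bar() subtracts from %rsp before the CALL (hypothetical helper). */
uint64_t qux_entry_rsp(uint64_t bar_entry_rsp, uint64_t bar_frame_bytes) {
  return rsp_after_call(bar_entry_rsp - bar_frame_bytes);
}
```

If bar() is entered with a correctly aligned %rsp and calls qux() straight away (0 bytes reserved), qux() is entered with %rsp at 0 mod 16, which breaks the rule. Reserving 40 bytes first (32 bytes of shadow space plus 8 bytes of padding) restores it.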

The upshot of the fallback rule is that when the unwinder reaches a return address that does not have unwind info, it will scan up the stack looking for a value that does, looking at each 64-bit value in turn. The unwinder seems to lack basic sanity checking, so it does not stop even if it hits a zero value (which clearly cannot be a valid return address).
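To make the scanning behaviour concrete, here is a toy model of just the fallback rule in C. It is a sketch, not the real unwinder: the `known` table standing in for the unwind tables, the function names and the example addresses are all made up, and the model assumes the faulting function itself has no unwind info. It also shows a stricter variant that allows the fallback only once.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the unwind tables: an address "has unwind info" only if
   it appears in the `known` array. */
bool has_unwind_info(const uint64_t *known, size_t known_len, uint64_t addr) {
  for (size_t i = 0; i < known_len; i++) {
    if (known[i] == addr)
      return true;
  }
  return false;
}

/* Treat each 64-bit stack slot in turn as a return address until one has
   unwind info. Returns the index of the accepted slot, or -1 if none is
   accepted. With `strict` set, the fallback is applied once (the leaf
   function case), so only the first slot is ever considered. */
long scan_for_unwind_info(const uint64_t *stack, size_t slots,
                          const uint64_t *known, size_t known_len,
                          bool strict) {
  size_t limit = (strict && slots > 1) ? 1 : slots;
  for (size_t i = 0; i < limit; i++) {
    if (has_unwind_info(known, known_len, stack[i]))
      return (long) i;
  }
  return -1;
}
```

Given a stack whose first two slots are scratch data and whose third slot happens to hold an address with unwind info, the lenient scan accepts slot 2, while the strict variant gives up after slot 0.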

This has a tendency to mask mistakes. If you have an assembly function with incorrect or missing unwind info, the unwinder will tend to scan past uninitialised values or scratch data on the stack and recover at a higher-up stack frame.

The risky part is that the unwinder will be interpreting values as return addresses when they aren't return addresses at all. If the unwinder hits an address whose unwind info has an associated exception handler, the unwinder will jump to the handler. This is extra risky because language-level exceptions (e.g. for C++) and hardware exceptions are handled by the same mechanism on Windows, known as Structured Exception Handling (SEH). Both result in stack unwinding on x86-64. This means that null pointer dereferences trigger stack unwinding. This is unlike Linux, where hardware exceptions don't normally trigger stack unwinding.

This means that an attacker might be able to do the following:

  • find a function F containing an exception handler that does something interesting;
  • guess the address of the function F (more specifically, an address that lies inside F's __try block);
  • find another function G with incorrect or missing unwind info;
  • arrange for the address of F to be in G's stack frame;
  • cause a hardware exception (e.g. a null pointer dereference) to occur while G is on the stack;
  • and therefore cause the exception handler in F to get run.

The x86-64 version of MinGW (a GNU toolchain for Windows) is susceptible to this, because its version of GCC doesn't generate unwind info. Here's an example of how a null pointer dereference can cause an exception handler to be run:

prog.c (compile with x86-64 MinGW's gcc):

#include <stdio.h>
#include <stdlib.h>

long long get_addr();

void exc_handler() {
  printf("in exception handler!\n");
  exit(0);
}

void my_func(long long x) {
  // This might be necessary to force x to be written to the stack:
  // printf("stack=%p\n", &x);

  // This should normally kill the process safely, but it doesn't here:
  *(int *) 0 = 0;
}

int main() {
  my_func(get_addr());
  return 0;
}

get_unwind_ret.asm (assemble with Microsoft's x86-64 assembler):

PUBLIC get_addr
EXTRN exc_handler:PROC

_TEXT SEGMENT

; Helper function for getting the address inside the function below.
get_addr PROC
        lea rax, ret_addr
        ret
get_addr ENDP

; Innocent function that uses an exception handler.
blah PROC FRAME :exc_handler
.endprolog
ret_addr LABEL PTR
        hlt
blah ENDP

_TEXT ENDS

END

Here's what I get on Cygwin:

$ ml64 /c get_unwind_ret.asm
$ x86_64-w64-mingw32-gcc get_unwind_ret.obj prog.c -o prog
$ ./prog
in exception handler!

This works because GCC generates the following code without unwind info:

$ x86_64-w64-mingw32-gcc prog.c -S -o -
...
my_func:
        pushq %rbp
        movq %rsp, %rbp
        movq %rcx, 16(%rbp)
        movl $0, %eax
        movl $0, (%rax)
        leave
        ret
...

my_func()'s argument is written to the shadow space above the return address that points to main(), but since main() doesn't have unwind info either, the spilled argument gets interpreted as a return address by the unwinder.

Some conclusions:

  • Be extra careful if you're writing x86-64 assembly code on Windows.
  • Be careful if you're using GCC to generate x86-64 code on Windows and check whether it is producing unwind info. (There is no problem on x86-32 because Windows does not use stack unwinding for exception handling on x86-32.)
  • Windows' x86-64 stack unwinder should be changed to be stricter. It should stop if it encounters a return address without unwind info.

Thursday, 17 November 2011

ARM cache flushing & doubly-mapped pages

If you're familiar with the ARM architecture you'll probably know that self-modifying code has to be careful to flush the instruction cache on ARM. (Back in the 1990s, the introduction of the StrongARM with its split instruction and data caches broke a lot of programs on RISC OS.)

On ARM Linux there's a syscall, cacheflush, for flushing the instruction cache for a range of virtual addresses.

This syscall works fine if you map some code as RWX (read+write+execute) and execute it from the same virtual address that you use to modify it. This is how JIT compilers usually work.

In the Native Client sandbox, though, for dynamic code loading support, we have code pages that are mapped twice, once as RX (read+execute) and once as RW (read+write). Naively you'd expect that after you write instructions to the RW address, you just have to call cacheflush on the RX address. However, that's not enough. cacheflush doesn't just clear the i-cache. It also flushes the data cache to ensure that writes are committed to memory so that they can be read back by the i-cache. The two parts are clear if you look at the kernel implementation of cacheflush, which does two different MCRs on the address range.

I guess the syscall interface was not designed with double-mapped pages in mind, since it doesn't allow the i-cache and d-cache to be flushed separately. For the time being, Native Client will have to call cacheflush on both the RW and RX mappings.
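As a rough sketch of that workaround in C, a helper could flush the same range through both aliases. This is not Native Client's code. The function name is made up, and it uses the GCC/Clang builtin __builtin___clear_cache() on the assumption that, on ARM Linux, the builtin ends up making the cacheflush syscall. On architectures with coherent instruction caches (such as x86) the builtin typically does nothing.

```c
#include <stddef.h>

/* Flush caches for code that is mapped twice. `rw` and `rx` are assumed to
   be two virtual mappings of the same `len` bytes of physical memory. */
void flush_double_mapped(void *rw, void *rx, size_t len) {
  /* Via the writable alias: pushes the newly written instructions out of
     the d-cache for the addresses that were actually written to. */
  __builtin___clear_cache((char *) rw, (char *) rw + len);
  /* Via the executable alias: drops stale i-cache contents for the
     addresses that will actually be executed. */
  __builtin___clear_cache((char *) rx, (char *) rx + len);
}
```

A caller would write the new instructions through the RW mapping, call flush_double_mapped(), and only then jump to the RX mapping.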

See NaCl issue 2443 for where this came up.

Tuesday, 23 August 2011

Fixing the trouble with Buildbot

Last year I wrote a blog post, "The trouble with Buildbot", about how Buildbot creates a dilemma for complex projects because it forces you to choose between two ways of describing a project's build steps:
  • You can describe build steps in the Buildbot config. Buildbot configs are awkward to update -- someone has to restart the Buildbot master -- and hard to test, but you get the benefit that the build steps appear as separate steps in Buildbot's display.
  • You can write a script which runs the build steps directly, and check it into the same repository as your project. This is easier to maintain and test, but traditionally all the output from the script would appear as a single Buildbot build step, making the output hard to read.

Fortunately, Brad Nelson has addressed this problem with an extension to Buildbot known as "Buildbot Annotations". The Python code for this currently lives in chromium_step.py (see AnnotatedCommand).

The idea is that your checked-in script will run multiple steps sequentially but output tags between them (e.g. "@@@BUILD_STEP tests@@@") so that the output can be parsed into chunks by the Buildbot master, and displayed as separate chunks.

For example, an early version of Native Client's Annotations-based buildbot script looked something like this:

...
echo @@@BUILD_STEP gyp_compile@@@
make -C .. -k -j12 V=1 BUILDTYPE=${GYPMODE}

echo @@@BUILD_STEP scons_compile${BITS}@@@
./scons -j 8 -k --verbose ${GLIBCOPTS} --mode=${MODE}-host,nacl \
    platform=x86-${BITS}

echo @@@BUILD_STEP small_tests${BITS}@@@
./scons -k --verbose ${GLIBCOPTS} --mode=${MODE}-host,nacl small_tests \
    platform=x86-${BITS} ||
    { RETCODE=$? && echo @@@STEP_FAILURE@@@;}
...

(More recently, this shell script has been replaced with a Python script.)
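On the receiving side, splitting the output into steps comes down to recognising the tag lines. The real parser is the Python code in chromium_step.py. The C function below is only a sketch of the idea, and its name and return convention are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* If `line` is a tag of the form "@@@BUILD_STEP <name>@@@" (optionally
   followed by a newline), copy <name> into `name` and return 1. Return 0 if
   the line is not such a tag or if the name does not fit in `name_size`
   bytes including the terminating NUL. */
int parse_build_step(const char *line, char *name, size_t name_size) {
  static const char prefix[] = "@@@BUILD_STEP ";
  static const char suffix[] = "@@@";
  size_t prefix_len = sizeof(prefix) - 1;
  size_t suffix_len = sizeof(suffix) - 1;
  size_t line_len = strcspn(line, "\r\n");  /* ignore a trailing newline */

  if (line_len < prefix_len + suffix_len ||
      strncmp(line, prefix, prefix_len) != 0 ||
      strncmp(line + line_len - suffix_len, suffix, suffix_len) != 0)
    return 0;

  size_t name_len = line_len - prefix_len - suffix_len;
  if (name_len >= name_size)
    return 0;  /* name does not fit in the caller's buffer */
  memcpy(name, line + prefix_len, name_len);
  name[name_len] = '\0';
  return 1;
}
```

A log reader would call this on every line and start a new chunk of output whenever it returns 1. Other tags, such as @@@STEP_FAILURE@@@, would need their own cases.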

You can see this in use on the Native Client Buildbot page (and also on the trybot page, though that's less readable).

The logic for running NaCl's many build steps -- including a clobber step, a Scons build, a Gyp build, small_tests, medium_tests, large_tests, chrome_browser_tests etc. -- used to live in the Buildbot config, and we'd usually have to get Brad Nelson to change it on our behalf. Brad would have to restart the buildbot masters manually, and this would halt any builds that were in progress, including trybot jobs.

Now the knowledge of these build steps has moved into scripts that are checked into the native_client repo, which can easily be updated. We can change the scripts at the same time as changing other code, with an atomic SVN commit. Changes can be tested via the trybots.

Chromium is not using Buildbot Annotations yet for its buildbot, but it would be good to switch it over. One obstacle is timeout handling. Buildbot's build steps can have separate timeouts, and the Buildbot build slave is responsible for terminating a build step's subprocess(es) if they take too long. With Buildbot Annotations, the responsibility for doing per-step timeouts would move to the checked-in build script.

The current Annotations output format has some downsides:

  • The syntax is simple but kind of ugly.
  • It's not possible to nest build steps.
  • It's not possible to interleave output from two concurrent build steps.

Overall, Annotations reduces our dependence on Buildbot. If there were a simpler, more scalable alternative to Buildbot that also supported the Annotations format, we could easily switch to it because our Buildbot config is not as complex as it used to be.

Thursday, 10 February 2011

Cookies versus the Chrome sandbox

Although Chrome's sandbox does not protect one web site from another in general, it can provide such protection in some cases. Those cases are ones in which HTTP cookies are either reduced in scope or not used at all. One lesson we could draw from this is that cookies reduce the usefulness of Chrome's sandbox.

The scenario we are exploring supposes that there is a vulnerability in Chrome's renderer process, and that the vulnerability lets a malicious site take control of the renderer process. This means that all the restrictions that are normally enforced on the malicious site by the renderer process are stripped away, and all we are left with are the restrictions enforced on the renderer process by the Chrome browser process and the Chrome sandbox.

In my previous blog post, I explained how an attacker site, evil.com, that manages to exploit the renderer process could steal the login cookies from another site, mail.com, and so gain access to the user's e-mail.

The attack is made possible by the combination of two features:

  1. cookies
  2. frames

Chrome currently runs a framed page in the same renderer process as the parent page. HTML standards allow framed pages to access cookies, so the browser process has to give the renderer process access to the cookies for both pages.

Because this problem arises from the interaction of these features, one site is not always vulnerable to other sites. There should be a couple of ways that users and sites can mitigate the problem, without changing Chrome. Firstly, the user can change how cookies are scoped within the browser by setting up multiple profiles. Secondly, a site can skirt around the problem by not using cookies at all. We discuss these possibilities below.

  • Use multiple profiles: As a user, you can create multiple browser profiles, and access mail.com and evil.com in separate profiles.

    Chrome does not make this very easy at the moment. It provides a command line option (--user-data-dir) for creating more profiles, but this feature is not available through the GUI. Chrome's GUI provides just two profiles: one profile (persistent) for normal windows, plus an extra profile (non-persistent) for windows in "incognito" mode.

    So, you could log in to mail.com in a normal window and view evil.com in an incognito window, or vice versa. This is safer because cookies registered by a web site in one profile are not accessible to sites visited in another profile. Each profile has a separate pool of cookies. This feature of browser profiles means you can log into one mail.com account in incognito mode and a different mail.com account in normal mode.

    It would be interesting to see if this profile separation could be automated.

  • Use web-keys instead of cookies: The developers of mail.com could defend against evil.com by not using cookies to track user logins. Instead mail.com could use web-keys. Logging in to mail.com would take you to a URL containing an unguessable token. Such a URL is known as a "web-key". (For technical reasons, the unguessable token should be in the "#" fragment part of the URL.)

    This is safer because even if evil.com compromises the renderer process, the Chrome browser process generally does not give the renderer process a way to enumerate other tabs, discover other tabs' URLs, or enumerate the user's browsing history and bookmarks.

    Using web-keys has been proposed before to address a different (but related) web security problem, clickjacking. (See Tyler Close's essay, The Confused Deputy Rides Again.)

    Using web-keys would change the user interface for mail.com a little. Cookies are the mechanism by which entering "mail.com" in the URL bar can take you directly to the inbox of the e-mail account you are logged in to. Whenever the browser sees "mail.com" as the domain name it automatically adds the cookies for "mail.com" to the HTTP request. (This automatic attachment of credentials makes this a type of ambient authority.) This mechanism adds some convenience for the user, but it is also a means by which evil.com can attack mail.com, because whenever evil.com links to mail.com, the cookies get added to the request too. The browser does not distinguish between a URL entered by the user in the address bar and a URL provided by another site like evil.com.

    So if mail.com were changed to remove its use of cookies, you would lose the small convenience of being able to get to your inbox directly by typing "mail.com" without re-logging-in. Is there a way to get this convenience back? An alternative to using cookies to record logins is to use bookmarks. If mail.com uses web-key URLs, you could bookmark the address of the inbox page. To get to your inbox without re-logging-in, you would select the bookmark.

    These days the Chrome address bar accepts more than just URLs (which is why the address bar is actually called the "omnibar"), so you could imagine that entering "mail.com" would search your bookmarks and jump to your bookmarked inbox page rather than being treated as the URL "http://mail.com".
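For illustration, here is roughly what minting a web-key could look like on the server side, sketched in C. The function name, the 16-byte token size, the hex encoding and the example base URL are all my own assumptions rather than part of any web-key specification. The only properties taken from the discussion above are that the token is unguessable and sits in the "#" fragment.

```c
#include <stdio.h>
#include <string.h>

/* Write "<base>#<32 hex digits>" into `out`, using 16 random bytes read
   from /dev/urandom. Returns 0 on success, or -1 if the random bytes could
   not be read or `out_size` is too small. */
int make_web_key(const char *base, char *out, size_t out_size) {
  unsigned char token[16];
  FILE *f = fopen("/dev/urandom", "rb");
  if (f == NULL)
    return -1;
  size_t got = fread(token, 1, sizeof(token), f);
  fclose(f);
  if (got != sizeof(token))
    return -1;

  size_t base_len = strlen(base);
  /* base + '#' + two hex digits per byte + terminating NUL */
  if (out_size < base_len + 1 + 2 * sizeof(token) + 1)
    return -1;
  memcpy(out, base, base_len);
  out[base_len] = '#';
  for (size_t i = 0; i < sizeof(token); i++)
    snprintf(out + base_len + 1 + 2 * i, 3, "%02x", (unsigned) token[i]);
  return 0;
}
```

Calling make_web_key("https://mail.com/inbox", buf, sizeof buf) fills buf with the base URL followed by "#" and a fresh 32-digit token. The server would also have to remember which account each token refers to.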

Conclusions

There are a couple of possible conclusions we could draw from this:

  • Cookies weaken the Chrome sandbox.
  • Frames weaken the Chrome sandbox.

If we blame cookies, we should look critically at other browser features that behave similarly to cookies by providing ambient authority. For example, the new LocalFileSystem feature (an extension of the File API) provides local storage. The file store it provides is per-origin and so is based on ambient authority. If mail.com uses this to cache e-mail, and evil.com exploits the renderer process, then evil.com will be able to read and modify the cached e-mail. There are other local storage APIs (IndexedDB and Web Storage), but they are based on ambient authority too. From this perspective, the situation is getting worse.

If we blame frames, this suggests that browsers should fix the problem by implementing site isolation. Site isolation means that the browser would put a framed page in a different renderer process from a parent page. Microsoft's experimental Gazelle browser implements site isolation but breaks compatibility with the web. It remains to be seen whether a web browser can implement site isolation while retaining compatibility and good performance.

Either way, concerned users and web app authors need to know how browser features are implemented if they are to judge how much security the browser can provide. That's not easy, because the web platform is so complicated!

Tuesday, 21 December 2010

A common misconception about the Chrome sandbox

A common misconception about the Chrome web browser is that its sandbox protects one web site from another.

For example, suppose you are logged into your e-mail account on mail.com in one tab, and have evil.com open in another tab. Suppose evil.com finds an exploit in the renderer process, such as a memory safety bug, that lets it run arbitrary code there. Can evil.com get hold of your HTTP cookies for mail.com, and thereby access your e-mail account?

Unfortunately, the answer is yes.

The reason is that mail.com and evil.com can be assigned to the same renderer process. This is not just something the browser does occasionally to save memory: evil.com can force it to happen by opening an iframe on mail.com. With mail.com's code running in the same exploited renderer process, evil.com can take it over and read the cookies for your mail.com account and use them for its own ends.

There are a couple of reasons why the browser puts a framed site in the same renderer process as the parent site. Firstly, if the sites were handled by separate processes, the browser would have to do costly compositing across renderer processes to make the child frame appear inside the parent frame. Secondly, in some cases the DOM allows Javascript objects in one frame to obtain references to DOM objects in other frames, even across origins, and it is easier for this to be managed within one renderer process.

I don't say this to pick on Chrome, of course. It is better to have the sandbox than not to have it.

Chrome has never claimed that the sandbox protects one site against another. In the tech report "The Security Architecture of the Chromium Browser" (Barth, Jackson, Reis and the Chrome Team; 2008), "Origin isolation" is specifically listed under "Out-of-scope goals". They state that "an attacker who compromises the rendering engine can act on behalf of any web site".

There are a couple of ways that web sites and users can mitigate this problem, which I'll discuss in another post. However, in the absence of those defences, what Chrome's multi-process architecture actually gives you is the following:

  • Robustness if a renderer crashes. Having multiple renderer processes means that a crash of one takes down only a limited number of tabs, and the browser and the other renderers will survive. It also helps memory management.

    But we can get this without sandboxing the renderers.

  • Protection of the rest of the user's system from vulnerabilities in the renderer process. For example, the sandboxed renderer cannot read any of the user's files, except for those the user has granted through a "File Upload" file chooser.

    But we can get this by sandboxing the whole browser (including any subprocesses), without needing to have the browser separated from the renderer.

    For example, since 2007 I have been running Firefox under Plash (a sandbox), on Linux.

    In principle, such a sandbox should be more effective at protecting applications and files outside the browser than the Chrome sandbox, because the sandbox covers all of the browser, including its network stack and the so-called browser "chrome" (this means the parts of the GUI outside of the DOM).

    In practice, Plash is not complete as a sandbox for GUI apps because it does not limit access to the X Window System, so apps can do things that X allows such as screen scraping other apps and sending them input.

The main reason Chrome was developed to sandbox its renderer processes but not the whole browser is that this is easier to implement with sandboxing technologies that are easily deployable today. Ideally, though, the whole browser would be sandboxed. One of the only components that would stay unsandboxed, and have access to all the user's files, would be the "File Open" dialog box for choosing files to upload.

Saturday, 18 December 2010

When printf debugging is a luxury

Inserting printf() calls is often considered to be a primitive fallback for when other debugging tools, such as stack backtraces with source line numbers, are not available.

But there are some situations in low-level programming where most libc calls don't work and so even printf() and assert() are unavailable luxuries. This can happen:

  • when libc is not properly initialised yet;
  • when we are writing code that is called by libc and cannot re-enter libc code;
  • when we are in a signal handler;
  • when only limited stack space is available;
  • when we cannot allocate memory for some reason; or
  • when we are not even linked to libc.

Here's a fragment of code that has come in handy in these situations. It provides a simple assert() implementation:

#include <string.h>
#include <unistd.h>

static void debug(const char *msg) {
  write(2, msg, strlen(msg));
}

static void die(const char *msg) {
  debug(msg);
  _exit(1);
}

#define TO_STRING_1(x) #x
#define TO_STRING(x) TO_STRING_1(x)

#define assert(expr) do {                                                     \
  if (!(expr)) die("assertion failed at " __FILE__ ":" TO_STRING(__LINE__)    \
                   ": " #expr "\n"); } while (0)

By using preprocessor trickery to construct the assertion failure string at compile time, it avoids having to format the string at runtime. So it does not need to allocate memory, and it doesn't need to do multiple write() calls (which can become interleaved with other output in the multi-threaded case).
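The two-level TO_STRING() macro is what makes this work, and it is easy to get wrong. The small sketch below shows the difference. ANSWER is a made-up stand-in for __LINE__ (so that the result does not depend on where the snippet is pasted), and the three function names are only there for illustration.

```c
#define TO_STRING_1(x) #x
#define TO_STRING(x) TO_STRING_1(x)

/* Stand-in for __LINE__, which is also a macro that expands to a number. */
#define ANSWER 42

/* One level: # is applied to the argument as written, before it is
   macro-expanded, so this yields "ANSWER". */
const char *one_level(void) { return TO_STRING_1(ANSWER); }

/* Two levels: ANSWER is expanded to 42 while being passed through
   TO_STRING, and only then stringified, so this yields "42". */
const char *two_levels(void) { return TO_STRING(ANSWER); }

/* Adjacent string literals are joined by the compiler, so the whole
   message is one constant string and needs no formatting at runtime. */
const char *message(void) { return "failed at line " TO_STRING(ANSWER) "\n"; }
```

With __LINE__ in place of ANSWER, the one-level version would produce the useless string "__LINE__" instead of the line number.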

Sometimes even libc's write() is a luxury. In some builds of GNU libc on Linux, glibc's syscall wrappers use the TLS register (%gs on i386) to fetch the address of a routine for making syscalls.

However, if %gs is not set up properly for some reason, this will fail. For example, for Native Client's i386 sandbox, %gs is set to a different value whenever sandboxed code is running, and %gs stays in this state if sandboxed code faults and triggers a signal handler. In Chromium's seccomp-sandbox, %gs is set to zero in the trusted thread.

In those situations we have to bypass libc and do the system calls ourselves. The following snippet comes from reference_trusted_thread.cc. The sys_*() functions are defined by linux_syscall_support.h, which provides wrappers for many Linux syscalls:

#include <string.h>  // for strlen()

#include "linux_syscall_support.h"

void die(const char *msg) {
  sys_write(2, msg, strlen(msg));
  sys_exit_group(1);
}

Thursday, 4 November 2010

An introduction to FreeBSD-Capsicum

In my last blog post, I described one of the features in FreeBSD-Capsicum: process descriptors. Now it's time for an overview of Capsicum.

Capsicum is a set of new features for FreeBSD that adds better support for sandboxing, using a capability model in which the capabilities are Unix file descriptors (FDs).

Capsicum takes a fairly conservative approach, in that it does not make operations on file descriptors virtualisable. This approach has some limitations -- we do not get the advantages of having purely message-passing syscalls. However, it does mean that the new features are orthogonal.

The main new features are:

  • A per-process "capability mode", which is turned on via a new cap_enter() syscall.

    This mode disables any system call that provides ambient authority. So it disables system calls that use global namespaces, including the file namespace (e.g. open()), the PID namespace (e.g. kill()) and the network address namespace (e.g. connect()).

    This is not just a syscall filter, though. Some system calls optionally use a global namespace. For example, sendmsg() and sendto() optionally take a socket address. For openat(), the directory FD argument can be given as AT_FDCWD, which makes the path relative to the process's current working directory. Capability mode disables those cases.

    Furthermore, capability mode disallows the use of ".." (parent directory) in filenames for openat() and the other *at() calls. This changes directory FDs to be limited-authority objects that convey access to a specific directory and not the whole filesystem. (It is interesting that this appears to be a property of the process, via capability mode, rather than of the directory FD itself.)

    Capability mode is inherited across fork and exec.

  • Finer-grained permissions for file descriptors. Each FD gets a large set of permission bits. A less-permissive copy of an FD can be created with cap_new(). For example, you can have read-only directory FDs, or non-seekable FDs for files.
  • Process descriptors. Capsicum doesn't allow kill() inside the sandbox because kill() uses a global namespace (the PID namespace). So Capsicum introduces process descriptors (a new FD type) as a replacement for process IDs, and adds pdfork(), pdwait() and pdkill() as replacements for fork(), wait() and kill().
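To illustrate the filename restriction that capability mode places on openat(), here is a user-space model in C of the check. It is not Capsicum's kernel code, and the function name is made up. It rejects ".." components as described above, and it also rejects absolute paths on the assumption that those count as using the global file namespace, since openat() ignores the directory FD for them.

```c
#include <stdbool.h>
#include <string.h>

/* Model of the rule: a path passed along with a directory FD is acceptable
   only if it is relative and has no ".." component. */
bool path_stays_below_dir(const char *path) {
  if (path[0] == '/')
    return false;  /* absolute: would bypass the directory FD entirely */
  const char *p = path;
  while (*p != '\0') {
    size_t len = strcspn(p, "/");  /* length of the next path component */
    if (len == 2 && p[0] == '.' && p[1] == '.')
      return false;  /* ".." could climb out of the directory */
    p += len;
    while (*p == '/')
      p++;  /* skip one or more separators */
  }
  return true;
}
```

Under this model "lib/libc.so" and "a/..b/c" are accepted, while "../etc/passwd", "a/../b" and "/etc/passwd" are rejected. Symlinks are ignored here, so this is only a picture of the rule and not something to rely on for enforcement.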

Plus there are a couple of smaller features:

  • Message-based sockets. The Capsicum guys implemented Linux's SOCK_SEQPACKET interface for FreeBSD.
  • An fexecve() system call which takes a file descriptor for an executable. This replaces execve(), which is disabled in capability mode because execve() takes a filename.

    Capsicum's fexecve() ignores the implicit filename that is embedded in the executable's PT_INTERP field, so it is only good for loading the dynamic linker directly or for loading other statically linked executables.

Currently, the only programs that run under Capsicum are those that have been ported specially:

  • The Capsicum guys ported Chromium, and it works much the same way as on Linux. On both systems, Chromium's renderer process runs sandboxed, but the browser process does not. On both systems, Chromium needs to be able to turn on sandboxing after the process has started up, because it relies on legacy libraries that use open() during startup.
  • Some Unix utilities, including gzip and dhclient, have been extended to use sandboxing internally (privilege separation). Like Chromium, gzip can open files and then switch to capability mode.

However, it should be possible to run legacy Unix programs under Capsicum by porting Plash.

At first glance, it looks like Plash would have to do the same tricks under FreeBSD-Capsicum as it does under Linux to run legacy programs. Under Linux, Plash uses a modified version of glibc in order to intercept its system calls and convert them to system calls that work in the sandbox. That's because the Linux kernel doesn't provide any help with intercepting the system calls. The situation is similar under FreeBSD -- Capsicum does not add any extensions for bouncing syscalls back to a user space handler.

However, there are two aspects of FreeBSD that should make Plash easier to implement there than on Linux:

  • FreeBSD's libc is friendlier towards overriding its functions. On both systems, it is possible to override (for example) open() via an LD_PRELOAD library that defines its own "open" symbol. But with glibc on Linux, this doesn't work for libc's internal calls to open(), such as from fopen(). For a small gain in efficiency, these calls don't go through PLT entries and so cannot be intercepted.

    FreeBSD's libc doesn't use this optimisation and so it allows the internal calls to be intercepted too.

  • FreeBSD's dynamic linker and libc are not tightly coupled, so it is possible to change the dynamic linker to open its libraries via IPC calls without having to rebuild libc in lockstep.

    In contrast, Linux glibc's ld.so and libc.so are built together, share some data structures (such as TLS), and cannot be replaced independently.