Tuesday, 10 March 2015

The DRAM rowhammer bug is exploitable

I've been researching the DRAM rowhammer issue and its security implications for a while. We've finally published our findings on the Project Zero blog: Exploiting the DRAM rowhammer bug to gain kernel privileges.

Wednesday, 21 January 2015

Conditionalising C/C++ code: "#ifdef FOO" vs. "#if FOO"

Is it better to use #ifdef PLATFORM or #if PLATFORM when writing code that needs to be conditionalised according to OS, CPU architecture, etc.?

Chromium's codebase uses the former. For example, it uses #ifdef OS_WIN or #if defined(OS_WIN), where OS_WIN is #defined on Windows (by build/build_config.h) and undefined elsewhere.

In contrast, NaCl's codebase uses the latter. It uses #if NACL_WINDOWS or if (NACL_WINDOWS), where NACL_WINDOWS is always #defined: as 1 on Windows and 0 elsewhere.

This latter approach has the benefit of catching some mistakes:

  • If you mistype the macro name in #if NACL_WINDOWS, you can get a compiler warning.

    However, this only works if you enable GCC/Clang's -Wundef warning. Microsoft's compiler (MSVC) has a similar warning, C4668, but it's effectively unusable because it produces warnings about system header files.

  • You can sometimes write if (PLATFORM) instead of #if PLATFORM. The if() version has the advantage that the code in the if() block will be compile-tested on all platforms, not just those where PLATFORM == 1. This can help catch mistakes earlier. (Both styles are sketched in the example after this list.)

    This doesn't work if the code block uses functions that are only defined on PLATFORM, though.
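
To make the comparison concrete, here is a sketch of both styles in one file. This is illustrative only -- not actual Chromium or NaCl code -- and it assumes the build system provides the macros:

    /* Chromium style: OS_WIN is either #defined or absent.
       A typo such as "#ifdef OS_WINN" fails silently. */
    #ifdef OS_WIN
    static const char *const os_name = "Windows";
    #else
    static const char *const os_name = "other";
    #endif

    /* NaCl style: NACL_WINDOWS is always defined, as 0 or 1.
       A mistyped "#if NACL_WINDOWSS" is caught by -Wundef. */
    #if NACL_WINDOWS
    #define STACK_ALIGNMENT 16
    #else
    #define STACK_ALIGNMENT 8
    #endif

    /* NaCl style also allows a plain if(): both branches get
       compile-tested on every platform, not just on Windows. */
    static int default_stack_size(void) {
      if (NACL_WINDOWS)
        return 2 * 1024 * 1024;
      return 8 * 1024 * 1024;
    }

Compiling this with "gcc -Wundef -c" will warn whenever an undefined macro is evaluated inside an "#if".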

See also: The Great -Wundef purge

Wednesday, 23 July 2014

Implementing fork() on the Mill CPU

The Mill is a new CPU architecture that claims to provide high performance with much better performance-per-watt than conventional CPUs that use out-of-order execution.

The Mill achieves this by making various architectural simplifications. One of those is to remove the TLB from the fast path of memory accesses. Rather than having the TLB between the CPU core and the cache, the Mill's TLB is between the cache and main memory.

OS models

This means the Mill is best suited for running Single Address Space operating systems (SASOS). The intent is that different processes will live at different addresses within a shared 64-bit address space. A process runs with permissions to access restricted ranges of this address space. Switching between processes is therefore just a matter of switching those permissions, which is fast on the Mill. It doesn't involve an expensive flush of the TLB (as on a conventional OS), and it doesn't involve flushing the Mill's virtually-indexed, virtually-tagged (VIVT) cache.

This runs into a problem if we want to run Unix programs that use fork(), though. Use of fork() assumes that multiple processes will want to use the same virtual addresses.

The Mill developers have said they have a scheme for handling fork(), but they haven't said what it is, so I'm going to speculate. :-)

Context switching

If you create forked processes that share a range of virtual address space, a Mill OS can just flush the TLB and cache for those ranges whenever it needs to context switch between the two processes.

Basically, a Mill OS can act like a Separate Address Space OS when handling forked processes.

Switching costs

How expensive that would be depends on how the TLB and cache work.

  • In conventional CPUs, the TLB is per-core. The Mill's TLB might be shared between cores, though, if the Mill has higher-level caches that are both virtually-tagged and shared between cores.

    If that's the case, forked processes wouldn't be able to run concurrently on multiple cores.

  • Flushing the TLB and/or cache when context switching between forked processes would be slow.

    However, the OS would only have to flush the address ranges that have diverged between the forked processes -- that is, ranges that have been written to (and hence copied, via copy-on-write) or re-mmap()'d.

This would be fine for most uses of fork() on Unix, where execve() is called shortly after fork(). Forked processes usually exist only briefly, so there's little need to context-switch between them.
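
As a reminder of that pattern, here's what fork()-then-exec looks like in C. This is a minimal sketch, and the Mill-specific comments are speculation in keeping with the rest of this post:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
      pid_t pid = fork();
      if (pid == -1) {
        perror("fork");
        return 1;
      }
      if (pid == 0) {
        /* Child: initially shares copy-on-write pages with the
           parent, but execv() replaces the address space almost
           immediately, so little divergence accumulates. */
        char *const argv[] = { "/bin/true", NULL };
        execv(argv[0], argv);
        _exit(127);  /* Reached only if execv() fails. */
      }
      /* Parent: the window in which a Mill OS might have to
         context-switch between the two forked address spaces is
         just the fork()-to-exec interval above. */
      int status;
      waitpid(pid, &status, 0);
      return 0;
    }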


Forked processes are sometimes used as an optimisation, though. Android and Linux Chrome fork many child processes from a parent "zygote" process. This is a way to speed up process creation and reduce memory usage by sharing pages. This technique would become a pessimisation on the Mill. Android and Chrome would have to find a different way to do this.

Mill portals

Implementing fork() on the Mill doesn't seem very difficult, then.

However, there is one Mill feature that fork() might interfere with.

Mill provides "portals", a mechanism for one protection domain (a process in the single 64-bit address space) to call into another. A Mill OS would have to ensure that a forked process can't be invoked directly via a portal if that forked process has been context-switched out. The portal would have to be redirected to a routine that waits for the forked process to be context-switched back in before invoking it. Alternatively, a Mill OS could just disallow invoking forked processes via portals entirely.

This doesn't seem like a big problem. Portals are a Mill-specific feature for optimising inter-process calls, whereas fork() is a "legacy" Unix feature. We want fork() for running existing Unix programs, which won't need to define Mill portals.

There might be an issue if a portal invocation needs to return to a forked process. For example, Unix syscalls might be implemented as portal invocations. Maybe these invocations would need to go via a proxy that knows how to context-switch the calling process back in.

Wednesday, 26 March 2014

How to do history-sensitive merges in Git

Merging in Git is usually not history-sensitive. By this I mean: if you're merging branches A and B together, Git looks at the content at the tips of branches A and B, and the content of the common ancestor commit(s) of A and B, but it doesn't look at any of the commits in between. Git just does a 3-way merge. This can make merging more painful than it needs to be.

History-sensitive merging is a merge algorithm that does look at the commit history. It can automatically resolve conflicts in cases where non-history-sensitive merging won't. It can also present conflicts more readably: It can reduce the size of a conflict and show which upstream change conflicted with your local change.

This is a short "how to" for doing a history-sensitive merge in Git.

My case study is: Merging LLVM 3.4 into PNaCl's branch of LLVM, which I did recently.

The problem

PNaCl has a branch of LLVM with various patches applied. (As an aside, we should really upstream those patches, to reduce the difficulty of merging from upstream.) Before the merge, this branch was based on LLVM 3.3 (commit $FROM). We wanted to upgrade it to being based on LLVM 3.4 (commit $TO).

Simply doing

git merge $TO

produced about 160 conflicts, as counted by "git grep '<<<<<<<' | wc -l".

A large number of those conflicts occurred because pnacl-llvm contained various changes cherry-picked from upstream LLVM. Often those changes touched code that was later refactored upstream. That will produce a conflict if we do a 3-way merge.

For example, pnacl-llvm cherry-picked various changes that add test files with "CHECK:" lines. Upstream later refactored these to "CHECK-LABEL:". If we do "git merge $TO", Git will see that the two branches added a file with the same name but different contents. Bam: that's a conflict. Git's 3-way merge doesn't look at the history, so it doesn't notice that both branches contain identical commits that add the file with a "CHECK:" line, and just one of the two branches later modifies that file.

A solution

We can do better just by telling Git to merge each upstream LLVM change one by one:

set -eu  # Stop on error
commits="$(git rev-list --reverse $FROM..$TO)"
for commit in $commits; do
  echo Merging $commit
  git merge --no-edit $commit
done

When this reaches a cherry-picked change, it should merge it with no conflict, and later merge the CHECK-LABEL refactoring with no conflict.

When this does hit a conflict -- i.e. when an upstream change conflicts with a local change -- it shows which upstream change conflicted. This provides more context for resolving the conflict.

This script is restartable. After you've resolved a conflict and git commit'd the resolution, it's fine to re-run the script. It will skip over the already-merged upstream changes.

This approach generates many small merges, usually with small conflicts, which can be resolved incrementally and committed to a branch. That is often easier than resolving a large set of conflicts resulting from one big merge, where it's also difficult to save or reuse a work-in-progress conflict resolution.

The one-by-one merge approach has some disadvantages. If an upstream change conflicts with your local branch and was later reverted upstream, you'll be prompted to resolve the conflict twice: once for the change and once for its revert. If you notice that a change is committed, reverted and then re-applied upstream, it can be quicker to manually git merge the re-applied version.

Going faster

The script above goes quite slowly because it merges each change in turn. An optimisation is: explicitly merge only those upstream changes that modify files that our local branch also modifies (since all other upstream changes will be trivially merged).

set -eu  # Stop on error
files="$(git diff --name-only $FROM origin/master)"
# Relatively slow
git rev-list --reverse $FROM..$TO -- $files > tmp_commits_to_merge
echo $TO >> tmp_commits_to_merge  # Don't miss this commit

for commit in $(cat tmp_commits_to_merge); do
  echo Merging $commit
  git merge --no-edit $commit
done

Cleaning up

This approach produces a lot of Git merge commits. If you don't want to keep this history, you can squash the merges down into a single merge commit:

tree=$(git log --pretty=format:%T --no-walk HEAD)
commit=$(git commit-tree -m 'Commit message for merge...' \
    -p $(git merge-base HEAD origin/master) \
    -p $(git merge-base HEAD upstream/master) \
    $tree)
git push . $commit:my-merge

I believe the term "history-sensitive merge" comes from Darcs, a version control system which implements history-sensitive merging.

Wednesday, 7 August 2013

Handling crashes on Mac OS X: ordering of Mach exceptions versus POSIX signals

Mac OS X is a curious operating system because its kernel is derived from two kernel codebases -- the Mach kernel and a BSD kernel -- that have been glued together.

From these two ancestors, OS X inherits two different mechanisms for processes to handle hardware exceptions (a.k.a. faults):

  • POSIX signals:
    • In-process only.
    • Registered per-process.
    • The handler is always called on the thread that produced the fault.
  • Mach exceptions:
    • Allow both in-process and out-of-process handling.
    • Can be registered per-thread and per-process.
    • The handler is invoked via Mach RPC.

On OS X, it's possible for a process to have both POSIX signal handlers and Mach exception handlers registered.

It's not immediately obvious which of the two handlers will take priority and get invoked first, but experimentation shows that it's the Mach handler. If a process faults with a memory access error, the Mach exception handler for EXC_BAD_ACCESS gets invoked first, and if this handler returns KERN_FAILURE, the POSIX signal handler for SIGBUS will then be invoked.

However, that's not the full story.

OS X has a built-in Crash Reporter service which will create a crash dump if an application crashes and pop up a dialog box. Crash Reporter works via Mach exception handling and runs out-of-process. An application will normally have Crash Reporter's crash handler registered as one of its default Mach exception handlers. But what stops this handler from interfering with normal use of POSIX signals, if Mach exception handlers take priority over POSIX signals?

The answer is that OS X's kernel has three steps for handling a hardware exception:

  1. First-chance Mach exception handler: The kernel tries to invoke the Mach exception handler registered for the type of fault that occurred, e.g. EXC_BAD_ACCESS. If there's no handler registered, it skips this step. If the handler returns KERN_SUCCESS, the kernel resumes the thread that faulted.

  2. POSIX signal handler: If the Mach exception handler was absent or returned KERN_FAILURE, the kernel tries to invoke the POSIX signal handler for the type of fault that occurred, e.g. SIGBUS.

  3. Second-chance Mach exception handler: If invoking the SIGBUS handler fails (either because no SIGBUS handler is registered, or because writing the signal frame failed, or because the signal is blocked by the thread's signal mask), then the kernel starts to terminate the process, and it tries to invoke the Mach exception handler registered for EXC_CRASH.

    Crash Reporter's handler is registered via EXC_CRASH.

    Since EXC_CRASH is generated on the code path for process termination, EXC_CRASH cannot be handled in-process. At this point, the process has been partially shut down. Its threads are not running any more, but some of its state has not been destroyed yet and so can be read by a crash handler.

    If you try to handle EXC_CRASH in-process, the process will just hang. That's presumably because the Mach RPC for invoking the EXC_CRASH handler gets sent, and the thread that would receive it has been suspended, but the RPC doesn't fail because the process's Mach ports have not been freed yet.

    abnormal_exit_notify() in osfmk/kern/exception.c is what implements sending EXC_CRASH.

    Core dumps are handled by the same code path for abnormal process termination.

EXC_CRASH seems to have been added in OS X 10.5 ("Leopard"):

"On Tiger and earlier, the CrashReporter would generate a crash report for a Mach exception even if it was later delivered as a signal and handled by the process. Since Leopard, though, a crash report is only generated if the signal is not handled. So, handling signals is sufficient to suppress crash reports these days." (Ken Thomases, September 2009)

Chromium has some code to do just that. Its DisableOSCrashDumps() function will register a signal handler in order to suppress Crash Reporter's crash reports. Its signal handler calls _exit() to prevent Crash Reporter's EXC_CRASH handler from kicking in. Another way to do this would have been to set the process's exception port for EXC_CRASH to MACH_PORT_NULL.
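
A minimal sketch of that trick (illustrative, not Chromium's actual DisableOSCrashDumps() code): install handlers for the fault signals that exit before the kernel can reach the abnormal-termination path:

    #include <signal.h>
    #include <unistd.h>

    static void crash_handler(int sig) {
      /* Exiting here means the process never reaches the
         abnormal-termination path, so EXC_CRASH is never raised
         and Crash Reporter's handler never runs.  _exit() is
         async-signal-safe. */
      (void) sig;
      _exit(1);
    }

    static void disable_os_crash_dumps(void) {
      struct sigaction sa;
      sigemptyset(&sa.sa_mask);
      sa.sa_flags = 0;
      sa.sa_handler = crash_handler;
      sigaction(SIGSEGV, &sa, NULL);
      sigaction(SIGBUS, &sa, NULL);
      sigaction(SIGILL, &sa, NULL);
    }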

Saturday, 29 June 2013

Simplifying LLVM IR for PNaCl

Lately I've been working on Portable Native Client ("PNaCl" for short).

Native Client (NaCl) is a sandboxing system that allows safe execution of native code in a web browser -- typically C/C++ code compiled to x86-32, x86-64 or ARM native code (with some support for MIPS too). The problem with NaCl has always been that you have to compile your program multiple times, once for each target architecture. PNaCl aims to solve that, so that your program can be compiled once to an architecture-neutral executable which is then translated to x86 or ARM inside the web browser.

PNaCl is based on LLVM, and PNaCl's executable format is based on LLVM's IR ("Intermediate Representation") language and its binary encoding, LLVM bitcode. LLVM's IR language is quite complex, so we're pruning out, or expanding out, a lot of its features in order to ensure that PNaCl will be maintainable in the long term and to reduce executable sizes.

As an example, consider this fragment of C code:

    struct foo {
      int x;
      int y;
    };

    void func(struct foo *ptr) {
      ptr->y = 123;
    }

Normally in LLVM this compiles to the following LLVM assembly code:

    %struct.foo = type { i32, i32 }

    define void @func(%struct.foo* nocapture %ptr) #0 {
      %y = getelementptr inbounds %struct.foo* %ptr, i32 0, i32 1
      store i32 123, i32* %y, align 4, !tbaa !0
      ret void
    }

    attributes #0 = { nounwind }
    !0 = metadata !{metadata !"int", metadata !1}

With PNaCl, we strip out the types, names and metadata so that this becomes:

    define internal void @67(i32) {
      %2 = add i32 %0, 4
      %3 = inttoptr i32 %2 to i32*
      store i32 123, i32* %3, align 1
      ret void
    }

The definition of the struct type goes away. The "getelementptr" instruction is how LLVM handles pointer arithmetic for indexing into structs and arrays -- this gets expanded out and is replaced with pointer arithmetic on integers. The "%struct.foo*" pointer type is replaced with the i32 type. The "inttoptr" instruction remains as a vestige of LLVM's type system: every "load" or "store" instruction in a PNaCl executable is preceded by an "inttoptr" so that the program remains well-typed by LLVM's rules.
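
In C terms, the simplified code is doing roughly the following (a hand-written illustration of the same lowering, not compiler output):

    #include <stdint.h>

    /* Pointers in the PNaCl form are plain 32-bit integers, so the
       field access "ptr->y = 123" becomes integer arithmetic plus a
       cast back to a pointer type just before the store. */
    void func(uint32_t ptr) {
      uint32_t y_addr = ptr + 4;               /* expanded getelementptr */
      *(int32_t *) (uintptr_t) y_addr = 123;   /* inttoptr + store */
    }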

This is actually rather similar to what Emscripten and asm.js do. Like PNaCl, Emscripten is a system for running C/C++ programs in the web browser, but it compiles programs down to Javascript rather than using LLVM bitcode as an interchange format. Both PNaCl and Emscripten use Clang as a C/C++ front end, and run LLVM optimisation passes on the resulting LLVM IR code. In compiling to Javascript, Emscripten performs similar simplification steps for lowering IR, and asm.js codifies the resulting Javascript subset.

Saturday, 29 September 2012

Native Client's NTDLL patch on x86-64 Windows

Last year, I found a security hole in Native Client on 64-bit Windows that could be used to escape from the Native Client sandbox. Fortunately I found the hole before Native Client was enabled by default in Chrome. I recently wrote up the document below to explain the bug and how we fixed it.

Native Client currently relies on a small, process-local, in-memory patch to NTDLL on 64-bit versions of Windows in order to prevent a problem that would create a hole in Native Client's x86-64 sandbox.

The problem

The problem arises because Windows does not have an equivalent of Unix's sigaltstack() system call.

On Unix, a POSIX signal handler may be registered using signal() or sigaction(). When a process faults (e.g. as a result of a memory access error or an illegal instruction), the kernel passes control to the signal handler that is registered for the process. Normally, the signal handler is run on the stack of the thread that faulted. However, if an "alternate signal stack" has been registered using sigaltstack(), the signal handler will run on that stack instead.

On Windows, "vectored exception handlers" are similar to Unix signal handlers, except that a vectored exception handler always gets run on the thread's current stack. This might have been OK, except that Windows has a different division of responsibility between kernel and userland compared with Unix. Windows does more in userland, and this creates problems for NaCl.

On Unix, sigaction() is a syscall that user code calls that records a user-code signal handler function in a kernel data structure. If no signal handler is registered by user code, then when a fault occurs in a process, the kernel never passes control back to userland code in that process.

On Windows, AddVectoredExceptionHandler() is provided in userland by a DLL rather than being provided by the kernel. AddVectoredExceptionHandler() adds the handler function to a list in userland memory. When a fault occurs in the process, the Windows kernel passes control to a function called KiUserExceptionDispatcher in NTDLL, in userland. KiUserExceptionDispatcher checks whether a handler function was registered via AddVectoredExceptionHandler() (or via various other mechanisms), and if so, it calls the handler; otherwise, it terminates the process. The kernel therefore calls KiUserExceptionDispatcher regardless of whether any vectored exception handlers were registered.

This is a problem for NaCl because when we are running sandboxed code, %rsp points into untrusted address space. If sandboxed code faults on a bad memory access or a HLT instruction (which it is allowed to do), Windows calls KiUserExceptionDispatcher, which then runs on the untrusted stack. Since NaCl allows multi-threading, other sandboxed threads can modify the untrusted stack that KiUserExceptionDispatcher is running on.

Furthermore, KiUserExceptionDispatcher is a complex function which calls other NTDLL functions internally. These functions return by using the RET instruction to pop a return address off the stack, which means that control flow is determined by values stored on the untrusted stack. An attacker running as sandboxed code in another thread can modify the memory location containing a return address and so cause the NTDLL routine to jump to an address of the attacker's choice. This allows the attacker to cause a jump to a non-bundle-aligned address, which allows the attacker to escape from the NaCl sandbox.

The fix: patching NTDLL

Our fix is to patch the KiUserExceptionDispatcher entry point inside NTDLL so that this routine will safely terminate the process.

Our patching is process-local: it does not affect the NTDLL.DLL that other processes use. We only modify the in-memory copy of NTDLL.DLL that is loaded into the NaCl process.

Patching is necessary because the address of KiUserExceptionDispatcher is fixed in the Windows kernel at boot time. We don't have any way to tell the kernel to use a routine at a different address. The kernel randomizes the address of NTDLL.DLL at boot time, but thereafter NTDLL.DLL is loaded at the same address in all processes and the address of KiUserExceptionDispatcher is fixed.

What do we do to terminate the process safely? While we could overwrite the start of KiUserExceptionDispatcher with a HLT instruction, that would simply cause the kernel to re-run KiUserExceptionDispatcher, because the kernel does not track whether a thread is currently handling an exception. The kernel would repeatedly push a stack frame containing saved register state and re-call KiUserExceptionDispatcher until the thread runs out of stack. Although this would eventually terminate the process, it would be slower than we would like.

So instead, the original version of our patch overwrites the start of KiUserExceptionDispatcher with:

  mov $0, %esp
  hlt

We set the stack pointer to zero first so that the kernel can't write another stack frame. If a fault occurs when the kernel is unable to write a stack frame to memory, we observe that the kernel does not attempt to call KiUserExceptionDispatcher and instead it just terminates the process.
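
Concretely, applying such a process-local patch might look roughly like this. This is a sketch, not NaCl's actual code; it assumes NTDLL exports the KiUserExceptionDispatcher symbol:

    #include <windows.h>
    #include <string.h>

    /* 0xBC + a 32-bit zero immediate is "mov $0, %esp" (writing %esp
       also zeroes the upper half of %rsp on x86-64); 0xF4 is "hlt". */
    static const unsigned char kPatch[] = {
      0xBC, 0x00, 0x00, 0x00, 0x00,  /* mov $0, %esp */
      0xF4,                          /* hlt */
    };

    static void PatchKiUserExceptionDispatcher(void) {
      void *addr = (void *) GetProcAddress(
          GetModuleHandleA("ntdll.dll"), "KiUserExceptionDispatcher");
      DWORD old_prot;
      /* This modifies only our process's in-memory copy of NTDLL:
         the page becomes private to this process via copy-on-write. */
      VirtualProtect(addr, sizeof(kPatch), PAGE_EXECUTE_READWRITE,
                     &old_prot);
      memcpy(addr, kPatch, sizeof(kPatch));
      VirtualProtect(addr, sizeof(kPatch), old_prot, &old_prot);
      FlushInstructionCache(GetCurrentProcess(), addr, sizeof(kPatch));
    }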

Breakpad crash reporting

We extended the patch to support crash reporting for crashes in trusted code. (In Chrome, this uses the Breakpad crash reporting system.)

The extended version of the patched KiUserExceptionDispatcher looks at a thread-local variable to check whether the fault occurred inside trusted code or sandboxed code. If trusted code faulted, we know we are safely running on a trusted stack, and we pass control back to the original unpatched version of KiUserExceptionDispatcher (which eventually causes the Breakpad crash handler to run and report the crash). Otherwise, we assume %rsp points to an unsafe untrusted stack, and we terminate the process using the same "mov $0, %esp; hlt" sequence as before.

x86-32 Windows

The problem does not arise on 32-bit versions of Windows because of the way NaCl's x86-32 sandbox uses segment registers.

On 32-bit Windows, when sandboxed code faults, the kernel notices that the %ss register contains a non-default value, and it terminates the process without attempting to write a stack frame or call KiUserExceptionDispatcher.

Limitations

Our NTDLL patch has a couple of limitations:

  • We are not using a stable interface. KiUserExceptionDispatcher is a Windows-internal interface between the Windows kernel and NTDLL, and it might change in future versions of Windows. We are relying on undocumented behaviour.
  • We don't control what data the Windows kernel writes to the untrusted stack. This data is momentarily visible to other sandboxed threads before the process is terminated, and it would be easily visible in embeddings of NaCl that support shared memory and multiple processes. We don't think the kernel writes any sensitive data in this stack frame, and it seems unlikely that it would do so in the future, but it's still possible.

Possible alternatives

There are a couple of alternatives to patching NTDLL:

  • Using the Windows debugging API: It is possible to catch a fault in sandboxed code before KiUserExceptionDispatcher is run using the Windows debugging API. We use this API for implementing untrusted hardware exception handling.

    However, using this API comes at a cost: if a process has a debugger attached, Windows will temporarily suspend the whole process each time it creates or exits a thread. This would slow down thread creation and teardown.

    In practice, we wouldn't want to completely replace the NTDLL patch with use of the Windows debugging API. For defence-in-depth, we would want to use both.

  • Changing the sandbox model: We could change the sandbox model so that %rsp does not point into untrusted address space. One option is to use a different register as the untrusted stack pointer, and disallow use of CALL, PUSH and POP. Another option would be to have %rsp point to a thread-private stack (outside of the sandbox's usual address space). Both options would be quite invasive changes to the sandbox model.

Conclusions

Windows runs an internal function on the userland stack whenever a hardware exception occurs. This behaviour does not appear to be documented anywhere, which suggests that Windows treats this use of the userland stack as an implementation detail rather than part of the interface. For most applications, this use of the userland stack does not matter. Native Client is unusual in that this stack usage creates a hole in NaCl's sandbox, which we have to work around.

In addition to being undocumented, Windows' use of the userland stack can be surprising to people from a Unix background, given that Unix signal handlers behave differently. However, when one considers the complexity of Windows' exception handling interfaces -- which include Structured Exception Handlers, Vectored Exception Handlers and UnhandledExceptionFilters -- it makes more sense that this complex logic would live in userland.

When it comes to the kernel/userland boundary, what can be an implementation detail for one person can be an important interface detail for another. The lesson is that these details are often critical for the security of a sandbox.

References

The original sandbox hole: issue 1633 (filed in April 2011 and fixed in June 2011)

Breakpad crash reporting support for 64-bit Windows: issue 2095

Untrusted hardware exception handling for 64-bit Windows: issue 2536

Documentation for AddVectoredExceptionHandler() on MSDN