Wednesday, 23 July 2014

Implementing fork() on the Mill CPU

The Mill is a new CPU architecture that claims to provide high performance at a much better performance-per-watt than conventional CPUs that use out-of-order execution.

The Mill achieves this by making various architectural simplifications. One of those is to remove the TLB from the fast path of memory accesses. Rather than having the TLB between the CPU core and the cache, the Mill's TLB is between the cache and main memory.

OS models

This means the Mill is best suited for running Single Address Space operating systems (SASOS). The intent is that different processes will live at different addresses within a shared 64-bit address space. A process runs with permissions to access restricted ranges of this address space. Switching between processes is therefore just a matter of switching those permissions, which is fast on the Mill. It doesn't involve an expensive TLB flush (as on a conventional OS), and it doesn't involve flushing the Mill's virtually-indexed, virtually-tagged (VIVT) cache.

This runs into a problem if we want to run Unix programs that use fork(), though. Use of fork() assumes that multiple processes will want to use the same virtual addresses.

The Mill developers have said they have a scheme for handling fork(), but they haven't said what it is, so I'm going to speculate. :-)

Context switching

If you create forked processes that share a range of virtual address space, a Mill OS can just flush the TLB and cache for that range whenever it needs to context-switch between those processes.

Basically, a Mill OS can act like a Separate Address Space OS when handling forked processes.
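
As a rough sketch -- and this is purely illustrative, since the primitives below (mill_flush_cache_range(), mill_flush_tlb_range(), mill_switch_turf()) are hypothetical names I've made up rather than any real Mill or OS interface -- a context switch between forked processes might look like this:

#include <stddef.h>
#include <stdint.h>

struct addr_range { uint64_t start; uint64_t size; };
struct process;

/* Hypothetical kernel primitives, invented for this sketch: */
void mill_flush_cache_range(struct addr_range r);
void mill_flush_tlb_range(struct addr_range r);
void mill_switch_turf(struct process *next);  /* switch permission regions */

/* Context switch between two processes forked from each other: flush the
 * virtually-tagged cache and the TLB for the address range they share, then
 * switch permissions as on any other Mill context switch. */
void switch_between_forked(struct process *next, struct addr_range shared) {
  mill_flush_cache_range(shared);
  mill_flush_tlb_range(shared);
  mill_switch_turf(next);
}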

Switching costs

How expensive that would be depends on how the TLB and cache work.

  • In conventional CPUs, the TLB is per-core. The Mill's TLB might be shared between cores, though, if the Mill has higher-level caches that are both virtually-tagged and shared between cores.

    If that's the case, forked processes wouldn't be able to run concurrently on multiple cores.

  • Flushing the TLB and/or cache when context switching between forked processes would be slow.

    However, the OS would only have to flush the address ranges that have diverged between the forked processes -- that is, those that have been written to (and hence copied, under copy-on-write) or mmap()'d.

This would be fine for most uses of fork() on Unix, where execve() is called shortly after fork(). Forked processes usually exist only briefly, so there's little need to context-switch between them.

Zygotes

Forked processes are sometimes used as an optimisation, though. Android and Linux Chrome fork many child processes from a parent "zygote" process. This is a way to speed up process creation and reduce memory usage by sharing pages. This technique would become a pessimisation on the Mill. Android and Chrome would have to find a different way to do this.

Mill portals

Implementing fork() on the Mill doesn't seem very difficult, then.

However, there is one Mill feature that fork() might interfere with.

Mill provides "portals", a mechanism for one protection domain (a process in the single 64-bit address space) to call into another. A Mill OS would have to ensure that a forked process can't be invoked directly via a portal if that forked process has been context-switched out. The portal would have to be redirected to a routine that waits for the forked process to be context-switched back in before invoking it. Alternatively, a Mill OS could just disallow invoking forked processes via portals entirely.

This doesn't seem like a big problem. Portals are a Mill-specific feature for optimising inter-process calls, whereas fork() is a "legacy" Unix feature. We want fork() for running existing Unix programs, which won't need to define Mill portals.

There might be an issue if a portal invocation needs to return to a forked process. For example, Unix syscalls might be implemented as portal invocations. Maybe these invocations would need to go via a proxy that knows how to context-switch the calling process back in.

Wednesday, 26 March 2014

How to do history-sensitive merges in Git

Merging in Git is usually not history-sensitive. By this I mean: if you're merging branches A and B together, Git looks at the content at the tips of branches A and B, and the content of the common ancestor commit(s) of A and B, but it doesn't look at any commits in between. Git just does a 3-way merge. This can make merging more painful than it needs to be.

History-sensitive merging is a merge algorithm that does look at the commit history. It can automatically resolve conflicts in cases where non-history-sensitive merging won't. It can also present conflicts more readably: it can reduce the size of a conflict and show which upstream change conflicted with your local change.

This is a short "how to" for doing a history-sensitive merge in Git.

My case study is merging LLVM 3.4 into PNaCl's branch of LLVM, which I did recently.

The problem

PNaCl has a branch of LLVM with various patches applied. (As an aside, we should really upstream those patches, to reduce the difficulty of merging from upstream.) Before the merge, this branch was based on LLVM 3.3 (commit $FROM). We wanted to upgrade it to being based on LLVM 3.4 (commit $TO).

Simply doing

git merge $TO

produced about 160 conflicts, as counted by "git grep '<<<<<<<' | wc -l".

A large number of those conflicts occurred because pnacl-llvm contained various changes cherry-picked from upstream LLVM. Often those changes touched code that was later refactored upstream. That will produce a conflict if we do a 3-way merge.

For example, pnacl-llvm cherry-picked various changes that add test files with "CHECK:" lines. Upstream later refactored these to "CHECK-LABEL:". If we do "git merge $TO", Git will see that the two branches added a file with the same name but different contents. Bam: that's a conflict. Git's 3-way merge doesn't look at the history, so it doesn't notice that both branches contain identical commits that add the file with a "CHECK:" line, and just one of the two branches later modifies that file.

A solution

We can do better just by telling Git to merge each upstream LLVM change one by one:

set -eu  # Stop on error
commits="$(git rev-list --reverse $FROM..$TO)"
for commit in $commits; do
  echo Merging $commit
  git merge --no-edit $commit
done

When this reaches a cherry-picked change, it should merge it with no conflict, and later merge the CHECK-LABEL refactoring with no conflict.

When this does hit a conflict -- i.e. when an upstream change conflicts with a local change -- it shows which upstream change conflicted. This provides more context for resolving the conflict.

This script is restartable. After you've resolved a conflict and git commit'd the resolution, it's fine to re-run the script. It will skip over the already-merged upstream changes.

This approach generates many small merges, usually with small conflicts, which can be resolved incrementally and committed to a branch. That is often easier than resolving a large set of conflicts resulting from one big merge, where it's also difficult to save or reuse a work-in-progress conflict resolution.

The one-by-one merge approach has some disadvantages. If a change is committed and reverted upstream, and conflicts with your local branch, you'll be prompted to resolve the conflict twice, back and forth. If you notice the change is committed, reverted and then re-applied upstream, it can be quicker to manually git merge the re-applied version.

Going faster

The script above goes quite slowly because it merges each change in turn. An optimisation is: explicitly merge only those upstream changes that modify files that our local branch also modifies (since all other upstream changes will be trivially merged).

set -eu  # Stop on error
files="$(git diff --name-only $FROM origin/master)"
# Relatively slow
git rev-list --reverse $FROM..$TO -- $files > tmp_commits_to_merge
echo $TO >> tmp_commits_to_merge  # Don't miss this commit

for commit in $(cat tmp_commits_to_merge); do
  echo Merging $commit
  git merge --no-edit $commit
done

Cleaning up

This approach produces a lot of Git merge commits. If you don't want to keep this history, you can squash the merges down into a single merge commit:

tree=$(git log --pretty=format:%T --no-walk HEAD)
commit=$(git commit-tree -m 'Commit message for merge...' \
    -p $(git merge-base HEAD origin/master) \
    -p $(git merge-base HEAD upstream/master) \
    $tree)
git push . $commit:my-merge

I believe the term "history-sensitive merge" comes from Darcs, a version control system which implements history-sensitive merging.

Wednesday, 7 August 2013

Handling crashes on Mac OS X: ordering of Mach exceptions versus POSIX signals

Mac OS X is a curious operating system because its kernel is derived from two kernel codebases -- the Mach kernel and a BSD kernel -- that have been glued together.

From these two ancestors, OS X inherits two different mechanisms for processes to handle hardware exceptions (a.k.a. faults):

  • POSIX signals:
    • In-process only.
    • Registered per-process.
    • The handler is always called on the thread that produced the fault.
  • Mach exceptions:
    • Allow both in-process and out-of-process handling.
    • Can be registered per-thread and per-process.
    • The handler is invoked via Mach RPC.

On OS X, it's possible for a process to have both POSIX signal handlers and Mach exception handlers registered.

It's not immediately obvious which of the two handlers will take priority and get invoked first, but experimentation shows that it's the Mach handler. If a process faults with a memory access error, the Mach exception handler for EXC_BAD_ACCESS gets invoked first, and if this handler returns KERN_FAILURE, the POSIX signal handler for SIGBUS will then be invoked.
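
For reference, the POSIX half of that experiment is just an ordinary sigaction() registration. Here's a minimal sketch (the Mach half -- allocating a port, calling task_set_exception_ports() and servicing the port with a message loop -- is omitted):

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Handle the signals a memory access error can be delivered as.  If a Mach
 * exception handler is also registered for EXC_BAD_ACCESS, it sees the fault
 * first; this handler only runs if the Mach handler is absent or returns
 * KERN_FAILURE. */
static void fault_handler(int sig) {
  const char msg[] = "POSIX signal handler ran\n";
  write(STDERR_FILENO, msg, sizeof(msg) - 1);
  _exit(1);
}

int main(void) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = fault_handler;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGBUS, &sa, NULL);
  sigaction(SIGSEGV, &sa, NULL);

  *(volatile int *) 1 = 0;  /* deliberately fault */
  return 0;
}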

However, that's not the full story.

OS X has a built-in Crash Reporter service which will create a crash dump if an application crashes and pop up a dialog box. Crash Reporter works via Mach exception handling and runs out-of-process. An application will normally have Crash Reporter's crash handler registered as one of its default Mach exception handlers. But what stops this handler from interfering with normal use of POSIX signals, if Mach exception handlers take priority over POSIX signals?

The answer is that OS X's kernel has three steps for handling a hardware exception:

  1. First-chance Mach exception handler: The kernel tries to invoke the Mach exception handler registered for the type of fault that occurred, e.g. EXC_BAD_ACCESS. If there's no handler registered, it skips this step. If the handler returns KERN_SUCCESS, the kernel resumes the thread that faulted.

  2. POSIX signal handler: If the Mach exception handler was absent or returned KERN_FAILURE, the kernel tries to invoke the POSIX signal handler for the type of fault that occurred, e.g. SIGBUS.

  3. Second-chance Mach exception handler: If invoking the SIGBUS handler fails (either because no SIGBUS handler is registered, or because writing the signal frame failed, or because the signal is blocked by the thread's signal mask), then the kernel starts to terminate the process, and it tries to invoke the Mach exception handler registered for EXC_CRASH.

    Crash Reporter's handler is registered via EXC_CRASH.

    Since EXC_CRASH is generated on the code path for process termination, EXC_CRASH cannot be handled in-process. At this point, the process has been partially shut down. Its threads are not running any more, but some of its state has not been destroyed yet and so can be read by a crash handler.

    If you try to handle EXC_CRASH in-process, the process will just hang. That's presumably because the Mach RPC for invoking the EXC_CRASH handler gets sent, and the thread that would receive it has been suspended, but the RPC doesn't fail because the process's Mach ports have not been freed yet.

    abnormal_exit_notify() in osfmk/kern/exception.c is what implements sending EXC_CRASH.

    Core dumps are handled by the same code path for abnormal process termination.

EXC_CRASH seems to have been added in OS X 10.5 ("Leopard"):

"On Tiger and earlier, the CrashReporter would generate a crash report for a Mach exception even if it was later delivered as a signal and handled by the process. Since Leopard, though, a crash report is only generated if the signal is not handled. So, handling signals is sufficient to suppress crash reports these days." (Ken Thomases, September 2009)

Chromium has some code to do just that. Its DisableOSCrashDumps() function will register a signal handler in order to suppress Crash Reporter's crash reports. Its signal handler calls _exit() to prevent Crash Reporter's EXC_CRASH handler from kicking in. Another way to do this would have been to set the process's exception port for EXC_CRASH to MACH_PORT_NULL.
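
That alternative is nearly a one-liner. A minimal sketch, using the standard Mach headers (error checking omitted):

#include <mach/mach.h>

/* Clear the task's EXC_CRASH exception port so that no handler -- including
 * Crash Reporter's -- is invoked when the process dies abnormally. */
void disable_crash_reporter(void) {
  task_set_exception_ports(mach_task_self(),
                           EXC_MASK_CRASH,
                           MACH_PORT_NULL,
                           EXCEPTION_DEFAULT,
                           THREAD_STATE_NONE);
}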

Saturday, 29 June 2013

Simplifying LLVM IR for PNaCl

Lately I've been working on Portable Native Client ("PNaCl" for short).

Native Client (NaCl) is a sandboxing system that allows safe execution of native code in a web browser -- typically C/C++ code compiled to x86-32, x86-64 or ARM native code (with some support for MIPS too). The problem with NaCl has always been that you have to compile your program multiple times, once for each target architecture. PNaCl aims to solve that, so that your program can be compiled once to an architecture-neutral executable which is then translated to x86 or ARM inside the web browser.

PNaCl is based on LLVM, and PNaCl's executable format is based on LLVM's IR ("Intermediate Representation") language and its binary encoding, LLVM bitcode. LLVM's IR language is quite complex, so we're pruning out and expanding out a lot of features in order to ensure that PNaCl will be maintainable in the long term and to reduce executable sizes.

As an example, consider this fragment of C code:

    struct foo {
      int x;
      int y;
    };

    void func(struct foo *ptr) {
      ptr->y = 123;
    }

Normally in LLVM this compiles to the following LLVM assembly code:

    %struct.foo = type { i32, i32 }

    define void @func(%struct.foo* nocapture %ptr) #0 {
    entry:
      %y = getelementptr inbounds %struct.foo* %ptr, i32 0, i32 1
      store i32 123, i32* %y, align 4, !tbaa !0
      ret void
    }

    attributes #0 = { nounwind }
    !0 = metadata !{metadata !"int", metadata !1}

With PNaCl, we strip out the types, names and metadata so that this becomes:

    define internal void @67(i32) {
      %2 = add i32 %0, 4
      %3 = inttoptr i32 %2 to i32*
      store i32 123, i32* %3, align 1
      ret void
    }

The definition of the struct type goes away. The "getelementptr" instruction is how LLVM handles pointer arithmetic for indexing into structs and arrays -- this gets expanded out and is replaced with pointer arithmetic on integers. The "%struct.foo*" pointer type is replaced with the i32 type. The "inttoptr" instruction remains as a vestige of LLVM's type system: every "load" or "store" instruction in a PNaCl executable is preceded by an "inttoptr" so that the program remains well-typed by LLVM's rules.

This is actually rather similar to what Emscripten and asm.js do. Like PNaCl, Emscripten is a system for running C/C++ programs in the web browser, but it compiles programs down to Javascript rather than using LLVM bitcode as an interchange format. Both PNaCl and Emscripten use Clang as a C/C++ front end, and run LLVM optimisation passes on the resulting LLVM IR code. In compiling to Javascript, Emscripten performs similar simplification steps for lowering IR, and asm.js codifies the resulting Javascript subset.

Saturday, 29 September 2012

Native Client's NTDLL patch on x86-64 Windows

Last year, I found a security hole in Native Client on 64-bit Windows that could be used to escape from the Native Client sandbox. Fortunately I found the hole before Native Client was enabled by default in Chrome. I recently wrote up the document below to explain the bug and how we fixed it.


Native Client currently relies on a small, process-local, in-memory patch to NTDLL on 64-bit versions of Windows in order to prevent a problem that would create a hole in Native Client's x86-64 sandbox.

The problem

The problem arises because Windows does not have an equivalent of Unix's sigaltstack() system call.

On Unix, a POSIX signal handler may be registered using signal() or sigaction(). When a process faults (e.g. as a result of a memory access error or an illegal instruction), the kernel passes control to the signal handler that is registered for the process. Normally, the signal handler is run on the stack of the thread that faulted. However, if an "alternate signal stack" has been registered using sigaltstack(), the signal handler will run on that stack instead.
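
For comparison, here's roughly what registering an alternate signal stack looks like on Unix (a minimal sketch):

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void handler(int sig) {
  /* Runs on the alternate stack, so it works even if the faulting thread's
   * stack pointer is bad. */
  _exit(1);
}

int main(void) {
  stack_t ss;
  ss.ss_sp = malloc(SIGSTKSZ);
  ss.ss_size = SIGSTKSZ;
  ss.ss_flags = 0;
  sigaltstack(&ss, NULL);

  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = handler;
  sa.sa_flags = SA_ONSTACK;  /* deliver this signal on the alternate stack */
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, NULL);
  return 0;
}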

On Windows, "vectored exception handlers" are similar to Unix signal handlers, except that a vectored exception handler always gets run on the thread's current stack. This might have been OK, except that Windows has a different division of responsibility between kernel and userland compared with Unix. Windows does more in userland, and this creates problems for NaCl.

On Unix, sigaction() is a syscall that user code calls that records a user-code signal handler function in a kernel data structure. If no signal handler is registered by user code, then when a fault occurs in a process, the kernel never passes control back to userland code in that process.

On Windows, AddVectoredExceptionHandler() is provided in userland by a DLL rather than being provided by the kernel. AddVectoredExceptionHandler() adds the handler function to a list in userland memory. When a fault occurs in the process, the Windows kernel passes control to a function called KiUserExceptionDispatcher in NTDLL, in userland. KiUserExceptionDispatcher checks whether a handler function was registered via AddVectoredExceptionHandler() (or via various other mechanisms), and if so, it calls the handler; otherwise, it terminates the process. The kernel therefore calls KiUserExceptionDispatcher regardless of whether any vectored exception handlers were registered.
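
For concreteness, registering a handler from user code looks something like the sketch below. The point for NaCl is that KiUserExceptionDispatcher runs on the faulting thread's current stack whether or not any handler has been registered:

#include <windows.h>
#include <stdio.h>

/* A vectored exception handler.  When a fault occurs, the kernel calls
 * KiUserExceptionDispatcher in NTDLL (in userland, on the faulting thread's
 * stack), which walks the list of registered handlers and calls each one. */
static LONG WINAPI my_handler(PEXCEPTION_POINTERS info) {
  if (info->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
    fprintf(stderr, "access violation at %p\n",
            info->ExceptionRecord->ExceptionAddress);
  }
  return EXCEPTION_CONTINUE_SEARCH;  /* let later handlers run */
}

int main(void) {
  AddVectoredExceptionHandler(1 /* call this handler first */, my_handler);
  return 0;
}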

This is a problem for NaCl because when we are running sandboxed code, %rsp points into untrusted address space. If sandboxed code faults on a bad memory access or a HLT instruction (which it is allowed to do), Windows calls KiUserExceptionDispatcher, which then runs on the untrusted stack. Since NaCl allows multi-threading, other sandboxed threads can modify the untrusted stack that KiUserExceptionDispatcher is running on.

Furthermore, KiUserExceptionDispatcher is a complex function which calls other NTDLL functions internally. These functions return by using the RET instruction to pop a return address off the stack, which means that control flow is determined by values stored on the untrusted stack. An attacker running as sandboxed code in another thread can modify the memory location containing a return address and so cause the NTDLL routine to jump to an address of the attacker's choice. This allows the attacker to cause a jump to a non-bundle-aligned address, which allows the attacker to escape from the NaCl sandbox.

The fix: patching NTDLL

Our fix is to patch the KiUserExceptionDispatcher entry point inside NTDLL so that this routine will safely terminate the process.

Our patching is process-local: it does not affect the NTDLL.DLL that other processes use. We only modify the in-memory copy of NTDLL.DLL that is loaded into the NaCl process.

Patching is necessary because the address of KiUserExceptionDispatcher is fixed in the Windows kernel at boot time. We don't have any way to tell the kernel to use a routine at a different address. The kernel randomizes the address of NTDLL.DLL at boot time, but thereafter NTDLL.DLL is loaded at the same address in all processes and the address of KiUserExceptionDispatcher is fixed.

What do we do to terminate the process safely? While we could overwrite the start of KiUserExceptionDispatcher with a HLT instruction, this instruction will simply cause the kernel to re-run KiUserExceptionDispatcher again, because the kernel does not track whether a thread is currently handling an exception. This would cause the kernel to repeatedly push a stack frame containing saved register state and re-call KiUserExceptionDispatcher until the thread runs out of stack. Although this will eventually terminate the process, it would be slower than we would like.

So instead, the original version of our patch overwrites the start of KiUserExceptionDispatcher with:

  mov $0, %esp
  hlt

We set the stack pointer to zero first so that the kernel can't write another stack frame. We observe that if the kernel is unable to write a stack frame to memory when a fault occurs, it does not attempt to call KiUserExceptionDispatcher and instead just terminates the process.
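
In outline, applying the patch from inside the process looks something like this. This is a simplified sketch rather than the actual NaCl code, with error checking omitted:

#include <windows.h>
#include <string.h>

/* Overwrite the first bytes of NTDLL's KiUserExceptionDispatcher with:
 *   mov $0, %esp   (bc 00 00 00 00  -- also zeroes %rsp in 64-bit mode)
 *   hlt            (f4)
 * so that a fault while running sandboxed code terminates the process. */
static void patch_ntdll(void) {
  static const unsigned char patch[] = { 0xbc, 0x00, 0x00, 0x00, 0x00, 0xf4 };
  HMODULE ntdll = GetModuleHandleA("ntdll.dll");
  void *target = (void *) GetProcAddress(ntdll, "KiUserExceptionDispatcher");
  DWORD old_prot;
  VirtualProtect(target, sizeof(patch), PAGE_EXECUTE_READWRITE, &old_prot);
  memcpy(target, patch, sizeof(patch));
  VirtualProtect(target, sizeof(patch), old_prot, &old_prot);
  FlushInstructionCache(GetCurrentProcess(), target, sizeof(patch));
}

int main(void) {
  patch_ntdll();
  return 0;
}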

Breakpad crash reporting

We extended the patch to support crash reporting for crashes in trusted code. (In Chrome, this uses the Breakpad crash reporting system.)

The extended version of the patched KiUserExceptionDispatcher looks at a thread-local variable to check whether the fault occurred inside trusted code or sandboxed code. If trusted code faulted, we know we are safely running on a trusted stack, and we pass control back to the original unpatched version of KiUserExceptionDispatcher (which eventually causes the Breakpad crash handler to run and report the crash). Otherwise, we assume %rsp points to an unsafe untrusted stack, and we terminate the process using the same "mov $0, %esp; hlt" sequence as before.

x86-32 Windows

The problem does not arise on 32-bit versions of Windows because of the way NaCl's x86-32 sandbox uses segment registers.

On 32-bit Windows, when sandboxed code faults, the kernel notices that the %ss register contains a non-default value, and it terminates the process without attempting to write a stack frame or call KiUserExceptionDispatcher.

Limitations

Our NTDLL patch has a couple of limitations:

  • We are not using a stable interface. KiUserExceptionDispatcher is a Windows-internal interface between the Windows kernel and NTDLL, and it might change in future versions of Windows. We are relying on undocumented behaviour.
  • We don't control what data the Windows kernel writes to the untrusted stack. This data is momentarily visible to other sandboxed threads before the process is terminated, and it will be easily visible in embeddings of NaCl that support shared memory and multiple processes. We don't think the kernel writes any sensitive data in this stack frame, and it seems unlikely that it would do so in the future, but it's still possible.

Possible alternatives

There are a couple of alternatives to patching NTDLL:

  • Using the Windows debugging API: It is possible to catch a fault in sandboxed code before KiUserExceptionDispatcher is run using the Windows debugging API. We use this API for implementing untrusted hardware exception handling.

    However, using this API comes at a cost: if a process has a debugger attached, Windows will temporarily suspend the whole process each time it creates or exits a thread. This would slow down thread creation and teardown.

    In practice, we wouldn't want to completely replace the NTDLL patch with use of the Windows debugging API. For defence-in-depth, we would want to use both.

  • Changing the sandbox model: We could change the sandbox model so that %rsp does not point into untrusted address space. One option is to use a different register as the untrusted stack pointer, and disallow use of CALL, PUSH and POP. Another option would be to have %rsp point to a thread-private stack (outside of the sandbox's usual address space). Both options would be quite invasive changes to the sandbox model.

Conclusion

Windows runs an internal function on the userland stack whenever a hardware exception occurs. This behaviour does not appear to be documented anywhere, which suggests that Windows treats this use of the userland stack as an implementation detail rather than part of the interface. For most applications, this use of the userland stack does not matter. Native Client is unusual in that this stack usage creates a hole in NaCl's sandbox, which we have to work around.

In addition to being undocumented, Windows' use of the userland stack can be surprising to people from a Unix background, given that Unix signal handlers behave differently. However, when one considers the complexity of Windows' exception handling interfaces -- which include Structured Exception Handlers, Vectored Exception Handlers and UnhandledExceptionFilters -- it makes more sense that this complex logic would live in userland.

When it comes to the kernel/userland boundary, what can be an implementation detail for one person can be an important interface detail for another. The lesson is that these details are often critical for the security of a sandbox.

References

The original sandbox hole: issue 1633 (filed in April 2011 and fixed in June 2011)

Breakpad crash reporting support for 64-bit Windows: issue 2095

Untrusted hardware exception handling for 64-bit Windows: issue 2536

Documentation for AddVectoredExceptionHandler() on MSDN

Saturday, 19 November 2011

Stack unwinding risks on 64-bit Windows

Recently, I've been looking at how x86-64 Windows does stack unwinding in 64-bit processes, and I've found some odd behaviour. If the stack unwinder finds a return address on the stack that does not have associated unwind info, it applies a fallback unwind rule that does not make much sense.

I've been wondering if this could make some x86-64 programs more easily exploitable if they corrupt the stack or if they use x86-64 assembly code that does not have unwind info.

In pseudocode, the current unwind logic looks something like this:

void unwind_stack(struct register_state regs) {
  while (???) {
    unwind_info = get_unwind_info(regs.rip);
    if (unwind_info) {
      if (has_exception_handler(unwind_info)) {
        // Run exception handler...
      }
      regs = unwind_stack_frame(unwind_info, regs);
    }
    else {
      // Fallback case for leaf functions:
      regs.rip = *(uint64_t *) regs.rsp;
      regs.rsp += 8;
    }
  }
}

The issue is that the fallback case only makes sense for the first iteration of the loop. The fact that it is applied to later iterations is probably just sloppy programming.

The fallback case is intended to handle "leaf functions". This means functions that:

  1. do not adjust %rsp, and
  2. do not call other functions.

These two properties are related: if a function calls other functions, it must adjust %rsp first, otherwise it does not conform to the x86-64 ABI.

Since the fallback case is applied repeatedly, the unwinder will happily interpret the stack as a series of return addresses with no gaps between them:

     ...
     8 bytes   return address
     8 bytes   return address
     8 bytes   return address
     ...

However, those are not valid stack frames in the x86-64 Windows ABI. A valid stack frame for a non-leaf function Foo() (i.e. a function that calls other functions) looks like this:

  ------------ 16-byte aligned
    32 bytes   "shadow space" (scratch space for Foo())
     8 bytes   return address (points into Foo()'s caller)
     8 bytes   scratch space for Foo()
  16*n bytes   scratch space for Foo() (for some n >= 0)
    32 bytes   "shadow space" (scratch space for Foo()'s callees)
  ------------ 16-byte aligned

This layout comes from two requirements in the Windows x86-64 ABI:

  1. %rsp must be 8 mod 16 on entry to a function. (This means %rsp should be 16-byte aligned before a CALL instruction.) This requirement is common to Windows and Linux.
  2. The caller must also reserve 32 bytes of "shadow space" on the stack above the return address, which the callee can use as scratch space. (The x86-64 ABI on Linux does not have this requirement.)

This means that a function that does this:

  bar:
    call qux
    ret

cannot be valid, because:

  • it does not align the stack correctly on entry to qux();
  • more seriously, it does not allocate shadow space, so the return address that bar()'s RET instruction jumps to could have been overwritten by qux().

Yet the Windows unwinder treats the resulting stack frame as valid.

The upshot of the fallback rule is that when the unwinder reaches a return address that does not have unwind info, it will scan up the stack, examining each 64-bit value in turn, until it finds a value that does have unwind info. The unwinder seems to lack basic sanity checking, so it does not stop even if it hits a zero value (which clearly cannot be a valid return address).

This has a tendency to mask mistakes. If you have an assembly function with incorrect or missing unwind info, the unwinder will tend to scan past uninitialised values or scratch data on the stack and recover at a higher-up stack frame.

The risky part is that the unwinder will be interpreting values as return addresses when they aren't return addresses at all. If the unwinder hits an address whose unwind info has an associated exception handler, the unwinder will jump to the handler. This is extra risky because language-level exceptions (e.g. for C++) and hardware exceptions are handled by the same mechanism on Windows, known as Structured Exception Handling (SEH). Both result in stack unwinding on x86-64. This means that null pointer dereferences trigger stack unwinding. This is unlike Linux, where hardware exceptions don't normally trigger stack unwinding.
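
For example, with the Microsoft compiler, the same __try/__except construct that catches software exceptions will catch a hardware fault, because both go through SEH and stack unwinding. A minimal sketch:

#include <stdio.h>
#include <windows.h>

/* On x86-64 Windows, a hardware fault (here a null pointer write) is
 * dispatched through SEH, so stack unwinding runs and this __except filter
 * and handler get a chance to run. */
int main(void) {
  __try {
    *(volatile int *) 0 = 1;
  } __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                  ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
    printf("caught access violation via SEH\n");
  }
  return 0;
}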

This means that an attacker might be able to do the following:

  • find a function F containing an exception handler that does something interesting;
  • guess the address of the function F (more specifically, the address inside F that's inside F's __try block);
  • find another function G with incorrect or missing unwind info;
  • arrange for the address of F to be in G's stack frame;
  • cause a hardware exception (e.g. a null pointer dereference) to occur while G is on the stack;
  • and therefore cause the exception handler in F to get run.

The x86-64 version of MinGW (a GNU toolchain for Windows) is susceptible to this, because its version of GCC doesn't generate unwind info. Here's an example of how a null pointer dereference can cause an exception handler to be run:

prog.c (compile with x86-64 MinGW's gcc):

#include <stdio.h>
#include <stdlib.h>

long long get_addr();

void exc_handler() {
  printf("in exception handler!\n");
  exit(0);
}

void my_func(long long x) {
  // This might be necessary to force x to be written to the stack:
  // printf("stack=%p\n", &x);

  // This should normally kill the process safely, but it doesn't here:
  *(int *) 0 = 0;
}

int main() {
  my_func(get_addr());
  return 0;
}

get_unwind_ret.asm (assemble with Microsoft's x86-64 assembler):

PUBLIC get_addr
EXTRN exc_handler:PROC

_TEXT SEGMENT

; Helper function for getting the address inside the function below.
get_addr PROC
        lea rax, ret_addr
        ret
get_addr ENDP

; Innocent function that uses an exception handler.
blah PROC FRAME :exc_handler
.endprolog
ret_addr LABEL PTR
        hlt
blah ENDP

_TEXT ENDS

END

Here's what I get on Cygwin:

$ ml64 /c get_unwind_ret.asm
$ x86_64-w64-mingw32-gcc get_unwind_ret.obj prog.c -o prog
$ ./prog
in exception handler!

This works because GCC generates the following code without unwind info:

$ x86_64-w64-mingw32-gcc prog.c -S -o -
...
my_func:
        pushq %rbp
        movq %rsp, %rbp
        movq %rcx, 16(%rbp)
        movl $0, %eax
        movl $0, (%rax)
        leave
        ret
...

my_func()'s argument is written to the shadow space above the return address that points to main(), but since main() doesn't have unwind info either, the spilled argument gets interpreted as a return address by the unwinder.

Some conclusions:

  • Be extra careful if you're writing x86-64 assembly code on Windows.
  • Be careful if you're using GCC to generate x86-64 code on Windows and check whether it is producing unwind info. (There is no problem on x86-32 because Windows does not use stack unwinding for exception handling on x86-32.)
  • Windows' x86-64 stack unwinder should be changed to be stricter. It should stop if it encounters a return address without unwind info.

Thursday, 17 November 2011

ARM cache flushing & doubly-mapped pages

If you're familiar with the ARM architecture you'll probably know that self-modifying code has to be careful to flush the instruction cache on ARM. (Back in the 1990s, the introduction of the StrongARM with its split instruction and data caches broke a lot of programs on RISC OS.)

On ARM Linux there's a syscall, cacheflush, for flushing the instruction cache for a range of virtual addresses.

This syscall works fine if you map some code as RWX (read+write+execute) and execute it from the same virtual address that you use to modify it. This is how JIT compilers usually work.

In the Native Client sandbox, though, for dynamic code loading support, we have code pages that are mapped twice, once as RX (read+execute) and once as RW (read+write). Naively you'd expect that after you write instructions to the RW address, you just have to call cacheflush on the RX address. However, that's not enough. cacheflush doesn't just clear the i-cache. It also flushes the data cache to ensure that writes are committed to memory so that they can be read back by the i-cache. The two parts are clear if you look at the kernel implementation of cacheflush, which does two different MCRs on the address range.

I guess the syscall interface was not designed with doubly-mapped pages in mind, since it doesn't allow the i-cache and d-cache to be flushed separately. For the time being, Native Client will have to call cacheflush on both the RW and RX mappings.
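
In C, that looks roughly like this; __builtin___clear_cache() is GCC's wrapper, which on ARM Linux performs the cacheflush syscall for the given range:

#include <stddef.h>

/* After writing instructions through the RW mapping, flush both mappings. */
void flush_double_mapped_code(char *rw_addr, char *rx_addr, size_t size) {
  /* Flush the RW mapping so the newly written instructions reach memory
   * (the d-cache half of cacheflush). */
  __builtin___clear_cache(rw_addr, rw_addr + size);
  /* Flush the RX mapping so stale instructions are evicted from the
   * i-cache before we execute from it. */
  __builtin___clear_cache(rx_addr, rx_addr + size);
}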

See NaCl issue 2443 for where this came up.