Saturday 19 November 2011

Stack unwinding risks on 64-bit Windows

Recently, I've been looking at how x86-64 Windows does stack unwinding in 64-bit processes, and I've found some odd behaviour. If the stack unwinder finds a return address on the stack that does not have associated unwind info, it applies a fallback unwind rule that does not make much sense.

I've been wondering if this could make some x86-64 programs more easily exploitable if they corrupt the stack or if they use x86-64 assembly code that does not have unwind info.

In pseudocode, the current unwind logic looks something like this:

void unwind_stack(struct register_state regs) {
  while (???) {
    unwind_info = get_unwind_info(regs.rip);
    if (unwind_info) {
      if (has_exception_handler(unwind_info) {
        // Run exception handler...
      }
      regs = unwind_stack_frame(unwind_info, regs);
    }
    else {
      // Fallback case for leaf functions:
      regs.rip = *(uint64_t *) regs.rsp;
      regs.rsp += 8;
    }
  }
}

The issue is that the fallback case only makes sense for the first iteration of the loop. The fact that it is applied to later iterations is probably just sloppy programming.

The fallback case is intended to handle "leaf functions". This means functions that:

  1. do not adjust %rsp, and
  2. do not call other functions.

These two properties are related: if a function calls other functions, it must adjust %rsp first, otherwise it does not conform to the x86-64 ABI.

Since the fallback case is applied repeatedly, the unwinder will happily interpret the stack as a series of return addresses with no gaps between them:

     ...
     8 bytes   return address
     8 bytes   return address
     8 bytes   return address
     ...

However, those are not valid stack frames in the x86-64 Windows ABI. A valid stack frame for a non-leaf function Foo() (i.e. a function that calls other functions) looks like this:

  ------------ 16-byte aligned
    32 bytes   "shadow space" (scratch space for Foo())
     8 bytes   return address (points into Foo()'s caller)
     8 bytes   scratch space for Foo()
  16*n bytes   scratch space for Foo() (for some n >= 0)
    32 bytes   "shadow space" (scratch space for Foo()'s callees)
  ------------ 16-byte aligned

This layout comes from two requirements in the Windows x86-64 ABI:

  1. %rsp must be 8mod16 on entry to a function. (This means %rsp should be 16-byte aligned before a CALL instruction.) This requirement is common to Windows and Linux.
  2. The caller must also reserve 32 bytes of "shadow space" on the stack above the return address, which the callee can use as scratch space. (The x86-64 ABI on Linux does not have this requirement.)

This means that a function that does this:

  bar:
    call qux
    ret
cannot be valid, because:
  • it does not align the stack correctly on entry to qux();
  • more seriously, it does not allocate shadow space, so the return address that bar()'s RET instruction jumps to could have been overwritten by qux().

Yet the Windows unwinder treats the resulting stack frame as valid.

The upshot of the the fallback rule is that when the unwinder reaches a return address that does not have unwind info, it will scan up the stack looking for a value that does, looking at each 64-bit value in turn. The unwinder seems to lack basic sanity checking, so it does not stop even if it hits a zero value (which clearly cannot be a valid return address).

This has a tendency to mask mistakes. If you have an assembly function with incorrect or missing unwind info, the unwinder will tend to scan past uninitialised values or scratch data on the stack and recover at a higher-up stack frame.

The risky part is that the unwinder will be interpreting values as return addresses when they aren't return addresses at all. If the unwinder hits an address whose unwind info has an associated exception handler, the unwinder will jump to the handler. This is extra risky because language-level exceptions (e.g. for C++) and hardware exceptions are handled by the same mechanism on Windows, known as Structured Exception Handling (SEH). Both result in stack unwinding on x86-64. This means that null pointer dereferences trigger stack unwinding. This is unlike Linux, where hardware exceptions don't normally trigger stack unwinding.

This means that an attacker might be able to do the following:

  • find a function F containing an exception handler that does something interesting;
  • guess the address of the function F (more specifically, the address inside F that's inside F's __try block);
  • find another function G with incorrect or missing unwind info;
  • arrange for the address of F to be in G's stack frame;
  • cause a hardware exception (e.g. a null pointer dereference) to occur while G is on the stack;
  • and therefore cause the exception handler in F to get run.

The x86-64 version of MinGW (a GNU toolchain for Windows) is susceptible to this, because its version of GCC doesn't generate unwind info. Here's an example of how a null pointer dereference can cause an exception handler to be run:

prog.c (compile with x86-64 MinGW's gcc):

#include <stdio.h>
#include <stdlib.h>

long long get_addr();

void exc_handler() {
  printf("in exception handler!\n");
  exit(0);
}

void my_func(long long x) {
  // This might be necessary to force x to be written to the stack:
  // printf("stack=%p\n", &x);

  // This should normally kill the process safely, but it doesn't here:
  *(int *) 0 = 0;
}

int main() {
  my_func(get_addr());
  return 0;
}

get_unwind_ret.asm (assemble with Microsoft's x86-64 assembler):

PUBLIC get_addr
EXTRN exc_handler:PROC

_TEXT SEGMENT

; Helper function for getting the address inside the function below.
get_addr PROC
        lea rax, ret_addr
        ret
get_addr ENDP

; Innocent function that uses an exception handler.
blah PROC FRAME :exc_handler
.endprolog
ret_addr LABEL PTR
        hlt
blah ENDP

_TEXT ENDS

END

Here's what I get on Cygwin:

$ ml64 /c get_unwind_ret.asm
$ x86_64-w64-mingw32-gcc get_unwind_ret.obj prog.c -o prog
$ ./prog
in exception handler!

This works because GCC generates the following code without unwind info:

$ x86_64-w64-mingw32-gcc prog.c -S -o -
...
my_func:
        pushq %rbp
        movq %rsp, %rbp
        movq %rcx, 16(%rbp)
        movl $0, %eax
        movl $0, (%rax)
        leave
        ret
...

my_func()'s argument is written to the shadow space above the return address that points to main(), but since main() doesn't have unwind info either, the spilled argument gets interpreted as a return address by the unwinder.

Some conclusions:

  • Be extra careful if you're writing x86-64 assembly code on Windows.
  • Be careful if you're using GCC to generate x86-64 code on Windows and check whether it is producing unwind info. (There is no problem on x86-32 because Windows does not use stack unwinding for exception handling on x86-32.)
  • Windows' x86-64 stack unwinder should be changed to be stricter. It should stop if it encounters a return address without unwind info.

Thursday 17 November 2011

ARM cache flushing & doubly-mapped pages

If you're familiar with the ARM architecture you'll probably know that self-modifying code has to be careful to flush the instruction cache on ARM. (Back in the 1990s, the introduction of the StrongARM with its split instruction and data caches broke a lot of programs on RISC OS.)

On ARM Linux there's a syscall, cacheflush, for flushing the instruction cache for a range of virtual addresses.

This syscall works fine if you map some code as RWX (read+write+execute) and execute it from the same virtual address that you use to modify it. This is how JIT compilers usually work.

In the Native Client sandbox, though, for dynamic code loading support, we have code pages that are mapped twice, once as RX (read+execute) and once as RW (read+write). Naively you'd expect that after you write instructions to the RW address, you just have to call cacheflush on the RX address. However, that's not enough. cacheflush doesn't just clear the i-cache. It also flushes the data cache to ensure that writes are committed to memory so that they can be read back by the i-cache. The two parts are clear if you look at the kernel implementation of cacheflush, which does two different MCRs on the address range.

I guess the syscall interface was not designed with double-mapped pages in mind, since it doesn't allow the i-cache and d-cache to be flushed separately. For the time being, Native Client will have to call cacheflush on both the RW and RX mappings.

See NaCl issue 2443 for where this came up.