Lacking Rhoticity: native-client

Showing posts with label native-client. Show all posts

Saturday, 29 September 2012

Native Client's NTDLL patch on x86-64 Windows

Last year, I found a security hole in Native Client on 64-bit Windows that could be used to escape from the Native Client sandbox. Fortunately I found the hole before Native Client was enabled by default in Chrome. I recently wrote up the document below to explain the bug and how we fixed it.

Native Client currently relies on a small, process-local, in-memory patch to NTDLL on 64-bit versions of Windows in order to prevent a problem that would create a hole in Native Client's x86-64 sandbox.

The problem

The problem arises because Windows does not have an equivalent of Unix's sigaltstack() system call.

On Unix, a POSIX signal handler may be registered using signal() or sigaction(). When a process faults (e.g. as a result of a memory access error or an illegal instruction), the kernel passes control to the signal handler that is registered for the process. Normally, the signal handler is run on the stack of the thread that faulted. However, if an "alternate signal stack" has been registered using sigaltstack(), the signal handler will run on that stack instead.

On Windows, "vectored exception handlers" are similar to Unix signal handlers, except that a vectored exception handler always gets run on the thread's current stack. This might have been OK, except that Windows has a different division of responsibility between kernel and userland compared with Unix. Windows does more in userland, and this creates problems for NaCl.

On Unix, sigaction() is a syscall that user code calls that records a user-code signal handler function in a kernel data structure. If no signal handler is registered by user code, then when a fault occurs in a process, the kernel never passes control back to userland code in that process.

On Windows, AddVectoredExceptionHandler() is provided in userland by a DLL rather than being provided by the kernel. AddVectoredExceptionHandler() adds the handler function to a list in userland memory. When a fault occurs in the process, the Windows kernel passes control to a function called KiUserExceptionDispatcher in NTDLL, in userland. KiUserExceptionDispatcher checks whether a handler function was registered via AddVectoredExceptionHandler() (or via various other mechanisms), and if so, it calls the handler; otherwise, it terminates the process. The kernel therefore calls KiUserExceptionDispatcher regardless of whether any vectored exception handlers were registered.

This is a problem for NaCl because when we are running sandboxed code, %rsp points into untrusted address space. If sandboxed code faults on a bad memory access or a HLT instruction (which it is allowed to do), Windows calls KiUserExceptionDispatcher, which then runs on the untrusted stack. Since NaCl allows multi-threading, other sandboxed threads can modify the untrusted stack that KiUserExceptionDispatcher is running on.

Furthermore, KiUserExceptionDispatcher is a complex function which calls other NTDLL functions internally. These functions return by using the RET instruction to pop a return address off the stack, which means that control flow is determined by values stored on the untrusted stack. An attacker running as sandboxed code in another thread can modify the memory location containing a return address and so cause the NTDLL routine to jump to an address of the attacker's choice. This allows the attacker to cause a jump to a non-bundle-aligned address, which allows the attacker to escape from the NaCl sandbox.

The fix: patching NTDLL

Our fix is to patch the KiUserExceptionDispatcher entry point inside NTDLL so that this routine will safely terminate the process.

Our patching is process-local: it does not affect the NTDLL.DLL that other processes use. We only modify the in-memory copy of NTDLL.DLL that is loaded into the NaCl process.

Patching is necessary because the address of KiUserExceptionDispatcher is fixed in the Windows kernel at boot time. We don't have any way to tell the kernel to use a routine at a different address. The kernel randomizes the address of NTDLL.DLL at boot time, but thereafter NTDLL.DLL is loaded at the same address in all processes and the address of KiUserExceptionDispatcher is fixed.

What do we do to terminate the process safely? While we could overwrite the start of KiUserExceptionDispatcher with a HLT instruction, this instruction will simply cause the kernel to re-run KiUserExceptionDispatcher again, because the kernel does not track whether a thread is currently handling an exception. This would cause the kernel to repeatedly push a stack frame containing saved register state and re-call KiUserExceptionDispatcher until the thread runs out of stack. Although this will eventually terminate the process, it would be slower than we would like.

So instead, the original version of our patch overwrites the start of KiUserExceptionDispatcher with:

  mov $0, %esp
  hlt

We set the stack pointer to zero first so that the kernel can't write another stack frame. If a fault occurs when the kernel is unable to write a stack frame to memory, we observe that the kernel does not attempt to call KiUserExceptionDispatcher and instead it just terminates the process.

Breakpad crash reporting

We extended the patch to support crash reporting for crashes in trusted code. (In Chrome, this uses the Breakpad crash reporting system.)

The extended version of the patched KiUserExceptionDispatcher looks at a thread-local variable to check whether the fault occurred inside trusted code or sandboxed code. If trusted code faulted, we know we are safely running on a trusted stack, and we pass control back to the original unpatched version of KiUserExceptionDispatcher (which eventually causes the Breakpad crash handler to run and report the crash). Otherwise, we assume %rsp points to an unsafe untrusted stack, and we terminate the process using the same "mov $0, %esp; hlt" sequence as before.

x86-32 Windows

The problem does not arise on 32-bit versions of Windows because of the way NaCl's x86-32 sandbox uses segment registers.

On 32-bit Windows, when sandboxed code faults, the kernel notices that the %ss register contains a non-default value, and it terminates the process without attempting to write a stack frame or call KiUserExceptionDispatcher.

Limitations

Our NTDLL patch has a couple of limitations:

We are not using a stable interface. KiUserExceptionDispatcher is a Windows-internal interface between the Windows kernel and NTDLL, and it might change in future versions of Windows. We are relying on undocumented behaviour.
We don't control what data the Windows kernel writes to the untrusted stack. This data is momentarily visible to other sandboxed threads before the process is terminated, and it will be easily visible in embeddings of NaCl that support shared memory and multiple processes. We don't think the kernel writes any sensitive data in this stack frame, and seems unlikely that it would do so in the future, but it's still possible.

Possible alternatives

There are a couple of alternatives to patching NTDLL:

Using the Windows debugging API: It is possible to catch a fault in sandboxed code before KiUserExceptionDispatcher is run using the Windows debugging API. We use this API for implementing untrusted hardware exception handling.
However, using this API comes at a cost: if a process has a debugger attached, Windows will temporarily suspend the whole process each time it creates or exits a thread. This would slow down thread creation and teardown.
In practice, we wouldn't want to completely replace the NTDLL patch with use of the Windows debugging API. For defence-in-depth, we would want to use both.
Changing the sandbox model: We could change the sandbox model so that %rsp does not point into untrusted address space. One option is to use a different register as the untrusted stack pointer, and disallow use of CALL, PUSH and POP. Another option would be to have %rsp point to a thread-private stack (outside of the sandbox's usual address space). Both options would be quite invasive changes to the sandbox model.

Conclusion

Windows runs an internal function on the userland stack whenever a hardware exception occurs. This behaviour does not appear to be documented anywhere, which suggests that Windows treats this use of the userland stack as an implementation detail rather than part of the interface. For most applications, this use of the userland stack does not matter. Native Client is unusual in that this stack usage creates a hole in NaCl's sandbox, which we have to work around.

In addition to being undocumented, Windows' use of the userland stack can be surprising to people from a Unix background, given that Unix signal handlers behave differently. However, when one considers the complexity of Windows' exception handling interfaces -- which include Structured Exception Handlers, Vectored Exception Handlers and UnhandledExceptionFilters -- it makes more sense that this complex logic would live in userland.

When it comes to the kernel/userland boundary, what can be an implementation detail for one person can be an important interface detail for another. The lesson is that these details are often critical for the security of a sandbox.

References

The original sandbox hole: issue 1633 (filed in April 2011 and fixed in June 2011)

Breakpad crash reporting support for 64-bit Windows: issue 2095

Untrusted hardware exception handling for 64-bit Windows: issue 2536

Documentation for AddVectoredExceptionHandler() on MSDN

Sunday, 14 June 2009

Python standard library in Native Client

The Python standard library now works under Native Client in the web browser, including the Sqlite extension module.

By that I mean that it is possible to import modules from the standard library, but a lot of system calls won't be available. Sqlite works for in-memory use.

Here's a screenshot of the browser-based Python REPL:

The changes needed to make this work were quite minor:

Implementing stat() in glibc so that it does an RPC to the Javascript running on the web page. Python uses stat() when searching for modules in sys.path.
Changing fstat() in NaCl so that it doesn't return a fixed inode number. Python uses st_ino to determine whether an extension module is the same as a previously-loaded module, which doesn't work if st_ino is the same for different files.
Some tweaks to nacl-glibc-gcc, a wrapper script for nacl-gcc.
Ensuring that NaCl-glibc's libpthread.so works.

I didn't have to modify Python or Sqlite at all. I didn't even have to work out what arguments to pass to Sqlite's configure script - the Debian package for Sqlite builds without modification. It's just a matter of setting PATH to override gcc to run nacl-glibc-gcc.

Monday, 4 May 2009

Progress on Native Client

Back in January I wrote that I was porting glibc to Native Client. I have made some good progress since then.

The port is at the stage where it can run simple statically-linked and dynamically-linked executables, from the command line and from the web browser.

In particular, Python works. I have put together a simple interactive top-level that demonstrates running Python from the web browser:

Upstream NaCl doesn't support any filename-based calls such as open(), but we do. In this setup, open() does not of course access the real filesystem on the host system. open() in NaCl-glibc sends a message to the Javascript code on the web page. The Javascript code can fetch the requested file, create a file descriptor for the file using the plugin's API, and send the file descriptor to the NaCl subprocess in reply.

This has not involved modifying Python at all, although I have added an extension module to wrap a couple of NaCl-specific system calls (imc_sendmsg() and imc_recvmsg()). The Python build runs to completion, including the parts where it runs the executable it builds. A simple instruction-rewriting trick means that dynamically-linked NaCl executables can run outside of NaCl, directly under Linux, linked against the regular Linux glibc. This means we can avoid some of the problems associated with cross-compilation when building for NaCl.

This work has involved extending Native Client:

Adding support for dynamic loading of code. Initially I have focused on just making it work. Now I'll focus on ensuring this is secure.
Writing a custom browser plugin for NaCl which allows Javascript code to handle asynchronous callbacks from NaCl subprocesses.
Making various changes to the NaCl versions of gcc and binutils to support dynamically linked code.

Hopefully these changes can get merged upstream eventually. Some of the toolchain changes have already gone in.

See the web page for instructions on how to get the code and try it out.

Sunday, 11 January 2009

On ABI and API compatibility

If you are going to create a new execution environment, such as a new OS, it can make things simpler if it presents the same interfaces as an existing system. It makes porting easier. So,

If you can, keep the ABI the same.
If you can't keep the ABI the same, at least keep the API the same.

Don't be tempted to say "we're changing X; we may as well take this opportunity to change Y, which has always bugged me". Only change things if there is a good reason.

For an example, let's look at the case of GNU Hurd.

In principle, the Hurd's glibc could present the same ABI as Linux's glibc (they share the same codebase, after all), but partly because of a different in threading libraries, they were made incompatible. Unifying the ABIs was planned, but it appears that 10 years later it has not happened (Hurd has a libc0.3 package instead of libc6).
Using the same ABI would have meant that the same executables would work on Linux and the Hurd. Debian would not have needed to rebuild all its packages for a separate "hurd-i386" architecture. It would have saved a lot of effort.
I suspect that if glibc were ported to the Hurd today, it would not be hard to make the ABIs the same. The threading code has changed a lot in the intervening time. I think it is cleaner now.
The Hurd's glibc also changed the API: they decided not to define PATH_MAX. The idea was that if there was a program that used fixed-length buffers for storing filenames, you'd be forced to fix it. Well, that wasn't a good idea. It just created unnecessary work. Hurd developers and users had enough on their plates without having to fix unrelated implementation quality issues in programs they wanted to use.

Similarly, it would help if glibc for NaCl (Native Client) could have the same ABI as glibc for Linux. Programs have to be recompiled for NaCl with nacl-gcc to meet the requirements of NaCl's validator, but could the resulting code still run directly on Linux, linked with the regular i386 Linux glibc and other libraries? The problem here is that code compiled for NaCl will expect all code addresses to be 32-byte-aligned. If the NaCl code is given a function address or a return address which is not 32-byte-aligned, things will go badly wrong when it tries to jump to it. Some background: In normal x86 code, to return from a function you write this:

ret

This pops an address off the stack and jumps to it. In NaCl, this becomes:

        popl %ecx
        and $0xffffffe0, %ecx
        jmp *%ecx

This pops an address off the stack, rounds it down to the nearest 32 byte boundary and jumps to it. If the calling function's call instruction was not placed at the end of a 32 byte block (which NaCl's assembler will arrange), the return address will not be aligned and this code will jump to the wrong location.

However, there is a way around this. We can get the NaCl assembler and linker to keep a list of all the places where a forcible alignment instruction (the and $0xffffffe0, %ecx above) was inserted, and put this list into the executable or library in a special section or segment. Then when we want to run the executable or library directly on Linux, we can rewrite all these locations so that the sequence above becomes

        popl %ecx
        nop
        nop
        jmp *%ecx

or maybe even just

        ret
        nop
        nop
        nop
        nop
        nop

We can reuse the relocations mechanism to store these rewrites. The crafty old linker already does something similar for thread-local variable accesses. When it knows that a thread-local variable is being accessed from the library where it is defined, it can rewrite the general-purpose-but-slow instruction sequence for TLS variable access into a faster instruction sequence. The general purpose instruction sequence even contains nops to allow for rewriting to the slightly-longer fast sequence.

This arrangement for running NaCl-compiled code could significantly simplify the process of building and testing code when porting it to NaCl. It can help us avoid the difficulties associated with cross-compiling.

Sunday, 4 January 2009

What does NaCl mean for Plash?

Google's Native Client (NaCl), announced last month, is an ingenious hack to get around the problem that existing OSes don't provide adequate security mechanisms for sandboxing native code.

You can look at NaCl as an interesting combination of OS-based and language-based security mechanisms:

NaCl uses a code verifier to prevent use of unsafe instructions such as those that perform system calls. This is not a million miles away from programming language subsets like Cajita and Joe-E, except that it operates at the level of x86 instructions rather than source code.
Since x86 instructions are variable-length and unaligned, NaCl has to stop you from jumping into an unsafe instruction hidden in the middle of a safe instruction. It does that by requiring that all indirect jumps are jumps to the start of 32-byte-aligned blocks; instructions are not allowed to straddle these blocks.
It uses the x86 architecture's little-used segmentation feature to limit memory accesses to a range of address space. So the processor is doing bounds checking for free.
Actually, segmentation has been used before - in L4 and EROS's "small spaces" facility for switching between processes with small address spaces without flushing the TLB. NaCl gets the same benefit: switching between trusted and untrusted code should be fast; faster than trapping system calls with ptrace(), for example.

Plash is also a hack to get sandboxing, specifically on Linux, but it has some limitations:

it doesn't block network access;
it doesn't limit CPU and memory resource usage, so sandboxed programs can still cause denial of service;
it requires a custom glibc, which can be a pain to build;
it changes the API/ABI that sandboxed programs see in some small but significant ways:
- some syscalls are effectively disabled; programs must go through libc for these calls, which stops statically linked programs from working;
- /proc/self doesn't work, and Plash's architecture makes it hard to emulate /proc.

Interface changes mean some programs require patching, e.g. to not depend on /proc. If there were more people behind Plash, these interface changes wouldn't be a big problem. These problems can be addressed, with work. But Plash hasn't really caught on, so the manpower isn't there.

NaCl also breaks the ABI - it breaks it totally. Code must be recompiled. However, NaCl provides bigger benefits in return. It allows programs to be deployed in new contexts: on Windows; in a web browser. It is more secure than Plash, because it can block network access and limit the amount of memory a process can allocate. Also, because NaCl mediates access more completely, it would be easier to emulate interfaces like /proc.

NaCl isn't only useful as a browser plugin. We could use it as a general purpose OS security mechanism. We could have GNU/Linux programs running on Windows (without the Linux bit).

Currently NaCl does not support all the features you'd need in a modern OS. In particular, dynamic linking. NaCl doesn't yet support loading code beyond an initial statically linked ELF executable. But we can add this. I am making a start at porting glibc, along with its dynamic linker. After all, I have ported glibc once before!

Lacking Rhoticity