Wednesday 7 August 2013

Handling crashes on Mac OS X: ordering of Mach exceptions versus POSIX signals

Mac OS X is a curious operating system because its kernel is derived from two kernel codebases -- the Mach kernel and a BSD kernel -- that have been glued together.

From these two ancestors, OS X inherits two different mechanisms for processes to handle hardware exceptions (a.k.a. faults):

  • POSIX signals:
    • In-process only.
    • Registered per-process.
    • The handler is always called on the thread that produced the fault.
  • Mach exceptions:
    • Allow both in-process and out-of-process handling.
    • Can be registered per-thread and per-process.
    • The handler is invoked via Mach RPC.

On OS X, it's possible for a process to have both POSIX signal handlers and Mach exception handlers registered.

It's not immediately obvious which of the two handlers will take priority and get invoked first, but experimentation shows that it's the Mach handler. If a process faults with a memory access error, the Mach exception handler for EXC_BAD_ACCESS gets invoked first, and if this handler returns KERN_FAILURE, the POSIX signal handler for SIGBUS will then be invoked.

However, that's not the full story.

OS X has a built-in Crash Reporter service which will create a crash dump if an application crashes and pop up a dialog box. Crash Reporter works via Mach exception handling and runs out-of-process. An application will normally have Crash Reporter's crash handler registered as one of its default Mach exception handlers. But what stops this handler from interfering with normal use of POSIX signals, if Mach exception handlers take priority over POSIX signals?

The answer is that OS X's kernel has three steps for handling a hardware exception:

  1. First-chance Mach exception handler: The kernel tries to invoke the Mach exception handler registered for the type of fault that occurred, e.g. EXC_BAD_ACCESS. If there's no handler registered, it skips this step. If the handler returns KERN_SUCCESS, the kernel resumes the thread that faulted.

  2. POSIX signal handler: If the Mach exception handler was absent or returned KERN_FAILURE, the kernel tries to invoke the POSIX signal handler for the type of fault that occurred, e.g. SIGBUS.

  3. Second-chance Mach exception handler: If invoking the SIGBUS handler fails (either because no SIGBUS handler is registered, or because writing the signal frame failed, or because the signal is blocked by the thread's signal mask), then the kernel starts to terminate the process, and it tries to invoke the Mach exception handler registered for EXC_CRASH.

    Crash Reporter's handler is registered via EXC_CRASH.

    Since EXC_CRASH is generated on the code path for process termination, EXC_CRASH cannot be handled in-process. At this point, the process has been partially shut down. Its threads are not running any more, but some of its state has not been destroyed yet and so can be read by a crash handler.

    If you try to handle EXC_CRASH in-process, the process will just hang. That's presumably because the Mach RPC for invoking the EXC_CRASH handler gets sent, and the thread that would receive it has been suspended, but the RPC doesn't fail because the process's Mach ports have not been freed yet.

    abnormal_exit_notify() in osfmk/kern/exception.c is what implements sending EXC_CRASH.

    Core dumps are handled by the same code path for abnormal process termination.

EXC_CRASH seems to have been added in OS X 10.5 ("Leopard"):

"On Tiger and earlier, the CrashReporter would generate a crash report for a Mach exception even if it was later delivered as a signal and handled by the process. Since Leopard, though, a crash report is only generated if the signal is not handled. So, handling signals is sufficient to suppress crash reports these days." (Ken Thomases, September 2009)

Chromium has some code to do just that. Its DisableOSCrashDumps() function will register a signal handler in order to suppress Crash Reporter's crash reports. Its signal handler calls _exit() to prevent Crash Reporter's EXC_CRASH handler from kicking in. Another way to do this would have been to set the process's exception port for EXC_CRASH to MACH_PORT_NULL.