Lacking Rhoticity -- Mark Seaborn

<h2>Observing interrupts from userland on x86 (2018-01-08)</h2>
<p>In 2016, I noticed a quirk of the x86 architecture that leads to an interesting side channel. On x86, it is possible for a userland process to detect when it has been interrupted by an interrupt handler, without resorting to timing. This is because the usual mechanism for handling interrupts (without using virtualisation) doesn't always preserve all userland registers across an interrupt handler.</p>
<p>If a process sets a segment selector register such as %fs or %gs to 1, the register will get set to 0 when the process gets interrupted by an interrupt. Specifically, the x86 IRET instruction will reset the register to 0 when returning to userland -- this is the instruction that kernels use for returning from an interrupt handler.</p>
<p>I have not seen this quirk explicitly documented anywhere, so I thought it was worthwhile documenting it via this blog post.</p>
<p>The following C program demonstrates the effect:</p>
<pre>#include <stdint.h>
#include <stdio.h>

void set_gs(uint16_t value) {
  __asm__ volatile("mov %0, %%gs" : : "r"(value));
}

uint16_t get_gs() {
  uint16_t value;
  __asm__ volatile("mov %%gs, %0" : "=r"(value));
  return value;
}

int main() {
  uint16_t orig_gs = get_gs();
  set_gs(1);
  unsigned int count = 0;
  /* Loop until %gs gets reset by an interrupt handler. */
  while (get_gs() == 1)
    ++count;
  /* Restore the register so as not to break TLS on x86-32 Linux. This is
     not necessary on x86-64 Linux, which uses %fs for TLS. */
  set_gs(orig_gs);
  printf("%%gs was reset after %u iterations\n", count);
  return 0;
}</pre>
<p>This works on x86-32 or x86-64. I tested it on Linux. It will print a non-deterministic number of iterations. For example:</p>
<pre>%gs was reset after 1807364 iterations</pre>
<h3>Why this happens</h3>
<ol>
<li>
<p>x86 segment registers are a bit weird, because each one has two parts:</p>
<ul>
<li>A program-visible 16-bit "segment selector" value, which can be read and written by the MOV instruction.</li>
<li>A hidden part. When you write to a segment register using the MOV instruction, the CPU also fills out the hidden part. The hidden part includes a field called DPL (Descriptor Privilege Level).</li>
</ul>
</li>
<li>
<p>The specification for the IRET instruction contains a rule which says that:</p>
<ul>
<li>If we are switching to a less-privileged privilege level, such as from the kernel (ring 0) to userland (ring 3),</li>
<li>and if a segment register's hidden DPL field contains a value specifying a more-privileged privilege level than what we're switching to,</li>
<li>then the segment register should be reset to 0 (as if the segment selector value 0 was written to the register).</li>
</ul>
</li>
<li>
<p>The segment selector values 0, 1, 2 and 3 are special -- they are "null segment selector" values.</p>
</li>
<li>
<p>When you write a null segment selector value to a segment register using MOV, the CPU apparently writes 0 into the segment register's hidden DPL field. (0 is the most-privileged privilege level.)</p>
<p>I say "apparently" because this does not appear to be explicitly specified in the Intel or AMD docs -- and since this part of the register is hidden, it is hard to check what it contains.</p>
</li>
</ol>
<p>So, if you set %gs = 1, the program-visible part of %gs will contain the value 1 (i.e. reading %gs will return 1), but %gs's DPL field will have been set to 0. That will cause IRET to set %gs = 0 when returning from the kernel to userland.</p>
<h3>Documentation for IRET</h3>
<p>Intel's documentation for the IRET instruction describes this behaviour via the following piece of pseudocode:</p>
<pre style="white-space: pre-wrap;">RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each of segment register (ES, FS, GS, and DS)
DO
IF segment register points to data or non-conforming code segment
and CPL > segment descriptor DPL (* Stored in hidden part of segment register *)
THEN (* Segment register invalid *)
SegmentSelector ← 0; (* NULL segment selector *)
FI;
OD;</pre>
<p>(From the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, dated June 2013.)</p>
<p>After writing the explanation above, I checked the current version of Intel's documentation (dated December 2017). I found that the pseudocode has changed slightly. It now contains an explicit "SegmentSelector == NULL" check:</p>
<pre style="white-space: pre-wrap;">RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each SegReg in (ES, FS, GS, and DS)
DO
tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
IF (SegmentSelector == NULL) OR (tempDesc(DPL) < CPL AND tempDesc(Type) is (data or non-conforming code)))
THEN (* Segment register invalid *)
SegmentSelector ← 0; (*Segment selector becomes null*)
FI;
OD;
</pre>
<p>(From the <a href="https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf">Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, December 2017</a>.)</p>
<p>The "SegmentSelector == NULL" check in the new version is ambiguous because there are multiple selector values that are considered to be null. Does this mean "SegmentSelector == 0" or "SegmentSelector in (0,1,2,3)"? If it's the latter, this rule would apply when IRET returns to kernel mode, not just to user mode.</p>
<p>AMD's documentation for IRET contains similar pseudocode:</p>
<pre style="white-space: pre-wrap;">IF (changing CPL)
{
FOR (seg = ES, DS, FS, GS)
IF ((seg.attr.dpl < CPL) && ((seg.attr.type = ’data’)
|| (seg.attr.type = ’non-conforming-code’)))
{
seg = NULL // can’t use lower dpl data segment at higher cpl
}
}</pre>
<p>(From the <a href="https://support.amd.com/TechDocs/24594.pdf">AMD64 Architecture Programmer's Manual, Volume 3, Revision 3.25, December 2017</a>.)</p>
<p>Note that not all versions of QEMU implement this faithfully in their emulation of x86.</p>
<h3>Possible uses</h3>
<p>It is possible to detect interrupts without using this method, by using a high-resolution timer (such as x86's RDTSC instruction). However:</p>
<ul>
<li>High-resolution timers are sometimes disabled, sometimes out of concern about side channel attacks. In particular, it is possible to disable the RDTSC instruction.</li>
<li>Even if a process uses a timer to detect delays in its execution, this doesn't tell it the cause of a delay for certain. In contrast, checking %fs/%gs will indicate whether a delay was accompanied by an IRET.</li>
</ul>
<p>So, this technique lets us classify delays.</p>
<p>This could be useful for microbenchmarking. If we're repeatedly running an operation and measuring how long it takes, we can throw out the runs that got interrupted, and thereby reduce the noise in our timing data -- an alternative to using statistical techniques for removing outliers. However, this technique won't catch interrupts that are handled while executing syscalls, because an IRET that returns to kernel mode won't reset the segment registers.</p>
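<p>As an illustration of the microbenchmarking idea, here is a rough sketch (mine, not from the original post) of a timing loop that discards samples during which an IRET occurred. The <tt>operation()</tt> function is a stand-in for whatever is being benchmarked:</p>
<pre>#include <stdint.h>
#include <stdio.h>

static void operation(void) { /* code under test goes here */ }

static void set_gs(uint16_t value) {
  __asm__ volatile("mov %0, %%gs" : : "r"(value));
}

static uint16_t get_gs(void) {
  uint16_t value;
  __asm__ volatile("mov %%gs, %0" : "=r"(value));
  return value;
}

static uint64_t rdtsc(void) {
  uint32_t lo, hi;
  __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t) hi << 32) | lo;
}

int main(void) {
  uint16_t orig_gs = get_gs();
  for (int i = 0; i < 1000; i++) {
    set_gs(1);
    uint64_t start = rdtsc();
    operation();
    uint64_t elapsed = rdtsc() - start;
    if (get_gs() == 1) {
      /* No IRET happened during this run: keep the sample. */
      printf("%llu\n", (unsigned long long) elapsed);
    }
    /* Otherwise the run was interrupted: discard it. */
  }
  set_gs(orig_gs);
  return 0;
}</pre>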
<p>This could be useful if we're trying to read from a side channel by timing memory accesses. We can throw out accesses that got interrupted. Again, this is a way of reducing noise. However, if interrupts occur infrequently, it's possible that they don't cause enough noise for the extra work of using this technique to be worthwhile.</p>
<h3>Mitigation via virtualisation</h3>
<p>This issue does not occur if the userland process is running in a hardware-virtualised VM and the process is interrupted by an interrupt which is handled outside the VM. I tested this using KVM on Linux. In this case, the interrupt will cause a VM exit to occur; the hypervisor will handle the interrupt; and the hypervisor will perform a VM entry (using the VMLAUNCH or VMRESUME instruction, rather than IRET) to return back to the virtualised userland process.</p>
<p>It appears that the VM exit/VM entry pair of operations will -- unlike IRET -- save and restore all of the segment register state, including the hidden state.</p>
<h2>PassMark received offer to not release rowhammer test (2015-10-20)</h2>
<p>
Here's an interesting report of skulduggery related to the <a href="https://en.wikipedia.org/wiki/Row_hammer">rowhammer bug</a>.
<p>
<a href="http://www.passmark.com/forum/showthread.php?5077-How-to-relate-to-errors-in-Hammer-Test-13">PassMark say they received an offer to not release a rowhammer test</a> in their MemTest86 tool, in return for payment:
<blockquote>
<p>
"We had anonymous contact offering to act as a go between between us and unnamed memory companies, with a view to paying us not release the new version of MemTest86. Who knows how serious the offer was.
<p>
Needless to say we didn't take up that option, and just released the software anyway.
<p>
But the issue is a BIG issue. The lack of publicity up to now is somewhat surprising considering the implications. Many computers are fundamentally (slightly) unreliable in a random ways. Maybe this doesn't matter for home use, but for medical devices, banking systems, flight control systems, etc.. it is a big deal."
</blockquote>
<p>
The quoted post is from 20th February 2015 – after the rowhammer bug was <a href="https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf">publicised by the CMU paper</a> but before we published about the <a href="http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">exploitability of the bug</a>.
<p>
Some background: <a href="https://www.passmark.com/">PassMark</a> are the maintainers of <a href="http://www.memtest86.com/">MemTest86</a>. (MemTest86 should not be confused with <a href="http://www.memtest.org/">MemTest86+</a> which is an alternative, open source fork of the <a href="https://en.wikipedia.org/wiki/Memtest86">same original codebase</a>.)
<h2>Passing FDs/handles between processes on Unix and Windows -- a comparison (2015-05-24)</h2>
<p>
Handles on Windows are analogous to file descriptors (FDs) on Unix, and both can be passed between processes. However, the way in which handles/FDs can be passed between processes is quite different on Unix and Windows.
<p>
In this blog post I'll explain the difference. You might find this useful if you are familiar with systems programming on either Unix or Windows but not both.
<h3>Similarities</h3>
<p>
I'll first explain what's the same on Unix and Windows. Both OSes have a distinction between FD/handle <em>numbers</em> and FD/handle <em>objects</em>.
<p>
On both Windows and Unix, each process has its own FD/handle <em>table</em> which maps from FD/handle <em>numbers</em> to FD/handle <em>objects</em>:
<ul>
<li><p>
<strong>FD numbers</strong> are indexes into the FD table. On Unix, an FD number is an <tt>int</tt>.
<p>Windows uses the <tt>HANDLE</tt> type for handle numbers. Though <tt>HANDLE</tt> is <tt>typedef</tt>'d to <tt>void *</tt>, a <tt>HANDLE</tt> is really just a 32-bit index. (Windows does not use a <tt>HANDLE</tt> as a pointer into the process's address space.)
<li><p>
<strong>FD objects</strong> are what FD numbers map to. User code never gets to see FD objects directly: it can only manipulate them via FD numbers. Multiple FD numbers can map to the same FD object. FD objects are (mostly) reference counted.
<p>All of this applies to handles on Windows too.
</ul>
<p>
Note: Some people use the alternative terminology that a "<em>file description</em>" refers to the underlying object while "<em>file descriptor</em>" refers to the number. I prefer to add "number" or "object" as a suffix to disambiguate -- it is more explicit, and the term "file description" is not often used.
<h3>Differences</h3>
<p>
The key difference between Unix and Windows is this:
<blockquote>
On Unix, FD <em>objects</em> can be sent via sockets in messages. On Windows, handle <em>objects</em> cannot be sent in messages; only handle <em>numbers</em> can.
</blockquote>
<p>
Windows fills this gap by allowing one process to read or modify another process's handle table synchronously using the <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms724251%28v=vs.85%29.aspx"><tt>DuplicateHandle()</tt></a> API. Using this API involves one process dealing with another process's handle numbers.
<p>
In contrast, Unix has no equivalent to <tt>DuplicateHandle()</tt>. A Unix process's FD table is private to the process. Consequently, on Unix it is much rarer for a process to have dealings with another process's FD numbers.
<p>
On Windows, to send a handle to another process, the sender will generally call two system calls:
<ul>
<li><p>Firstly, the sender must call <tt>DuplicateHandle()</tt> to copy the handle to the destination process. This requires the sender to have a process handle for the destination process. <tt>DuplicateHandle()</tt> will return a handle number indexing into the destination process's handle table.
<li><p>Secondly, the sender must communicate the handle number to the destination process, e.g. by sending a message containing the number via a pipe using <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa365747%28v=vs.85%29.aspx"><tt>WriteFile()</tt></a>.
</ul>
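<p>As a sketch, the sender's side of this two-step sequence might look like the following (my illustration: the function name and error handling policy are made up):</p>
<pre>#include <windows.h>

/* Send source_handle to the process identified by dest_process (a
   process handle), communicating the resulting handle number over an
   already-established pipe. */
BOOL send_handle(HANDLE pipe, HANDLE dest_process, HANDLE source_handle) {
  HANDLE handle_in_dest;
  DWORD written;
  /* Step 1: copy the handle into the destination's handle table. */
  if (!DuplicateHandle(GetCurrentProcess(), source_handle,
                       dest_process, &handle_in_dest,
                       0, FALSE, DUPLICATE_SAME_ACCESS))
    return FALSE;
  /* Step 2: tell the destination which handle number it received.
     handle_in_dest is an index into the destination's handle table,
     not ours. */
  return WriteFile(pipe, &handle_in_dest, sizeof(handle_in_dest),
                   &written, NULL);
}</pre>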
<p>
On Unix, to send a handle to another process, the sender will call just one system call, <a href="http://man7.org/linux/man-pages/man2/sendmsg.2.html"><tt>sendmsg()</tt></a>, for sending a message via a Unix domain socket:
<ul>
<li><p>The sender's call to <tt>sendmsg()</tt> specifies an FD <em>number</em> to send via the <a href="http://man7.org/linux/man-pages/man3/cmsg.3.html"><tt>cmsg</tt>/<tt>SCM_RIGHTS</tt></a> interface. The kernel translates this to an FD <em>object</em>, and copies the reference to this object into the socket's in-kernel message buffer.
<li><p>The receiver can call <a href="http://man7.org/linux/man-pages/man2/recvmsg.2.html"><tt>recvmsg()</tt></a> to receive the message. <tt>recvmsg()</tt> will remove the FD <em>object</em> from the socket's message buffer and add the FD object to the calling process's FD table, allocating a new FD <em>number</em> within that table.
</ul>
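<p>The sender's side on Unix, as a comparable sketch (again, the helper's name is made up and error handling is omitted):</p>
<pre>#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send FD number "fd" over the Unix domain socket "sock" using
   cmsg/SCM_RIGHTS. */
int send_fd(int sock, int fd) {
  char dummy = 'x';
  struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
  char cmsg_buf[CMSG_SPACE(sizeof(int))];
  struct msghdr msg = {
    .msg_iov = &iov,
    .msg_iovlen = 1,
    .msg_control = cmsg_buf,
    .msg_controllen = sizeof(cmsg_buf),
  };
  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  /* Here the kernel translates the FD number into a reference to the
     underlying FD object as it queues the message. */
  return sendmsg(sock, &msg, 0);
}</pre>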
<h3>Implications</h3>
<ul>
<li><p>
<strong>Cycles and GC:</strong>
On Unix, it is possible to create a reference cycle from a socket to itself, via the socket's own message buffer. This means it is possible that a socket FD object is only referenced by its own message buffer. Reference counting alone wouldn't be enough to free an unreachable socket such as that. Linux therefore has an in-kernel garbage collector (GC) that handles freeing unused socket FD objects when there are reference cycles. (A sketch of creating such a cycle appears after this list.)
<p>
In contrast, Windows doesn't need a GC for handling cycles between handles.
<li><p>
<strong>Process handles:</strong>
Windows has a notion of process handles, which are used by the <tt>DuplicateHandle()</tt> API. Unix does not have an equivalent concept of process FDs as standard, although FreeBSD has <a href="http://lackingrhoticity.blogspot.com/2010/10/process-descriptors-in-freebsd-capsicum.html">process FDs</a> (added as part of the <a href="https://www.cl.cam.ac.uk/research/security/capsicum/freebsd.html">Capsicum</a> API).
<li><p>
<strong>Sandboxing:</strong>
It is easier to handle sandboxed processes with the Unix model than with the Windows model.
<p>
On Unix, it is easy for mutually distrusting processes to exchange FDs, because the kernel handles translating FD numbers between processes.
<p>
On Windows, if two processes want to exchange handles, one will generally have a process handle for the other process, which gives the first process control over the latter process. An alternative is to use a broker process for exchanging handles. Chrome's Windows sandbox implements this for exchanging handles between sandboxed processes (see <a href="https://chromium.googlesource.com/chromium/src/+/ff3f34ae4fefa2d0ecce5ba338b1afb9505c72d4/content/public/common/sandbox_init.h"><tt>BrokerDuplicateHandle()</tt></a> in <a href="https://chromium.googlesource.com/chromium/src/+/ff3f34ae4fefa2d0ecce5ba338b1afb9505c72d4/content/common/sandbox_win.h"><tt>content/common/sandbox_win.cc</tt></a>).
<li><p>
<strong>Namespace hazard:</strong>
Windows' handle-passing model creates a potential hazard: If <tt>DuplicateHandle()</tt> gives you a <tt>HANDLE</tt> that's an index into another process's handle table, you must be careful not to treat it as an index into your process's handle table. For example, don't call <tt>CloseHandle()</tt> on it, otherwise you might close an unrelated handle.
<p>
It is much rarer for this hazard to arise on Unix.
<li><p>
<strong>Unix emulation:</strong>
The difference between the FD/handle-passing models makes it tricky to emulate Unix domain sockets on Windows.
<p>
An accurate emulation would need to implement the semantics of how an FD object can be stored in the message buffer of a socket that might be read from concurrently by multiple processes.
<p>
Cygwin implements an emulation of Unix domain sockets on Windows, but it currently does not implement FD-passing.
</ul>
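<p>Returning to the "Cycles and GC" point above, here is a sketch of how a socket can end up referenced only by its own message buffer. It reuses the hypothetical <tt>send_fd()</tt> helper sketched earlier:</p>
<pre>#include <sys/socket.h>
#include <unistd.h>

void make_cycle(void) {
  int fds[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
  /* A message sent on fds[0] lands in the buffer that is read via
     fds[1], so after this call, fds[1]'s FD object holds a reference
     to itself. */
  send_fd(fds[0], fds[1]);
  close(fds[0]);
  close(fds[1]);
  /* The socket behind fds[1] is now kept alive only by its own
     message buffer. Reference counting alone would never free it;
     Linux's in-kernel GC has to reclaim it. */
}</pre>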
<p>(Updated on 2015/07/03 [<a href="https://web.archive.org/web/20150704005025/http://lackingrhoticity.blogspot.com/2015/05/passing-fds-handles-between-processes.html">previous version</a>]: Corrected the post to say that Cygwin does not implement FD-passing.)
<h2>Can cached memory accesses do double-sided row hammering? (2015-05-11)</h2>
<p>
There are <a href="https://groups.google.com/d/msg/rowhammer-discuss/ojgTgLr4q_M/zPTYwFTDRe0J">indications</a> that it is possible to cause bit flips in memory by row hammering without using <tt>CLFLUSH</tt>, using normal cached memory accesses.
<p>
This makes me wonder: Is it possible to do <em>double-sided</em> row hammering using cached memory accesses, or only <em>single-sided</em> row hammering? The former is more likely to cause bit flips, and might be the only way to cause bit flips on some machines, such as those using a 2x refresh rate -- i.e. those configured to refresh DRAM every 32ms instead of every 64ms. (See the <a href="http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">rowhammer blog post</a> for more background.)
<p>
The answer appears to be "no" -- at least on my test machine.
<p>
For this machine (which has a Sandy Bridge CPU), I figured out how physical addresses map <a href="http://lackingrhoticity.blogspot.com/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html">to cache sets</a> and <a href="http://lackingrhoticity.blogspot.com/2015/05/how-physical-addresses-map-to-rows-and-banks.html">to banks and rows in DRAM</a>. We can use these mappings to answer questions about what kinds of row hammering are possible using cached memory accesses.
<p>
More specifically, my question is this: For a machine with an N-way L3 cache, is it possible to pick N+1 addresses that map to the same cache set, where at least two of these addresses map to rows R-1 and R+1 in one bank (for some neighbouring row R)? If so, repeatedly accessing these addresses would cause cache misses that cause rows R-1 and R+1 to be repeatedly activated. That puts more stress on row R (the victim row) than repeatedly activating only row R-1 or row R+1.
<p>
The answer to this is "no": It's not possible to pick two such physical addresses.
<p>
Here's why:
<p>
Suppose we have two such addresses, A and B. Then:
<ul>
<li><p>The addresses map to the same bank, so:
<br>(1):
<tt>A[14:17] ^ A[18:21] =
B[14:17] ^ B[18:21]</tt>
<br>(using the bank/row XOR scheme I <a href="http://lackingrhoticity.blogspot.com/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html">described previously</a>)
<li><p>The addresses are 2 rows apart, so:
<br>(2): <tt>A[18:32] + 2 = B[18:32]</tt>
<li><p>The addresses map to the same cache set, so:
<br>(3): <tt>A[6:17] = B[6:17]</tt>
<br>(also, <tt>SliceHash(A[17:32]) = SliceHash(B[17:32])</tt>, but we don't need this property)
</ul>
<p>
(2) implies that <tt>A[19] = ~B[19]</tt>.
<p>
(3) implies that <tt>A[14:17] = B[14:17]</tt>. Combining that with (1) gives <tt>A[18:21] = B[18:21]</tt>. That implies <tt>A[19] = B[19]</tt>, which contradicts (2), so the constraints can't be satisfied.
<p>
Basically, the bank/row XOR scheme used by the memory controller's address mapping gets in the way.
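<p>As a sanity check on this argument, a brute-force search over the relevant bit fields (my sketch, not part of the original analysis) confirms that no pair of addresses satisfies all three constraints:</p>
<pre>#include <stdio.h>

int main(void) {
  int found = 0;
  /* bank_bits represents physical address bits 14-16, which
     constraint (3) forces to be equal for A and B. */
  for (unsigned bank_bits = 0; bank_bits < 8; bank_bits++) {
    for (unsigned row_a = 0; row_a + 2 < (1 << 15); row_a++) {
      unsigned row_b = row_a + 2;  /* constraint (2): rows 2 apart */
      unsigned bank_a = bank_bits ^ (row_a & 7);
      unsigned bank_b = bank_bits ^ (row_b & 7);
      if (bank_a == bank_b)  /* constraint (1): same bank */
        found++;
    }
  }
  printf("satisfying pairs: %d\n", found);  /* prints 0 */
  return 0;
}</pre>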
<p>
It looks like an attacker who is trying to do row hammering using only cached memory accesses (e.g. from Javascript) would, on this particular machine, have to try one of two routes:
<ul>
<li>Do single-sided row hammering only.
<li>Attempt double-sided hammering by picking two sets of 13 addresses (given a 12-way L3 cache), mapping to two different cache sets. This would likely halve the rate of memory accesses (relative to accessing one set of 13 addresses).
</ul>
<h3>Some further notes</h3>
<p>
I might be getting ahead of myself by doing this analysis, because the feasibility of causing rowhammer-induced bit flips via cached accesses has not yet been demonstrated on a wide selection of machines.
<p>
Also, I am asking whether double-sided hammering is <em>possible</em>, not <em>practical</em>, given the constraint that the attacker's memory accesses must go through the cache. While my analysis shows that the risk of causing double-sided hammering is zero with this constraint, the risk might only be negligible without this constraint. Even if a program can use <tt>CLFLUSH</tt> to bypass the cache, it doesn't necessarily have a way to determine which pairs of addresses map to rows that are spaced apart by two (see the <a href="http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">rowhammer blog post</a> for further discussion). Such a program might only be able to get such pairs by chance, by selecting addresses randomly, which would take a long time.
<p>
Lastly, my analysis makes assumptions about a machine's DRAM and L3 cache mappings. It would need to be redone for other machines where these assumptions aren't true.
<h2>How physical addresses map to rows and banks in DRAM (2015-05-04)</h2>
<p>
In my <a href="http://lackingrhoticity.blogspot.com/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html">previous blog post</a>, I discussed how Intel Sandy Bridge CPUs map physical addresses to locations in the L3 cache.
<p>
Now I'll discuss how these CPUs' memory controllers map physical addresses to locations in DRAM -- specifically, to row, bank and column numbers in DRAM modules. Let's call this the <em>DRAM address mapping</em>. I'll use one test machine as a case study.
<h3>Motivation: the rowhammer bug</h3>
<p>
I am interested in the DRAM address mapping because it is relevant to the <a href="http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">"rowhammer" bug</a>.
<p>
Rowhammer is a problem with some DRAM modules whereby certain pessimal memory access patterns can cause memory corruption. In these DRAMs, repeatedly activating a row of memory (termed "row hammering") can produce electrical disturbances that produce bit flips in vulnerable cells in adjacent rows of memory.
<p>
These repeated row activations can be caused by repeatedly accessing a pair of DRAM locations that are in different rows of the same bank of DRAM. Knowing the DRAM address mapping is useful because it tells us which pairs of addresses satisfy this <em>"same bank, different row" (SBDR)</em> property.
<h3>Guessing and checking an address mapping</h3>
<p>
For my case study, I have a test machine containing DRAM that is vulnerable to the rowhammer problem. Running <a href="https://github.com/google/rowhammer-test"><tt>rowhammer_test</tt></a> on this machine demonstrates bit flips.
<p>
I'd like to know what the DRAM address mapping is for this machine, but apparently it isn't publicly documented: This machine has a <a href="https://en.wikipedia.org/wiki/Sandy_Bridge">Sandy Bridge CPU</a>, but Intel don't document the address mapping used by these CPUs' memory controllers.
<p>
<tt>rowhammer_test</tt> does not actually need to identify SBDR address pairs. <tt>rowhammer_test</tt> just repeatedly tries hammering randomly chosen address pairs. Typically 1/8 or 1/16 of these pairs will be SBDR pairs, because our machine has 8 banks per DIMM (and 16 banks in total). So, while we don't need to know the DRAM address mapping to cause bit flips on this machine, knowing it would help us be more targeted in our testing.
<p>
Though the address mapping isn't documented, I found that I can make an educated guess at what the mapping is, based on the DRAM's geometry, and then verify the guess based on the physical addresses that <tt>rowhammer_test</tt> reports. <tt>rowhammer_test</tt> can report the physical addresses where bit flips occur ("<em>victims</em>") and the pairs of physical addresses that produce those bit flips ("<em>aggressors</em>"). Since these pairs must be SBDR pairs, we can check a hypothesised address mapping against this empirical data.
<h3>Memory geometry</h3>
<p>
The first step in hypothesising an address mapping for a machine is to check how many DIMMs the machine has and how these DIMMs are organised internally.
<p>
I can query information about the DIMMs using the <tt>decode-dimms</tt> tool on Linux. (In Ubuntu, <tt>decode-dimms</tt> is in the <tt>i2c-tools</tt> package.) This tool decodes the DIMMs' <a href="https://en.wikipedia.org/wiki/Serial_presence_detect">SPD (Serial Presence Detect)</a> metadata.
<p>
My test machine has 2 * 4GB <a href="https://en.wikipedia.org/wiki/SO-DIMM">SO-DIMM</a>s, giving 8GB of memory in total.
<p>
<tt>decode-dimms</tt> reports the following information for both of the DIMMs:
<pre>
Size 4096 MB
Banks x Rows x Columns x Bits 8 x 15 x 10 x 64
Ranks 2
</pre>
<p>
This means that, for each DIMM:
<ul>
<li>Each of the DIMM's banks contains 2^15 rows (32768 rows).
<li>Each row contains 2^10 * 64 bits = 2^16 bits = 2^13 bytes = 8 kbytes.
</ul>
<p>
Each DIMM has 2 ranks and 8 banks. Cross checking the capacity of the DIMM gives us the reported size, as expected:
<p>
8 kbytes per row * 32768 rows * 2 ranks * 8 banks = 4096 MB = 4 GB
<h3>The DRAM address mapping</h3>
<p>
On my test machine, it appears that the bits of physical addresses are used as follows:
<ul>
<li><strong>Bits 0-5:</strong> These are the lower 6 bits of the byte index within a row (i.e. the 6-bit index into a 64-byte cache line).
<li><strong>Bit 6:</strong> This is a 1-bit channel number, which selects between the 2 DIMMs.
<li><strong>Bits 7-13:</strong> These are the upper 7 bits of the index within a row (i.e. the upper bits of the column number).
<li><strong>Bits 14-16:</strong> These are XOR'd with the bottom 3 bits of the row number to give the 3-bit bank number.
<li><strong>Bit 17:</strong> This is a 1-bit rank number, which selects between the 2 ranks of a DIMM (which are typically the two sides of the DIMM's circuit board).
<li><strong>Bits 18-32:</strong> These are the 15-bit row number.
<li><strong>Bits 33+:</strong> These may be set because physical memory starts at physical addresses greater than 0.
</ul>
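<p>To make this concrete, here is a small decoder for the mapping just described (my own sketch; the field names are mine), along with a helper for the "same bank, different row" check from earlier:</p>
<pre>#include <stdint.h>
#include <stdio.h>

struct dram_addr {
  unsigned channel, rank, bank, row, byte_in_row;
};

/* Decode a physical address according to the hypothesised mapping
   for this machine (2 channels, 2 ranks, 8 banks, 15-bit rows). */
static struct dram_addr decode(uint64_t phys) {
  struct dram_addr d;
  d.channel = (phys >> 6) & 1;
  d.byte_in_row = (((phys >> 7) & 0x7f) << 6) | (phys & 0x3f);
  d.rank = (phys >> 17) & 1;
  d.row = (phys >> 18) & 0x7fff;
  /* Bank number: bits 14-16, XOR'd with the row number's low bits. */
  d.bank = ((phys >> 14) & 7) ^ (d.row & 7);
  return d;
}

/* Do two physical addresses form a "same bank, different row" pair? */
static int is_sbdr_pair(uint64_t a, uint64_t b) {
  struct dram_addr da = decode(a), db = decode(b);
  return da.channel == db.channel && da.rank == db.rank &&
         da.bank == db.bank && da.row != db.row;
}

int main(void) {
  struct dram_addr d = decode(0x2b8c0040);
  printf("channel=%u rank=%u bank=%u row=%u\n",
         d.channel, d.rank, d.bank, d.row);
  /* Two addresses 8 rows apart in the same bank: prints 1. */
  printf("SBDR: %d\n", is_sbdr_pair(0x2b8c0040, 0x2bac0040));
  return 0;
}</pre>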
<h3>Why is the mapping like that?</h3>
<p>
This mapping fits the results from <tt>rowhammer_test</tt> (see below), but we can also explain that the address bits are mapped this way to give good performance for typical memory access patterns, such as sequential accesses and strided accesses:
<ul>
<li><p><strong>Channel parallelism:</strong> Placing the channel number at bit 6 means that cache lines will alternate between the two channels (i.e. the two DIMMs), which can be accessed in parallel. This means that if we're accessing addresses sequentially, the load will be spread across the two channels.
<p>As an aside, <a href="https://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)">Ivy Bridge</a> (the successor to Sandy Bridge) apparently makes the mapping of the channel number more complex. An <a href="http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-1-Microprocessor/HC24.28.117-HotChips_IvyBridge_Power_04.pdf">Intel presentation</a> mentions "Channel hashing" and says that this "Allows channel selection to be made based on multiple address bits. Historically, it had been "A[6]". Allows more even distribution of memory accesses across channels."
<li><p><strong>Bank thrashing:</strong> Generally, column, bank and row numbers are arranged to minimise "<em>bank thrashing</em>" (frequently changing a bank's currently activated row).
<p>Some background: DRAM modules are organised into banks, which in turn are organised into rows. Each bank has a "currently activated row" whose contents are copied into a <em>row buffer</em> which acts as a cache that can be accessed quickly. Accessing a different row takes longer because that row must be activated first. So, the DRAM address mapping places SBDR pairs as far apart as possible in physical address space.
<p>Row hammering is a special case of bank thrashing where two particular rows are repeatedly activated (perhaps deliberately).
<li><p><strong>Bank parallelism:</strong> Banks can be accessed in parallel (though to a lesser degree than channels), so the bank number changes before the row number as the address is increased.
<li><p><strong>XOR scheme:</strong> XORing the row number's lower bits into the bank number is a trick to avoid bank thrashing when accessing arrays with large strides. For example, in the mapping above, the XORing causes addresses X and X+256k to be placed in different banks instead of being an SBDR pair.
<p>Bank/row XORing schemes are described in various places, such as:
<ul>
<li>David Tawei Wang's PhD thesis, "<a href="http://www.ece.umd.edu/~blj/papers/thesis-PhD-wang--DRAM.pdf">Modern DRAM memory systems: Performance analysis and scheduling algorithms</a>", 2005. See section 5.3.5, "Bank Address Aliasing (stride collision)". This thesis has some good background info on DRAM in general.
<li>The paper "<a href="https://impact.asu.edu/cse520fa08/pravin.pdf">Reducing DRAM Latencies with an Integrated Memory Hierarchy Design</a>", 2001. See Figure 3.
</ul>
</ul>
<h3>Checking against <tt>rowhammer_test</tt>'s output</h3>
<p>
I ran <a href="https://github.com/google/rowhammer-test/tree/master/extended_test"><tt>rowhammer_test_ext</tt></a> (the extended version of <tt>rowhammer_test</tt>) on my test machine for 6 hours, and it found repeatable bit flips at 22 locations. (See the <a href="https://github.com/google/rowhammer-test/tree/2e35c726fdcf96d5f249094487fd2ba0464d0021/dram_physaddr_mapping">raw data and analysis code</a>.)
<p>
The row hammering test generates a set of (A1, A2, V) triples, where:
<ul>
<li>V is the victim address where we see the bit flip.
<li>A1 and A2 are the aggressor addresses that we hammer.
<li>We sort A1 and A2 so that A1 is closer to V than A2 is. We tentatively assume that the closer address, A1, is the one that actually causes the bit flips (though this wouldn't necessarily be true if the DRAM address mapping were more complicated).
</ul>
<p>
There are three properties we expect to hold for all of these results:
<ul>
<li><p><strong>Row:</strong> A1 and V's row numbers should differ by 1 -- i.e. they should be in adjacent rows. (A2 can have any row number.)
<p>This property makes it easy to work out where the bottom bits of the row number are in the physical address.
<p>We find this property holds for all but 2 of the results. In those 2 results, the row numbers differ by 3 rather than 1.
<li><p><strong>Bank:</strong> V, A1 and A2 should have the same bank number. Indeed, we find this property holds for all 22 results. This only holds when applying the row/bank XORing scheme.
<li><p><strong>Channel:</strong> V, A1 and A2 should have the same channel number. This holds for all the results. It happens that all of our results have channel=0, because <tt>rowhammer_test</tt> only selects 4k-aligned addresses and so only tests one channel. (Maybe this could be considered a bug.)
</ul>
<h3>Possible further testing</h3>
<p>
There are two further experiments we could run to check whether the DRAM address mapping evaluates the SBDR property correctly, which I haven't tried yet:
<ul>
<li><p><strong>Timing tests:</strong> Accessing SBDR address pairs repeatedly should be slower than accessing non-SBDR pairs repeatedly, because the former cause row activations and the latter don't.
<li><p><strong>Exhaustive rowhammer testing:</strong> Once we've found an aggressor address, A1, that causes a repeatable bit flip, we can test this against many values of address A2. Hammering (A1, A2) can produce bit flips only if this is an SBDR pair.
</ul>
<p>
Furthermore, taking out one DIMM from our test machine should remove the channel bit from the DRAM address mapping and change the aggressor/victim addresses accordingly. We could check whether this is the case.
<h2>L3 cache mapping on Sandy Bridge CPUs (2015-04-27)</h2>
<p>
In 2013, some researchers reverse-engineered how <a href="http://en.wikipedia.org/wiki/Sandy_Bridge">Intel Sandy Bridge CPUs</a> map physical addresses to cache sets in the L3 cache (the last-level cache). They were interested in the cache mapping because it can be used to defeat kernel <a href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR</a>. I'm interested because the cache mapping can be used to test whether cached memory accesses can do <a href="http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">row hammering</a> (which can cause exploitable bit flips in some DRAM devices).
<p>
The researchers published the details in the paper "<a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6547110">Practical Timing Side Channel Attacks Against Kernel Space ASLR</a>" (Ralf Hund, Carsten Willems and Thorsten Holz).
<p>
They only published the mapping for 4-core CPUs, but I have figured out the mapping for 2-core CPUs as well.
<p>
Some background:
<ul>
<li><p>On Sandy Bridge CPUs, the L3 cache is divided into <em>slices</em>. Physical addresses are hashed to determine which slice of the L3 cache they will be stored in.
<li><p>The L3 cache is <em>distributed</em> and <em>ring-based</em>. There is one slice per core, but all the cores in a CPU can access all the cache slices via a ring bus which connects all the cores and their caches together.
<li><p>When a core accesses a memory location, the location will be slightly slower to access if it maps to a different core's cache slice, because it would take one or two hops around the ring bus to access it. The protocol used on the ring bus is based on QPI (Intel's <a href="http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect">QuickPath Interconnect</a>). (QPI is a protocol used for connecting multiple CPUs together on high-end multi-socket systems.)
<li><p>Each cache slice contains 2048 cache sets.
<li><p>On lower-end CPUs, cache sets are 12-way associative, so a cache slice is 1.5MB in size (2048 sets * 12 ways * 64 bytes per cache line = 1.5MB).
<li><p>On higher-end CPUs, cache sets are 16-way associative, so a cache slice is 2MB in size (2048 sets * 16 ways * 64 bytes per cache line = 2MB).
</ul>
<h3>Cache mapping</h3>
<p>
The researchers (Hund et al) figured out that the L3 cache uses the bits of a physical address as follows:
<ul>
<li><strong>Bits 0-5:</strong> These give the 6-bit byte offset within a 64-byte cache line.
<li><strong>Bits 6-16:</strong> These give the 11-bit number of the cache set within a cache slice.
<li><strong>Bits 17-31:</strong> These are hashed to determine which cache slice to use (see below).
<li><strong>Bits 32+:</strong> Not used.
</ul>
<p>
The hash function used to choose the cache slice is as follows:
<ul>
<li>
<p>On a 4-core CPU, there are 4 cache slices, so the slice number is 2-bit. The two bits of the slice number are h1 and h2, where:
<ul>
<li>h1 is the XOR of physical address bits 18, 19, 21, 23, 25, 27, 29, 30, 31.
<li>h2 is the XOR of physical address bits 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31.
</ul>
<li>
<p>On a 2-core CPU, there are 2 cache slices, so the slice number is 1-bit. The slice number is the XOR of physical address bits 17, 18, 20, 22, 24, 25, 26, 27, 28, 30.
<p>Notice that this is equal to the XOR of h1 and h2. (Bits 19, 21, 23, 29 and 31 cancel out when XORing h1 and h2.) This is the part I figured out.
</ul>
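<p>Expressed as code, the mapping above looks like this (my transcription of the bit lists, not code from the paper):</p>
<pre>#include <stdint.h>
#include <stdio.h>

/* XOR together the given bits of a physical address. */
static unsigned xor_bits(uint64_t phys, const int *bits, int count) {
  unsigned result = 0;
  for (int i = 0; i < count; i++)
    result ^= (phys >> bits[i]) & 1;
  return result;
}

/* 11-bit cache set number within a slice: bits 6-16. */
static unsigned cache_set(uint64_t phys) {
  return (phys >> 6) & 0x7ff;
}

/* 2-bit slice number (h2:h1) for a 4-core CPU, per Hund et al. */
static unsigned slice_4core(uint64_t phys) {
  static const int h1_bits[] = {18, 19, 21, 23, 25, 27, 29, 30, 31};
  static const int h2_bits[] = {17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31};
  return (xor_bits(phys, h2_bits, 11) << 1) | xor_bits(phys, h1_bits, 9);
}

/* 1-bit slice number for a 2-core CPU: the XOR of h1 and h2. */
static unsigned slice_2core(uint64_t phys) {
  static const int bits[] = {17, 18, 20, 22, 24, 25, 26, 27, 28, 30};
  return xor_bits(phys, bits, 10);
}

int main(void) {
  uint64_t phys = 0x12345678;
  printf("set=%u slice4=%u slice2=%u\n",
         cache_set(phys), slice_4core(phys), slice_2core(phys));
  return 0;
}</pre>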
<h3>Verifying the cache mapping</h3>
<p>
I figured out the 2-core hash function by making an educated guess and verifying it.
<p>
I verified the guess by writing a <a href="https://github.com/google/rowhammer-test/tree/9a426c30ac2cc1bad0a6714e3e75e763bfdee4ea/cache_analysis">C++ program</a> which:
<ul>
<li><p>Picks N physical memory addresses that should map to the same cache set, according to the cache mapping that we guessed.
<p>We use Linux's <tt>/proc/PID/pagemap</tt> interface to determine which physical addresses we have access to.
<li><p>Measures the time it takes to access the N addresses. Specifically, the program accesses the first N-1 addresses and then times accessing the Nth address.
</ul>
<p>
The program does that for multiple values of N.
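<p>The timed part of such a program might look something like this simplified sketch (the real program linked above is more careful about serialisation, and uses <tt>/proc/PID/pagemap</tt> to pick the addresses):</p>
<pre>#include <stdint.h>

static uint64_t rdtsc(void) {
  uint32_t lo, hi;
  __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t) hi << 32) | lo;
}

/* Given n addresses that should map to the same cache set, access
   the first n-1, then time an access to the n'th. */
static uint64_t time_nth_access(volatile char **addrs, int n) {
  for (int i = 0; i < n - 1; i++)
    (void) *addrs[i];
  uint64_t start = rdtsc();
  (void) *addrs[n - 1];
  return rdtsc() - start;
}</pre>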
<p>
If we guessed the cache mapping correctly, then, on a CPU with a 12-way cache, we should see the memory access time spike upwards at N=13. This is because, at N=13, the memory locations we're accessing no longer fit within the 12-way cache set, and we'll get L3 cache misses. The memory access time will go up from the L3 cache's latency to DRAM's latency.
<p>
That increase is indeed what we see:
<p align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis1173Qs7r9C46oAka3kLiSC56gbGYRvGAAa1xkHYsBScEO4zFJDd3KfBqUb7lz9-TL3TJ7GSss6W4iVvbQHcW6ACxAv07HKLrcWYWLGigPd3hG9acypiKxcc2k9haoVFwN-zlr6QAZJTN/s1600/graph_real_hash_func.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEis1173Qs7r9C46oAka3kLiSC56gbGYRvGAAa1xkHYsBScEO4zFJDd3KfBqUb7lz9-TL3TJ7GSss6W4iVvbQHcW6ACxAv07HKLrcWYWLGigPd3hG9acypiKxcc2k9haoVFwN-zlr6QAZJTN/s400/graph_real_hash_func.png" /></a>
<p>
(Note: This also assumes that the cache uses an LRU or Pseudo-LRU eviction policy, which is what Sandy Bridge uses. However, the cache eviction policy changed in Ivy Bridge, which is the successor to Sandy Bridge -- see below.)
<p>
On the other hand, if we guessed the cache mapping incorrectly, the memory access time will go upwards gradually at higher values of N. Specifically, on a 2-cache-slice CPU, if we get the address-to-slice hash function wrong, we'll see the access time reach DRAM latency at N=13*2, on average. That's because our N physical addresses will be distributed across 2 slices, so on average it will take 13*2 addresses before the 2 cache sets in the slices overflow and produce cache misses.
<p>
That's what we get if we wrongly guess that the address-to-slice hash function is h1 (which was my first guess):
<p align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZKIQSdL0k4datZuUSU9XSLOdAii3iAEPiSn9G1srB-Si_NJbAviqaDNoTr5ogN5vBnNLy8Kp97gH6B4cbFrMHpDe5_D029hBseLzpIztCyilQGE9Ec5VNgUhBfVewQTFEWwIe1GnqKH7a/s1600/graph_h1_hash_func.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZKIQSdL0k4datZuUSU9XSLOdAii3iAEPiSn9G1srB-Si_NJbAviqaDNoTr5ogN5vBnNLy8Kp97gH6B4cbFrMHpDe5_D029hBseLzpIztCyilQGE9Ec5VNgUhBfVewQTFEWwIe1GnqKH7a/s400/graph_h1_hash_func.png" /></a>
<h3>Further verification</h3>
<p>
What happens if our address-to-slice hash function is mostly right, but not completely right? What happens if one of the bits we're XORing is wrong? (For example, maybe bit 17 should really be left out or bit 19 should really be XOR'd too.) Would we notice? Yes, we can test for that.
<p>
For each bit from 17 to 31, we can try inverting whether that bit is included in the set of bits that our address-to-slice hash function XORs together.
<p>
Here's the graph we get for the resulting hash functions:
<p align="center">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxabX7NfxM2LwBQG2dvsObAZ_YlxQF7C0a3xVhxmK3MNQCT4Q8CmWNJIEbhp5KD5CQFg_GV_jMNdl5LruUf7D7UpDc8D0PzoJoowecBRZBqfSm3syIbYFenGbS1sJo8hYjJ-ASh_rfsBf3/s1600/graph_all_hash_funcs.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxabX7NfxM2LwBQG2dvsObAZ_YlxQF7C0a3xVhxmK3MNQCT4Q8CmWNJIEbhp5KD5CQFg_GV_jMNdl5LruUf7D7UpDc8D0PzoJoowecBRZBqfSm3syIbYFenGbS1sJo8hYjJ-ASh_rfsBf3/s400/graph_all_hash_funcs.png" /></a>
<p>
Our original hash function (in bold) shows the latency going up soonest, at N=13, so it is likely to be the correct hash function.
<h3>Test machine</h3>
<p>
I ran my tests on a machine with a 2-core, 4-hyperthread Sandy Bridge CPU that reports the following in <tt>/proc/cpuinfo</tt> on Linux:
<pre>
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping : 7
cache size : 3072 KB
cpu cores : 2
</pre>
<h3>Ivy Bridge</h3>
<p>
This L3 cache mapping appears to apply to <a href="http://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)">Ivy Bridge</a> CPUs too. I ran the same tests on a machine with an Ivy Bridge CPU (also 2-core, 4-hyperthread) and initially got the same shape graphs.
<p>
However, the results weren't consistently reproducible on this machine. Later runs showed higher memory access times for N<=12.
<p>
This is consistent with reports that Ivy Bridge's L3 cache uses DIP (Dynamic Insertion Policy) for its cache eviction policy, in order to protect against cache thrashing. DIP switches dynamically between an LRU policy (better for smaller working sets which fit in the cache) and a BIP policy (better for larger working sets which don't fit in the cache). For N>12, my test probably generates enough cache misses to cause the cache to switch to BIP mode. Note that this means the order in which we test values of N can affect the results we get.
<p>
This machine's <tt>/proc/cpuinfo</tt> reports:
<pre>
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
stepping : 9
cache size : 3072 KB
cpu cores : 2
</pre>
<p>
See Henry Wong's blog post, "<a href="http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/">Intel Ivy Bridge Cache Replacement Policy</a>", which compares Sandy Bridge and Ivy Bridge.
<p>
Also see the paper "<a href="http://researcher.watson.ibm.com/researcher/files/us-moinqureshi/papers-dip.pdf">Adaptive Insertion Policies for High Performance Caching</a>" (Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr., Joel Emer), which describes DIP (Dynamic Insertion Policy).
<h3>Thanks</h3>
<p>
Thanks to Yossef Oren for pointing me to the paper by Hund et al, which is referenced by the paper he coauthored, "<a href="http://arxiv.org/abs/1502.07373">The Spy in the Sandbox -- Practical Cache Attacks in Javascript</a>" (Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan, Angelos D. Keromytis).
<h2>The DRAM rowhammer bug is exploitable (2015-03-10)</h2>
<p>I've been researching the <a href="https://en.wikipedia.org/wiki/Row_hammer">DRAM rowhammer issue</a> and its security implications for a while. We've finally published our findings on the Project Zero blog:
<a href="http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">Exploiting the DRAM rowhammer bug to gain kernel privileges</a>.</p>
<h2>Conditionalising C/C++ code: "#ifdef FOO" vs. "#if FOO" (2015-01-21)</h2>
<p>
Is it better to use <code>#ifdef PLATFORM</code> or <code>#if PLATFORM</code> when writing code that needs to be conditionalised according to OS, CPU architecture, etc.?
<p>
Chromium's codebase uses the former. For example, it uses <code>#ifdef OS_WIN</code> or <code>#if defined(OS_WIN)</code>, where <code>OS_WIN</code> is <code>#define</code>d on Windows (by <code>build/build_config.h</code>) and undefined elsewhere.
<p>
In contrast, NaCl's codebase uses the latter. It uses <code>#if NACL_WINDOWS</code> or <code>if (NACL_WINDOWS)</code>, where <code>NACL_WINDOWS</code> is always <code>#define</code>d: as 1 on Windows and 0 elsewhere.
<p>
This latter approach has the benefit of catching some mistakes:
<ul>
<li>If you mistype the macro name in <code>#if NACL_WINDOWS</code>, you can get a compiler warning.
<p>However, this only works if you enable GCC/Clang's <code>-Wundef</code> warning. Microsoft's compiler (MSVC) has a similar warning, <a href="http://stackoverflow.com/questions/26787161/what-is-msvc-equivalent-to-gccs-wundef"><code>C4668</code></a>, but it's effectively unusable because it produces warnings about system header files.
<li>You can sometimes write <code>if (PLATFORM)</code> instead of <code>#if PLATFORM</code>. The <code>if()</code> version has the advantage that the code in the <code>if()</code> block will be compile-tested on all platforms, not just those where <code>PLATFORM == 1</code>. This can help catch mistakes earlier.
<p>This doesn't work if the code block uses functions that are only defined on <code>PLATFORM</code>, though.
</ul>
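<p>A small example (mine, not from either codebase) of the failure mode that <code>-Wundef</code> catches:</p>
<pre>#define NACL_WINDOWS 0  /* always defined: 1 on Windows, 0 elsewhere */

/* Typo below: NACL_WINDOSW is undefined, so "#if" silently treats it
   as 0 -- but compiling with -Wundef warns about it. */
#if NACL_WINDOSW
#error "this branch is dead on every platform"
#endif

/* With the "#ifdef OS_WIN" style, the equivalent typo ("#ifdef
   OS_WINN") is never diagnosed: the block is just silently skipped. */</pre>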
<p>
See also: <a href="https://sourceware.org/glibc/wiki/Wundef">The Great -Wundef purge</a>
<h2>Implementing fork() on the Mill CPU (2014-07-23)</h2>
<p>
The <a href="http://millcomputing.com">Mill</a> is a new CPU architecture that claims to provide high performance but at a much better performance-per-watt than conventional CPUs that use out-of-order execution.
<p>
The Mill achieves this by making various architectural simplifications. One of those is to remove the <a href="http://en.wikipedia.org/wiki/Translation_lookaside_buffer">TLB</a> from the fast path of memory accesses. Rather than having the TLB between the CPU core and the cache, the Mill's TLB is between the cache and main memory.
<h4>OS models</h4>
<p>
This means the Mill is best suited for running <a href="http://en.wikipedia.org/wiki/Single_address_space_operating_system">Single Address Space operating systems</a> (SASOS). The intent is that different processes will live at different addresses within a shared 64-bit address space. A process runs with permissions to access restricted ranges of this address space. Switching between processes is therefore just a matter of switching those permissions, which is fast on the Mill. It doesn't involve an expensive flush of the TLB (as on a conventional OS). It doesn't involve flushing the Mill's virtual-address-tagged (<a href="http://en.wikipedia.org/wiki/CPU_cache">VIVT</a>) cache.
<p>
This runs into a problem if we want to run Unix programs that use fork(), though. Use of fork() assumes that multiple processes will want to use the same virtual addresses.
<p>
The Mill developers have said they have a scheme for handling fork(), but they haven't said what it is, so I'm going to speculate. :-)
<h4>Context switching</h4>
<p>
If you create forked processes that share a range of virtual address space, a Mill OS can just flush the TLB and cache for those ranges whenever it needs to context switch between the two processes.
<p>
Basically, a Mill OS can act like a Separate Address Space OS when handling forked processes.
<h4>Switching costs</h4>
<p>
How expensive that would be depends on how the TLB and cache work.
<ul>
<li><p>
In conventional CPUs, the TLB is per-core. The Mill's TLB might be shared between cores, though, if the Mill has higher-level caches that are both virtually-tagged and shared between cores.
<p>If that's the case, forked processes wouldn't be able to run concurrently on multiple cores.
<li><p>
Flushing the TLB and/or cache when context switching between forked processes would be slow.
<p>However, the OS would only have to flush the address ranges that have diverged between the forked processes -- that is, those that have been written to (as copy-on-write) or mmap()'d.
</ul>
<p>
This would be fine for most uses of fork() on Unix, where execve() is called shortly after fork(). Forked processes usually exist only briefly, so there's little need to context-switch between them.
<h4>Zygotes</h4>
<p>
Forked processes are sometimes used as an optimisation, though. Android and Linux Chrome fork many child processes from a parent "zygote" process. This is a way to speed up process creation and reduce memory usage by sharing pages. This technique would become a pessimisation on the Mill. Android and Chrome would have to find a different way to do this.
<h4>Mill portals</h4>
<p>
Implementing fork() on the Mill doesn't seem very difficult, then.
<p>
However, there is one Mill feature that fork() might interfere with.
<p>
Mill provides "portals", a mechanism for one protection domain (a process in the single 64-bit address space) to call into another. A Mill OS would have to ensure that a forked process can't be invoked directly via a portal if that forked process has been context-switched out. The portal would have to be redirected to a routine that waits for the forked process to be context-switched back in before invoking it. Alternatively, a Mill OS could just disallow invoking forked process via portals entirely.
<p>
This doesn't seem like a big problem. Portals are a Mill-specific feature for optimising inter-process calls, whereas fork() is a "legacy" Unix feature. We want fork() for running existing Unix programs, which won't need to define Mill portals.
<p>
There might be an issue if a portal invocation needs to return to a forked process. For example, Unix syscalls might be implemented as portal invocations. Maybe these invocations would need to go via a proxy that knows how to context-switch the calling process back in.
<h2>How to do history-sensitive merges in Git (2014-03-26)</h2>
<p>
Merging in Git is usually not history-sensitive. By this I mean: if you're merging branches A and B together, Git looks at the content at the tips of branches A and B, and the content of the common ancestor commit(s) of A and B, but it doesn't look at any commits in between. Git just does a 3-way merge. This can make merging more painful than it needs to be.
<p>
History-sensitive merging is a merge algorithm that does look at the commit history. It can automatically resolve conflicts in cases where non-history-sensitive merging won't. It can also present conflicts more readably: It can reduce the size of a conflict and show which upstream change conflicted with your local change.
<p>
This is a short "how to" for doing a history-sensitive merge in Git.
<p>
My case study is: Merging LLVM 3.4 into PNaCl's branch of LLVM, which I did recently.
<h3>
The problem
</h3>
<p>
PNaCl has a branch of LLVM with various patches applied. (As an aside, we should really upstream those patches, to reduce the difficulty of merging from upstream.) Before the merge, this branch was based on LLVM 3.3 (commit <code>$FROM</code>). We wanted to upgrade it to being based on LLVM 3.4 (commit <code>$TO</code>).
<p>
Simply doing
<pre>
git merge $TO
</pre>
<p>
produced about 160 conflicts, as counted by "<code>git grep '<<<<<<<' | wc -l</code>".
<p>
A large number of those conflicts occurred because pnacl-llvm contained various changes cherry-picked from upstream LLVM. Often those changes touched code that was later refactored upstream. That will produce a conflict if we do a 3-way merge.
<p>
For example, pnacl-llvm cherry-picked various changes that add test files with "<code>CHECK:</code>" lines. Upstream later refactored these to "<code>CHECK-LABEL:</code>". If we do "<code>git merge $TO</code>", Git will see that the two branches added a file with the same name but different contents. Bam: that's a conflict. Git's 3-way merge doesn't look at the history, so it doesn't notice that both branches contain identical commits that add the file with a "<code>CHECK:</code>" line, and just one of the two branches later modifies that file.
<h3>
A solution
</h3>
<p>
We can do better just by telling Git to merge each upstream LLVM change one by one:
<pre>
set -eu # Stop on error
commits="$(git rev-list --reverse $FROM..$TO)"
for commit in $commits; do
echo Merging $commit
git merge --no-edit $commit
done
</pre>
<p>
When this reaches a cherry-picked change, it should merge it with no conflict, and later merge the <code>CHECK-LABEL</code> refactoring with no conflict.
<p>
When this does hit a conflict -- i.e. when an upstream change conflicts with a local change -- it shows which upstream change conflicted. This provides more context for resolving the conflict.
<p>
This script is restartable. After you've resolved a conflict and <code>git commit</code>'d the resolution, it's fine to re-run the script. It will skip over the already-merged upstream changes.
<p>
This approach generates many small merges, usually with small conflicts, which can be resolved incrementally and committed to a branch. That is often easier than resolving a large set of conflicts resulting from one big merge, where it's also difficult to save or reuse a work-in-progress conflict resolution.
<p>
The one-by-one merge approach has some disadvantages. If a change is committed and reverted upstream, and conflicts with your local branch, you'll be prompted to resolve the conflict twice, back and forth. If you notice the change is committed, reverted and then re-applied upstream, it can be quicker to manually <code>git merge</code> the re-applied version.
<h3>
Going faster
</h3>
<p>
The script above goes quite slowly because it merges each change in turn. An optimisation is: <em>explicitly merge only those upstream changes that modify files that our local branch also modifies</em> (since all other upstream changes will be trivially merged).
<pre>
set -eu # Stop on error
files="$(git diff --name-only $FROM origin/master)"
# Relatively slow
git rev-list --reverse $FROM..$TO $files > tmp_commits_to_merge
echo $TO >> tmp_commits_to_merge # Don't miss this commit
for commit in $(cat tmp_commits_to_merge); do
echo Merging $commit
git merge --no-edit $commit
done
</pre>
<h3>
Cleaning up
</h3>
<p>
This approach produces a lot of Git merge commits. If you don't want to keep this history, you can squash the merges down into a single merge commit:
<pre>
tree=$(git log --pretty=format:%T --no-walk HEAD)
commit=$(git commit-tree -m 'Commit message for merge...' \
-p $(git merge-base HEAD origin/master) \
-p $(git merge-base HEAD upstream/master) \
$tree)
git push . $commit:my-merge
</pre>
<p>
I believe the term "history-sensitive merge" comes from Darcs, a version control system which implements history-sensitive merging.
<h2>Handling crashes on Mac OS X: ordering of Mach exceptions versus POSIX signals (2013-08-07)</h2>
<p>
Mac OS X is a curious operating system because its kernel is derived from two kernel codebases -- the Mach kernel and a BSD kernel -- that have been glued together.
<p>
From these two ancestors, OS X inherits two different mechanisms for processes to handle hardware exceptions (a.k.a. faults):
<ul>
<li> POSIX signals:
<ul>
<li>In-process only.
<li>Registered per-process.
<li>The handler is always called on the thread that produced the fault.
</ul>
<li> Mach exceptions:
<ul>
<li>Allow both in-process and out-of-process handling.
<li>Can be registered per-thread and per-process.
<li>The handler is invoked via Mach RPC.
</ul>
</ul>
<p>
On OS X, it's possible for a process to have both POSIX signal handlers and Mach exception handlers registered.
<p>
It's not immediately obvious which of the two handlers will take priority and get invoked first, but experimentation shows that it's the Mach handler. If a process faults with a memory access error, the Mach exception handler for EXC_BAD_ACCESS gets invoked first, and if this handler returns KERN_FAILURE, the POSIX signal handler for SIGBUS will then be invoked.
<p>
However, that's not the full story.
<p>
OS X has a built-in Crash Reporter service which will create a crash dump if an application crashes and pop up a dialog box. Crash Reporter works via Mach exception handling and runs out-of-process. An application will normally have Crash Reporter's crash handler registered as one of its default Mach exception handlers. But what stops this handler from interfering with normal use of POSIX signals, if Mach exception handlers take priority over POSIX signals?
<p>
The answer is that OS X's kernel has three steps for handling a hardware exception:
<ol>
<li>
<p>
<strong>First-chance Mach exception handler:</strong>
The kernel tries to invoke the Mach exception handler registered for the type of fault that occurred, e.g. EXC_BAD_ACCESS. If there's no handler registered, it skips this step. If the handler returns KERN_SUCCESS, the kernel resumes the thread that faulted.
<li>
<p>
<strong>POSIX signal handler:</strong>
If the Mach exception handler was absent or returned KERN_FAILURE, the kernel tries to invoke the POSIX signal handler for the type of fault that occurred, e.g. SIGBUS.
<li>
<p>
<strong>Second-chance Mach exception handler:</strong>
If invoking the SIGBUS handler fails (either because no SIGBUS handler is registered, or because writing the signal frame failed, or because the signal is blocked by the thread's signal mask), then the kernel starts to terminate the process, and it tries to invoke the Mach exception handler registered for EXC_CRASH.
<p>
Crash Reporter's handler is registered via EXC_CRASH.
<p>
Since EXC_CRASH is generated on the code path for process termination, EXC_CRASH cannot be handled in-process. At this point, the process has been partially shut down. Its threads are not running any more, but some of its state has not been destroyed yet and so can be read by a crash handler.
<p>
If you try to handle EXC_CRASH in-process, the process will just hang. That's presumably because the Mach RPC for invoking the EXC_CRASH handler gets sent, and the thread that would receive it has been suspended, but the RPC doesn't fail because the process's Mach ports have not been freed yet.
<p>
<a href="https://github.com/anatol/xnu/blob/master/osfmk/kern/exception.c#L419">abnormal_exit_notify()</a> in osfmk/kern/exception.c is what implements sending EXC_CRASH.
<p>
Core dumps are produced by this same code path for abnormal process termination.
</ol>
<p>
EXC_CRASH seems to have been added in OS X 10.5 ("Leopard"):
<blockquote>
"On Tiger and earlier, the CrashReporter would generate a crash report
for a Mach exception even if it was later delivered as a signal and
handled by the process. Since Leopard, though, a crash report is only
generated if the signal is not handled. So, handling signals is
sufficient to suppress crash reports these days."
(<a href="http://lists.apple.com/archives/cocoa-dev/2009/Sep/msg01640.html">Ken Thomases, September 2009</a>)
</blockquote>
<p>
Chromium has some code to do just that. Its <a href="https://src.chromium.org/viewvc/chrome/trunk/src/base/mac/os_crash_dumps.cc?revision=178706">DisableOSCrashDumps()</a> function will register a signal handler in order to <a href="https://src.chromium.org/viewvc/chrome/trunk/src/base/mac/os_crash_dumps.h?revision=95618">suppress Crash Reporter's crash reports</a>. Its signal handler calls _exit() to prevent Crash Reporter's EXC_CRASH handler from kicking in. Another way to do this would have been to set the process's exception port for EXC_CRASH to MACH_PORT_NULL.
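<p>
Here is a minimal sketch of that approach (this is not Chromium's actual code): install handlers for the common fault signals and have them call _exit(), so that the process never reaches the abnormal-termination path that raises EXC_CRASH.
<pre>
#include <signal.h>
#include <unistd.h>

static void fatal_signal_handler(int sig) {
  /* Exit immediately, before the kernel's abnormal-termination
     path can raise EXC_CRASH and wake up Crash Reporter. */
  _exit(1);
}

static void disable_crash_reporter(void) {
  int signals[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGABRT };
  unsigned int i;
  struct sigaction act;
  sigemptyset(&act.sa_mask);
  act.sa_flags = 0;
  act.sa_handler = fatal_signal_handler;
  for (i = 0; i < sizeof(signals) / sizeof(signals[0]); i++)
    sigaction(signals[i], &act, NULL);
}
</pre>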
<h2>Simplifying LLVM IR for PNaCl (2013-06-29)</h2>
<p>
Lately I've been working on Portable Native Client ("PNaCl" for short).
<p>
<a href="http://en.wikipedia.org/wiki/Google_Native_Client">Native Client</a> (NaCl) is a sandboxing system that allows safe execution of native code in a web browser -- typically C/C++ code compiled to x86-32, x86-64 or ARM native code (with some support for MIPS too). The problem with NaCl has always been that you have to compile your program multiple times, once for each target architecture. PNaCl aims to solve that, so that your program can be compiled once to an architecture-neutral executable which is then translated to x86 or ARM inside the web browser.
<p>
PNaCl is based on <a href="http://llvm.org/">LLVM</a>, and PNaCl's executable format is based on LLVM's IR ("Intermediate Representation") language and its binary encoding, LLVM bitcode. LLVM's <a href="http://llvm.org/docs/LangRef.html">IR language</a> is quite complex, so we're pruning out and expanding out a lot of features in order to ensure that PNaCl will be maintainable in the long term and to reduce executable sizes.
<p>
As an example, consider this fragment of C code:
<pre>
struct foo {
int x;
int y;
};
void func(struct foo *ptr) {
ptr->y = 123;
}
</pre>
<p>
Normally in LLVM this compiles to the following LLVM assembly code:
<pre>
%struct.foo = type { i32, i32 }
define void @func(%struct.foo* nocapture %ptr) #0 {
entry:
%y = getelementptr inbounds %struct.foo* %ptr, i32 0, i32 1
store i32 123, i32* %y, align 4, !tbaa !0
ret void
}
attributes #0 = { nounwind }
!0 = metadata !{metadata !"int", metadata !1}
</pre>
<p>
With PNaCl, we strip out the types, names and metadata so that this becomes:
<pre>
define internal void @67(i32) {
%2 = add i32 %0, 4
%3 = inttoptr i32 %2 to i32*
store i32 123, i32* %3, align 1
ret void
}
</pre>
<p>
The definition of the struct type goes away. The "<a href="http://llvm.org/docs/LangRef.html#getelementptr-instruction">getelementptr</a>" instruction is how LLVM handles pointer arithmetic for indexing into structs and arrays -- this gets expanded out and is replaced with pointer arithmetic on integers. The "%struct.foo*" pointer type is replaced with the i32 type. The "<a href="http://llvm.org/docs/LangRef.html#inttoptr-to-instruction">inttoptr</a>" instruction remains as a vestige of LLVM's type system: every "load" or "store" instruction in a PNaCl executable is preceded by an "inttoptr" so that the program remains well-typed by LLVM's rules.
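<p>
In C terms, the expanded form corresponds to doing the struct indexing by hand: the byte offset of the field (4 here) is computed at compile time, and the store goes through a manually-computed address. Roughly:
<pre>
void func(struct foo *ptr) {
  /* Equivalent of the expanded IR: add the byte offset of 'y'
     and store through the resulting address. */
  *(int *) ((char *) ptr + 4) = 123;
}
</pre>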
<p>
This is actually rather similar to what Emscripten and asm.js do. Like PNaCl, <a href="http://en.wikipedia.org/wiki/Emscripten">Emscripten</a> is a system for running C/C++ programs in the web browser, but it compiles programs down to Javascript rather than using LLVM bitcode as an interchange format. Both PNaCl and Emscripten use Clang as a C/C++ front end, and run LLVM optimisation passes on the resulting LLVM IR code. In compiling to Javascript, Emscripten performs similar simplification steps for lowering IR, and <a href="http://asmjs.org/">asm.js</a> codifies the resulting <a href="http://asmjs.org/spec/latest/">Javascript subset</a>.
<h2>Native Client's NTDLL patch on x86-64 Windows (2012-09-29)</h2>
<p>
Last year, I found a security hole in <a href="http://nativeclient.googlecode.com">Native Client</a> on 64-bit Windows that could be used to escape from the Native Client sandbox. Fortunately I found the hole before Native Client was enabled by default in Chrome. I recently wrote up the <a href="http://src.chromium.org/viewvc/native_client/trunk/src/native_client/documentation/windows_ntdll_patch.txt?revision=9831&view=markup">document below</a> to explain the bug and how we fixed it.
<hr>
<p>Native Client currently relies on a small, process-local, in-memory
patch to NTDLL on 64-bit versions of Windows in order to prevent a
problem that would create a hole in Native Client's x86-64 sandbox.
<h2>The problem</h2>
<p>
The problem arises because Windows does not have an equivalent of
Unix's sigaltstack() system call.
<p>
On Unix, a POSIX signal handler may be registered using signal() or
sigaction(). When a process faults (e.g. as a result of a memory
access error or an illegal instruction), the kernel passes control to
the signal handler that is registered for the process. Normally, the
signal handler is run on the stack of the thread that faulted.
However, if an "alternate signal stack" has been registered using
sigaltstack(), the signal handler will run on that stack instead.
<p>
On Windows, "vectored exception handlers" are similar to Unix signal
handlers, except that a vectored exception handler always gets run on
the thread's current stack. This might have been OK, except that
Windows has a different division of responsibility between kernel and
userland compared with Unix. Windows does more in userland, and this
creates problems for NaCl.
<p>
On Unix, sigaction() is a syscall that user code calls that records a
user-code signal handler function in a kernel data structure. If no
signal handler is registered by user code, then when a fault occurs in
a process, the kernel never passes control back to userland code in
that process.
<p>
On Windows, AddVectoredExceptionHandler() is provided in userland by a
DLL rather than being provided by the kernel.
AddVectoredExceptionHandler() adds the handler function to a list in
userland memory. When a fault occurs in the process, the Windows
kernel passes control to a function called KiUserExceptionDispatcher
in NTDLL, in userland. KiUserExceptionDispatcher checks whether a
handler function was registered via AddVectoredExceptionHandler() (or
via various other mechanisms), and if so, it calls the handler;
otherwise, it terminates the process. The kernel therefore calls
KiUserExceptionDispatcher regardless of whether any vectored exception
handlers were registered.
<p>
This is a problem for NaCl because when we are running sandboxed code,
%rsp points into untrusted address space. If sandboxed code faults on
a bad memory access or a HLT instruction (which it is allowed to do),
Windows calls KiUserExceptionDispatcher, which then runs on the
untrusted stack. Since NaCl allows multi-threading, other sandboxed
threads can modify the untrusted stack that KiUserExceptionDispatcher
is running on.
<p>
Furthermore, KiUserExceptionDispatcher is a complex function which
calls other NTDLL functions internally. These functions return by
using the RET instruction to pop a return address off the stack, which
means that control flow is determined by values stored on the
untrusted stack. An attacker running as sandboxed code in another
thread can modify the memory location containing a return address and
so cause the NTDLL routine to jump to an address of the attacker's
choice. This allows the attacker to cause a jump to a
non-bundle-aligned address, which allows the attacker to escape from
the NaCl sandbox.
<h2>The fix: patching NTDLL</h2>
<p>
Our fix is to patch the KiUserExceptionDispatcher entry point inside
NTDLL so that this routine will safely terminate the process.
<p>
Our patching is process-local: it does not affect the NTDLL.DLL that
other processes use. We only modify the in-memory copy of NTDLL.DLL
that is loaded into the NaCl process.
<p>
Patching is necessary because the address of KiUserExceptionDispatcher
is fixed in the Windows kernel at boot time. We don't have any way to
tell the kernel to use a routine at a different address. The kernel
randomizes the address of NTDLL.DLL at boot time, but thereafter
NTDLL.DLL is loaded at the same address in all processes and the
address of KiUserExceptionDispatcher is fixed.
<p>
What do we do to terminate the process safely? While we could
overwrite the start of KiUserExceptionDispatcher with a HLT
instruction, this instruction will simply cause the kernel to re-run
KiUserExceptionDispatcher again, because the kernel does not track
whether a thread is currently handling an exception. This would cause
the kernel to repeatedly push a stack frame containing saved register
state and re-call KiUserExceptionDispatcher until the thread runs out
of stack. Although this will eventually terminate the process, it
would be slower than we would like.
<p>
So instead, the original version of our patch overwrites the start of
KiUserExceptionDispatcher with:
<pre>
mov $0, %esp
hlt
</pre>
<p>
We set the stack pointer to zero first so that the kernel can't write
another stack frame. We observe that if the kernel is unable to write
a stack frame to memory when a fault occurs, it does not attempt to
call KiUserExceptionDispatcher; it just terminates the process.
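<p>
As an illustration, applying such a process-local patch might look
roughly like this (a sketch, not NaCl's actual code; NaCl's real
implementation lives in the native_client repo):
<pre>
#include <windows.h>
#include <string.h>

static void patch_ki_user_exception_dispatcher(void) {
  /* "mov $0, %esp; hlt", encoded as x86-64 machine code. */
  static const unsigned char patch[] = {
    0xbc, 0x00, 0x00, 0x00, 0x00,  /* mov $0, %esp */
    0xf4                           /* hlt */
  };
  /* KiUserExceptionDispatcher is exported by NTDLL. */
  void *addr = (void *) GetProcAddress(GetModuleHandleA("ntdll.dll"),
                                       "KiUserExceptionDispatcher");
  DWORD old_prot;
  VirtualProtect(addr, sizeof(patch), PAGE_EXECUTE_READWRITE, &old_prot);
  memcpy(addr, patch, sizeof(patch));
  VirtualProtect(addr, sizeof(patch), old_prot, &old_prot);
  FlushInstructionCache(GetCurrentProcess(), addr, sizeof(patch));
}
</pre>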
<h2>Breakpad crash reporting</h2>
<p>
We extended the patch to support crash reporting for crashes in
trusted code. (In Chrome, this uses the Breakpad crash reporting
system.)
<p>
The extended version of the patched KiUserExceptionDispatcher looks at
a thread-local variable to check whether the fault occurred inside
trusted code or sandboxed code. If trusted code faulted, we know we
are safely running on a trusted stack, and we pass control back to the
original unpatched version of KiUserExceptionDispatcher (which
eventually causes the Breakpad crash handler to run and report the
crash). Otherwise, we assume %rsp points to an unsafe untrusted
stack, and we terminate the process using the same "mov $0, %esp; hlt"
sequence as before.
<h2>x86-32 Windows</h2>
<p>
The problem does not arise on 32-bit versions of Windows because of
the way NaCl's x86-32 sandbox uses segment registers.
<p>
On 32-bit Windows, when sandboxed code faults, the kernel notices that
the %ss register contains a non-default value, and it terminates the
process without attempting to write a stack frame or call
KiUserExceptionDispatcher.
<h2>Limitations</h2>
<p>
Our NTDLL patch has a couple of limitations:
<ul>
<li>
We are not using a stable interface. KiUserExceptionDispatcher is
a Windows-internal interface between the Windows kernel and NTDLL,
and it might change in future versions of Windows. We are relying
on undocumented behaviour.
<li>
We don't control what data the Windows kernel writes to the
untrusted stack. This data is momentarily visible to other
sandboxed threads before the process is terminated, and it will be
easily visible in embeddings of NaCl that support shared memory and
multiple processes. We don't think the kernel writes any sensitive
data in this stack frame, and it seems unlikely that it would do so in
the future, but it's still possible.
</ul>
<h2>Possible alternatives</h2>
<p>
There are a couple of alternatives to patching NTDLL:
<ul>
<li>Using the Windows debugging API: It is possible to catch a fault in
sandboxed code before KiUserExceptionDispatcher is run using the
Windows debugging API. We use this API for implementing untrusted
hardware exception handling.
<p>
However, using this API comes at a cost: if a process has a
debugger attached, Windows will temporarily suspend the whole
process each time it creates or exits a thread. This would slow
down thread creation and teardown.
<p>
In practice, we wouldn't want to completely replace the NTDLL patch
with use of the Windows debugging API. For defence-in-depth, we
would want to use both.
<li>Changing the sandbox model: We could change the sandbox model so
that %rsp does not point into untrusted address space. One option
is to use a different register as the untrusted stack pointer, and
disallow use of CALL, PUSH and POP. Another option would be to
have %rsp point to a thread-private stack (outside of the sandbox's
usual address space). Both options would be quite invasive
changes to the sandbox model.
</ul>
<h2>Conclusion</h2>
<p>
Windows runs an internal function on the userland stack whenever a
hardware exception occurs. This behaviour does not appear to be
documented anywhere, which suggests that Windows treats this use of
the userland stack as an implementation detail rather than part of the
interface. For most applications, this use of the userland stack does
not matter. Native Client is unusual in that this stack usage creates
a hole in NaCl's sandbox, which we have to work around.
<p>
In addition to being undocumented, Windows' use of the userland stack
can be surprising to people from a Unix background, given that Unix
signal handlers behave differently. However, when one considers the
complexity of Windows' exception handling interfaces -- which include
Structured Exception Handlers, Vectored Exception Handlers and
UnhandledExceptionFilters -- it makes more sense that this complex
logic would live in userland.
<p>
When it comes to the kernel/userland boundary, what can be an
implementation detail for one person can be an important interface
detail for another. The lesson is that these details are often
critical for the security of a sandbox.
<h2>References</h2>
<p>
The original sandbox hole:
<a href="https://code.google.com/p/nativeclient/issues/detail?id=1633">issue 1633</a>
(filed in April 2011 and fixed in June 2011)
<p>
Breakpad crash reporting support for 64-bit Windows:
<a href="https://code.google.com/p/nativeclient/issues/detail?id=2095">issue 2095</a>
<p>
Untrusted hardware exception handling for 64-bit Windows:
<a href="https://code.google.com/p/nativeclient/issues/detail?id=2536">issue 2536</a>
<p>
Documentation for <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms679274(v=vs.85).aspx">AddVectoredExceptionHandler() on MSDN</a>
<h2>Stack unwinding risks on 64-bit Windows (2011-11-19)</h2>
<p>
Recently, I've been looking at how x86-64 Windows does stack unwinding in 64-bit processes, and I've found some odd behaviour. If the stack unwinder finds a return address on the stack that does not have associated unwind info, it applies a fallback unwind rule that does not make much sense.
<p>
I've been wondering if this could make some x86-64 programs more easily exploitable if they corrupt the stack or if they use x86-64 assembly code that does not have unwind info.
<p>
In pseudocode, the current unwind logic looks something like this:
<pre>
void unwind_stack(struct register_state regs) {
while (???) {
unwind_info = get_unwind_info(regs.rip);
if (unwind_info) {
if (has_exception_handler(unwind_info)) {
// Run exception handler...
}
regs = unwind_stack_frame(unwind_info, regs);
}
else {
<strong>// Fallback case for leaf functions:</strong>
<strong>regs.rip = *(uint64_t *) regs.rsp;</strong>
<strong>regs.rsp += 8;</strong>
}
}
}
</pre>
<p>
The issue is that the fallback case only makes sense for the first iteration of the loop. The fact that it is applied to later iterations is probably just sloppy programming.
<p>
The fallback case is <a href="http://msdn.microsoft.com/en-us/library/8ydc79k6(v=VS.100).aspx">intended to handle "leaf functions"</a>. This means functions that:
<ol>
<li> do not adjust %rsp, and
<li> do not call other functions.
</ol>
<p>
These two properties are related: if a function calls other functions, it must adjust %rsp first, otherwise it does not conform to the x86-64 ABI.
<p>
Since the fallback case is applied repeatedly, the unwinder will happily interpret the stack as a series of return addresses with no gaps between them:
<pre>
...
8 bytes return address
8 bytes return address
8 bytes return address
...
</pre>
<p>
However, those are not valid stack frames in the <a href="http://msdn.microsoft.com/en-us/library/ew5tede7.aspx">x86-64 Windows ABI</a>. A valid stack frame for a non-leaf function Foo() (i.e. a function that calls other functions) looks like this:
<pre>
------------ 16-byte aligned
32 bytes "shadow space" (scratch space for Foo())
8 bytes return address (points into Foo()'s caller)
8 bytes scratch space for Foo()
16*n bytes scratch space for Foo() (for some n >= 0)
32 bytes "shadow space" (scratch space for Foo()'s callees)
------------ 16-byte aligned
</pre>
<p>
This layout comes from two requirements in the Windows x86-64 ABI:
<ol type="a">
<li> %rsp must be 8 mod 16 on entry to a function. (This means %rsp should be 16-byte aligned before a CALL instruction.) This requirement is <a href="http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions">common to Windows and Linux</a>.
<li> The caller must also reserve 32 bytes of "<a href="http://msdn.microsoft.com/en-us/library/zthk2dkh.aspx">shadow space</a>" on the stack above the return address, which the callee can use as scratch space. (The x86-64 ABI on Linux does not have this requirement.)
</ol>
<p>
This means that a function that does this:
<pre>
bar:
call qux
ret
</pre>
cannot be valid, because:
<ul>
<li> it does not align the stack correctly on entry to qux();
<li> more seriously, it does not allocate shadow space, so the return address that bar()'s RET instruction jumps to could have been overwritten by qux().
</ul>
<p>
Yet the Windows unwinder treats the resulting stack frame as valid.
<p>
The upshot of the fallback rule is that when the unwinder reaches a return address that does not have unwind info, it will scan up the stack looking for a value that does have unwind info, considering each 64-bit value in turn. The unwinder seems to lack basic sanity checking, so it does not stop even if it hits a zero value (which clearly cannot be a valid return address).
<p>
This has a tendency to mask mistakes. If you have an assembly function with incorrect or missing unwind info, the unwinder will tend to scan past uninitialised values or scratch data on the stack and recover at a higher-up stack frame.
<p>
The risky part is that the unwinder will be interpreting values as return addresses when they aren't return addresses at all. If the unwinder hits an address whose unwind info has an associated exception handler, the unwinder will jump to the handler. This is extra risky because language-level exceptions (e.g. for C++) and hardware exceptions are handled by the same mechanism on Windows, known as <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms680657(v=vs.85).aspx">Structured Exception Handling (SEH)</a>. Both result in stack unwinding on x86-64. This means that null pointer dereferences trigger stack unwinding. This is unlike Linux, where hardware exceptions don't normally trigger stack unwinding.
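<p>
To illustrate, with MSVC's SEH syntax a hardware fault can be caught by an ordinary __try/__except block; the handler is reached via exactly the unwinding mechanism described above:
<pre>
#include <stdio.h>
#include <windows.h>

int main(void) {
  __try {
    *(volatile int *) 0 = 0;  /* access violation: a hardware exception */
  } __except (EXCEPTION_EXECUTE_HANDLER) {
    printf("caught the access violation via SEH\n");
  }
  return 0;
}
</pre>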
<p>
This means that an attacker might be able to do the following:
<ul>
<li> find a function F containing an exception handler that does something interesting;
<li> guess the address of the function F (more specifically, the address inside F that's inside F's __try block);
<li> find another function G with incorrect or missing unwind info;
<li> arrange for the address of F to be in G's stack frame;
<li> cause a hardware exception (e.g. a null pointer dereference) to occur while G is on the stack;
<li> and therefore cause the exception handler in F to get run.
</ul>
<p>
The x86-64 version of <a href="http://en.wikipedia.org/wiki/MinGW">MinGW</a> (a GNU toolchain for Windows) is susceptible to this, because its version of GCC doesn't generate unwind info. Here's an example of how a null pointer dereference can cause an exception handler to be run:
<p>
prog.c (compile with x86-64 MinGW's gcc):
<pre>
#include <stdio.h>
#include <stdlib.h>
long long get_addr();
void exc_handler() {
printf("in exception handler!\n");
exit(0);
}
void my_func(long long x) {
// This might be necessary to force x to be written to the stack:
// printf("stack=%p\n", &x);
// This should normally kill the process safely, but it doesn't here:
*(int *) 0 = 0;
}
int main() {
my_func(get_addr());
return 0;
}
</pre>
<p>
get_unwind_ret.asm (assemble with Microsoft's x86-64 assembler):
<pre>
PUBLIC get_addr
EXTRN exc_handler:PROC
_TEXT SEGMENT
; Helper function for getting the address inside the function below.
get_addr PROC
lea rax, ret_addr
ret
get_addr ENDP
; Innocent function that uses an exception handler.
blah PROC FRAME :exc_handler
.endprolog
ret_addr LABEL PTR
hlt
blah ENDP
_TEXT ENDS
END
</pre>
<p>
Here's what I get on Cygwin:
<pre>
$ ml64 /c get_unwind_ret.asm
$ x86_64-w64-mingw32-gcc get_unwind_ret.obj prog.c -o prog
$ ./prog
in exception handler!
</pre>
<p>
This works because GCC generates the following code without unwind info:
<pre>
$ x86_64-w64-mingw32-gcc prog.c -S -o -
...
my_func:
pushq %rbp
movq %rsp, %rbp
movq %rcx, 16(%rbp)
movl $0, %eax
movl $0, (%rax)
leave
ret
...
</pre>
<p>
my_func()'s argument is written to the shadow space above the return address that points to main(), but since main() doesn't have unwind info either, the spilled argument gets interpreted as a return address by the unwinder.
<p>
Some conclusions:
<ul>
<li> Be extra careful if you're writing x86-64 assembly code on Windows.
<li> Be careful if you're using GCC to generate x86-64 code on Windows and check whether it is producing unwind info. (There is no problem on x86-32 because Windows does not use stack unwinding for exception handling on x86-32.)
<li> Windows' x86-64 stack unwinder should be changed to be stricter. It should stop if it encounters a return address without unwind info.
</ul>
<h2>ARM cache flushing & doubly-mapped pages (2011-11-17)</h2>
<p>
If you're familiar with the ARM architecture you'll probably know that self-modifying code has to be careful to flush the instruction cache on ARM. (Back in the 1990s, the introduction of the StrongARM with its split instruction and data caches broke a lot of programs on RISC OS.)
</p>
<p>
On ARM Linux there's a syscall, <tt>cacheflush</tt>, for flushing the instruction cache for a range of virtual addresses.
</p>
<p>
This syscall works fine if you map some code as RWX (read+write+execute) and execute it from the same virtual address that you use to modify it. This is how JIT compilers usually work.
</p>
<p>
In the Native Client sandbox, though, for dynamic code loading support, we have code pages that are mapped twice, once as RX (read+execute) and once as RW (read+write). Naively you'd expect that after you write instructions to the RW address, you just have to call <tt>cacheflush</tt> on the RX address. However, that's not enough. <tt>cacheflush</tt> doesn't just clear the i-cache. It also flushes the data cache to ensure that writes are committed to memory so that they can be read back by the i-cache. The two parts are clear if you look at the kernel implementation of <tt>cacheflush</tt>, which does two different MCRs on the address range.
</p>
<p>
I guess the syscall interface was not designed with double-mapped pages in mind, since it doesn't allow the i-cache and d-cache to be flushed separately. For the time being, Native Client will have to call <tt>cacheflush</tt> on both the RW and RX mappings.
</p>
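<p>
A sketch of the resulting double flush, using GCC's __builtin___clear_cache() (which invokes the <tt>cacheflush</tt> syscall on ARM Linux); <tt>rw_view</tt> and <tt>rx_view</tt> are assumed to be two virtual mappings of the same physical pages:
</p>
<pre>
#include <stdint.h>
#include <string.h>

void install_code(uint8_t *rw_view, uint8_t *rx_view,
                  const uint8_t *code, size_t size) {
  memcpy(rw_view, code, size);
  /* Flush the d-cache for the RW view so the writes reach memory... */
  __builtin___clear_cache((char *) rw_view, (char *) rw_view + size);
  /* ...and invalidate the i-cache lines for the RX view, which is
     the address the code will actually execute from. */
  __builtin___clear_cache((char *) rx_view, (char *) rx_view + size);
}
</pre>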
<p>
See <a href="http://code.google.com/p/nativeclient/issues/detail?id=2443">NaCl issue 2443</a> for where this came up.
</p>
<h2>Fixing the trouble with Buildbot (2011-08-23)</h2>
Last year I wrote a blog post, <a href="http://lackingrhoticity.blogspot.com/2010/05/trouble-with-buildbot.html">"The trouble with Buildbot"</a>, about how Buildbot creates a dilemma for complex projects because it forces you to choose between two ways of describing a project's build steps:
<ul>
<li> You can describe build steps in the Buildbot config. Buildbot configs are awkward to update -- someone has to restart the Buildbot master -- and hard to test, but you get the benefit that the build steps appear as separate steps in Buildbot's display.
<li> You can write a script which runs the build steps directly, and check it into the same repository as your project. This is easier to maintain and test, but traditionally all the output from the script would appear as a single Buildbot build step, making the output hard to read.
</ul>
<p>
Fortunately, Brad Nelson has addressed this problem with an extension to Buildbot known as <a href="https://sites.google.com/a/chromium.org/dev/developers/testing/chromium-build-infrastructure/buildbot-annotations">"Buildbot Annotations"</a>. The Python code for this currently lives in <a href="http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/master/chromium_step.py?revision=97010&view=markup">chromium_step.py</a> (see AnnotatedCommand).
<p>
The idea is that your checked-in script will run multiple steps sequentially but output tags between them (e.g. "@@@BUILD_STEP tests@@@") so that the output can be parsed into chunks by the Buildbot master, and displayed as separate chunks.
<p>
For example, an early version of Native Client's Annotations-based buildbot script looked something like this:
<pre>
...
echo @@@BUILD_STEP gyp_compile@@@
make -C .. -k -j12 V=1 BUILDTYPE=${GYPMODE}
echo @@@BUILD_STEP scons_compile${BITS}@@@
./scons -j 8 -k --verbose ${GLIBCOPTS} --mode=${MODE}-host,nacl \
platform=x86-${BITS}
echo @@@BUILD_STEP small_tests${BITS}@@@
./scons -k --verbose ${GLIBCOPTS} --mode=${MODE}-host,nacl small_tests \
platform=x86-${BITS} ||
{ RETCODE=$? && echo @@@STEP_FAILURE@@@;}
...
</pre>
<p>
(More recently, this shell script has been replaced with a Python script.)
<p>
You can see this in use on the <a href="http://build.chromium.org/p/client.nacl/console">Native Client Buildbot</a> page (and also on the <a href="http://build.chromium.org/p/tryserver.nacl/waterfall">trybot</a> page, though that's less readable).
<p>
The logic for running NaCl's many build steps -- including a clobber step, a Scons build, a Gyp build, small_tests, medium_tests, large_tests, chrome_browser_tests etc. -- used to live in the Buildbot config, and we'd usually have to get Brad Nelson to change it on our behalf. Brad would have to restart the buildbot masters manually, and this would halt any builds that were in progress, including trybot jobs.
<p>
Now the knowledge of these build steps has moved into scripts that are checked into the native_client repo, which can easily be updated. We can change the scripts at the same time as changing other code, with an atomic SVN commit. Changes can be tested via the trybots.
<p>
Chromium is not using Buildbot Annotations yet for its buildbot, but it would be good to switch it over. One obstacle is timeout handling. Buildbot's build steps can have separate timeouts, and the Buildbot build slave is responsible for terminating a build step's subprocess(es) if they take too long. With Buildbot Annotations, the responsibility for doing per-step timeouts would move to the checked-in build script.
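<p>
For example, the checked-in script could impose its own per-step limits with coreutils' <code>timeout</code> (a sketch, not what Chromium or NaCl actually does; the 20-minute limit is made up):
<pre>
echo @@@BUILD_STEP small_tests@@@
timeout 1200 ./scons -k --verbose small_tests ||
  { RETCODE=$? && echo @@@STEP_FAILURE@@@;}
</pre>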
<p>
The current Annotations output format has some down sides:
<ul>
<li> The syntax is simple but kind of ugly.
<li> It's not possible to nest build steps.
<li> It's not possible to interleave output from two concurrent build steps.
</ul>
<p>
Overall, Annotations reduces our dependence on Buildbot. If there were a simpler, more scalable alternative to Buildbot that also supported the Annotations format, we could easily switch to it, because our Buildbot config is not as complex as it used to be.
<h2>Cookies versus the Chrome sandbox (2011-02-10)</h2>
<p>
Although Chrome's sandbox does not protect one web site from another in general, it can provide such protection in some cases. Those cases are ones in which HTTP cookies are either reduced in scope or not used at all. One lesson we could draw from this is that cookies reduce the usefulness of Chrome's sandbox.
</p>
<p>
The scenario we are exploring supposes that there is a vulnerability in Chrome's renderer process, and that the vulnerability lets a malicious site take control of the renderer process. This means that all the restrictions that are normally enforced on the malicious site by the renderer process are stripped away, and all we are left with are the restrictions enforced on the renderer process by the Chrome browser process and the Chrome sandbox.
</p>
<p>
In my <a href="http://lackingrhoticity.blogspot.com/2010/12/chrome-sandbox-common-misconception.html">previous blog post</a>, I explained how an attacker site, evil.com, that manages to exploit the renderer process could steal the login cookies from another site, mail.com, and so gain access to the user's e-mail.
</p>
<p>
The attack is made possible by the combination of two features:
</p>
<ol type="a">
<li> cookies
<li> frames
</ol>
<p>
Chrome currently runs a framed page in the same renderer process as the parent page. HTML standards allow framed pages to access cookies, so the browser process has to give the renderer process access to the cookies for both pages.
</p>
<p>
Because this problem arises from the interaction of these features, one site is not <em>always</em> vulnerable to other sites. There should be a couple of ways that users and sites can mitigate the problem, without changing Chrome. Firstly, the user can change how cookies are scoped within the browser by setting up multiple profiles. Secondly, a site can skirt around the problem by not using cookies at all. We discuss these possibilities below.
</p>
<ul>
<li><strong>Use multiple profiles:</strong> As a user, you can create multiple browser profiles, and access mail.com and evil.com in separate profiles.
<p>
Chrome does not make this very easy at the moment. It provides a command line option (--user-data-dir) for creating more profiles, but this feature is not available through the GUI. Chrome's GUI provides just two profiles: one profile (persistent) for normal windows, plus an extra profile (non-persistent) for windows in "incognito" mode.
</p>
<p>
So, you could log in to mail.com in a normal window and view evil.com in an incognito window, or vice versa. This is safer because cookies registered by a web site in one profile are not accessible to sites visited in another profile. Each profile has a separate pool of cookies. This feature of browser profiles means you can log into one mail.com account in incognito mode and a different mail.com account in normal mode.
</p>
<p>
It would be interesting to see if this profile separation could be automated.
</p>
<li><strong>Use web-keys instead of cookies:</strong> The developers of mail.com could defend against evil.com by not using cookies to track user logins. Instead mail.com could use web-keys. Logging in to mail.com would take you to a URL containing an unguessable token. Such a URL is known as a "<a href="http://waterken.sourceforge.net/web-key/">web-key</a>". (For <a href="http://waterken.sourceforge.net/web-key/#where">technical reasons</a>, the unguessable token should be in the "#" fragment part of the URL.)
</p>
<p>
This is safer because even if evil.com compromises the renderer process, the Chrome browser process generally does not give the renderer process a way to enumerate other tabs, discover other tabs' URLs, or enumerate the user's browsing history and bookmarks.
</p>
<p>
Using web-keys has been proposed before to address a different (but related) web security problem, clickjacking. (See Tyler Close's essay, <a href="http://waterken.sourceforge.net/clickjacking/">The Confused Deputy Rides Again</a>.)
</p>
<p>
Using web-keys would change the user interface for mail.com a little. Cookies are the mechanism by which entering "mail.com" in the URL bar can take you directly to the inbox of the e-mail account you are logged in to. Whenever the browser sees "mail.com" as the domain name it automatically adds the cookies for "mail.com" to the HTTP request. (This automatic attachment of credentials makes this a type of <a href="http://en.wikipedia.org/wiki/Ambient_authority">ambient authority</a>.) This mechanism adds some convenience for the user, but it is also a means by which evil.com can attack mail.com, because whenever evil.com links to mail.com, the cookies get added to the request too. The browser does not distinguish between a URL entered by the user in the address bar and a URL provided by another site like evil.com.
</p>
<p>
So if mail.com were changed to remove its use of cookies, you would lose the small convenience of being able to get to your inbox directly by typing "mail.com" without re-logging-in. Is there a way to get this convenience back? An alternative to using cookies to record logins is to use bookmarks. If mail.com uses web-key URLs, you could bookmark the address of the inbox page. To get to your inbox without re-logging-in, you would select the bookmark.
</p>
<p>
These days the Chrome address bar accepts more than just URLs (which is why the address bar is actually called the "omnibar"), so you could imagine that entering "mail.com" would search your bookmarks and jump to your bookmarked inbox page rather than being treated as the URL "http://mail.com".
</p>
</ul>
<h2>Conclusions</h2>
<p>
There are a couple of possible conclusions we could draw from this:
</p>
<ul>
<li> Cookies weaken the Chrome sandbox.
<li> Frames weaken the Chrome sandbox.
</ul>
<p>
If we blame cookies, we should look critically at other browser features that behave similarly to cookies by providing ambient authority. For example, the new <a href="http://www.w3.org/TR/file-system-api/">LocalFileSystem</a> feature (an extension of the <a href="http://www.w3.org/TR/FileAPI/">File API</a>) provides local storage. The file store it provides is per-origin and so is based on ambient authority. If mail.com uses this to cache e-mail, and evil.com exploits the renderer process, then evil.com will be able to read and modify the cached e-mail. There are other local storage APIs (<a href="http://www.w3.org/TR/IndexedDB/">IndexedDB</a> and <a href="http://dev.w3.org/html5/webstorage/">Web Storage</a>), but they are based on ambient authority too. From this perspective, the situation is getting worse.
</p>
<p>
If we blame frames, this suggests that browsers should fix the problem by implementing site isolation. Site isolation means that the browser would put a framed page in a different renderer process from a parent page. Microsoft's experimental <a href="http://en.wikipedia.org/wiki/Gazelle_(web_browser)">Gazelle</a> browser implements site isolation but breaks compatibility with the web. It remains to be seen whether a web browser can implement site isolation while retaining compatibility and good performance.
</p>
<p>
Either way, concerned users and web app authors need to know how browser features are implemented if they are to judge how much security the browser can provide. That's not easy, because the web platform is so complicated!
</p>
<h2>A common misconception about the Chrome sandbox (2010-12-21)</h2>
A common misconception about the Chrome web browser is that its sandbox protects one web site from another.
<p>
For example, suppose you are logged into your e-mail account on mail.com in one tab, and have evil.com open in another tab. Suppose evil.com finds an exploit in the renderer process, such as a memory safety bug, that lets it run arbitrary code there. Can evil.com get hold of your HTTP cookies for mail.com, and thereby access your e-mail account?
<p>
Unfortunately, the answer is yes.
<p>
The reason is that mail.com and evil.com can be assigned to the same renderer process. The browser does not only do this to save memory. evil.com can cause this to happen by opening an iframe on mail.com. With mail.com's code running in the same exploited renderer process, evil.com can take it over and read the cookies for your mail.com account and use them for its own ends.
<p>
There are a couple of reasons why the browser puts a framed site in the same renderer process as the parent site. Firstly, if the sites were handled by separate processes, the browser would have to do costly compositing across renderer processes to make the child frame appear inside the parent frame. Secondly, in some cases the DOM allows Javascript objects in one frame to obtain references to DOM objects in other frames, even across origins, and it is easier for this to be managed within one renderer process.
<p>
I don't say this to pick on Chrome, of course. It is better to have the sandbox than not to have it.
<p>
Chrome has never claimed that the sandbox protects one site against another. In the tech report <a href="http://seclab.stanford.edu/websec/chromium/">"The Security Architecture of the Chromium Browser"</a> (Barth, Jackson, Reis and the Chrome Team; 2008), "Origin isolation" is specifically listed under "Out-of-scope goals". They state that "an attacker who compromises the rendering engine can act on behalf of any web site".
<p>
There are a couple of ways that web sites and users can mitigate this problem, which I'll discuss in another post. However, in the absence of those defences, what Chrome's multi-process architecture actually gives you is the following:
<ul>
<li> <em>Robustness if a renderer crashes.</em> Having multiple renderer processes means that a crash of one takes down only a limited number of tabs, and the browser and the other renderers will survive. It also helps memory management.
<p>
But we can get this without sandboxing the renderers.
<li> <em>Protection of the rest of the user's system from vulnerabilities in the renderer process.</em> For example, the sandboxed renderer cannot read any of the user's files, except for those the user has granted through a "File Upload" file chooser.
<p>
But we can get this by sandboxing the whole browser (including any subprocesses), without needing to have the browser separated from the renderer.
<p>
For example, since 2007 I have been running Firefox under <a href="http://plash.beasts.org">Plash</a> (a sandbox), on Linux.
<p>
In principle, such a sandbox should be more effective at protecting applications and files outside the browser than the Chrome sandbox, because the sandbox covers all of the browser, including its network stack and the so-called browser "chrome" (this means the parts of the GUI outside of the DOM).
<p>
In practice, Plash is not complete as a sandbox for GUI apps because it does not limit access to the X Window System, so apps can do things that X allows such as screen scraping other apps and sending them input.
</ul>
<p>
The main reason Chrome was developed to sandbox its renderer processes but not the whole browser is that this is easier to implement with sandboxing technologies that are easily deployable today. Ideally, though, the whole browser would be sandboxed. One of the only components that would stay unsandboxed, and have access to all the user's files, would be the "File Open" dialog box for choosing files to upload.
<h2>When printf debugging is a luxury (2010-12-18)</h2>
Inserting printf() calls is often considered to be a primitive fallback when other debugging tools are not available, such as stack backtraces with source line numbers.
<p>
But there are some situations in low-level programming where most libc calls don't work and so even printf() and assert() are unavailable luxuries. This can happen:
<ul>
<li> when libc is not properly initialised yet;
<li> when we are writing code that is called by libc and cannot re-enter libc code;
<li> when we are in a signal handler;
<li> when only limited stack space is available;
<li> when we cannot allocate memory for some reason; or
<li> when we are not even linked to libc.
</ul>
<p>
Here's a fragment of code that has come in handy in these situations. It provides a simple assert() implementation:
<pre>
#include <string.h>
#include <unistd.h>
static void debug(const char *msg) {
write(2, msg, strlen(msg));
}
static void die(const char *msg) {
debug(msg);
_exit(1);
}
#define TO_STRING_1(x) #x
#define TO_STRING(x) TO_STRING_1(x)
#define assert(expr) do { \
    if (!(expr)) die("assertion failed at " __FILE__ ":" TO_STRING(__LINE__) \
                     ": " #expr "\n"); \
  } while (0)
</pre>
<p>
By using preprocessor trickery to construct the assertion failure string at compile time, it avoids having to format the string at runtime. So it does not need to allocate memory, and it doesn't need to do multiple write() calls (which can become interleaved with other output in the multi-threaded case).
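<p>
For example, a failing <code>assert(ptr != NULL)</code> on line 42 of myfile.c (a made-up location) would produce the single message "assertion failed at myfile.c:42: ptr != NULL", emitted with one write() call.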
<p>
Sometimes even libc's write() is a luxury. In some builds of GNU libc on Linux, glibc's syscall wrappers use the TLS register (%gs on i386) to fetch the address of a routine for making syscalls.
<p>
However, if %gs is not set up properly for some reason, this will fail. For example, for <a href="http://code.google.com/p/nativeclient">Native Client</a>'s i386 sandbox, %gs is set to a different value whenever sandboxed code is running, and %gs stays in this state if sandboxed code faults and triggers a signal handler. In Chromium's <a href="http://code.google.com/p/seccompsandbox">seccomp-sandbox</a>, %gs is set to zero in the trusted thread.
<p>
In those situations we have to bypass libc and do the system calls ourselves. The following snippet comes from <a href="http://code.google.com/p/seccompsandbox/source/browse/trunk/reference_trusted_thread.cc?r=153">reference_trusted_thread.cc</a>. The sys_*() functions are defined by <a href="http://code.google.com/p/seccompsandbox/source/browse/trunk/linux_syscall_support.h?r=153">linux_syscall_support.h</a>, which provides wrappers for many Linux syscalls:
<pre>
#include "linux_syscall_support.h"
void die(const char *msg) {
sys_write(2, msg, strlen(msg));
sys_exit_group(1);
}
</pre>
<h2>An introduction to FreeBSD-Capsicum (2010-11-04)</h2>
In my <a href="http://lackingrhoticity.blogspot.com/2010/10/process-descriptors-in-freebsd-capsicum.html">last blog post</a>, I described one of the features in <a href="http://www.cl.cam.ac.uk/research/security/capsicum/">FreeBSD-Capsicum</a>: process descriptors. Now it's time for an overview of Capsicum.
<p>
Capsicum is a set of new features for FreeBSD that adds better support for sandboxing, using a capability model in which the capabilities are Unix file descriptors (FDs).
<p>
Capsicum takes a fairly conservative approach, in that it does not make operations on file descriptors virtualisable. This approach has some limitations -- we do not get the advantages of having purely <a href="http://lackingrhoticity.blogspot.com/2010/01/why-system-calls-should-be-message.html">message-passing syscalls</a>. However, it does mean that the new features are orthogonal.
<p>
The main new features are:
<ul>
<li>
<em>A per-process "capability mode"</em>, which is turned on via a new cap_enter() syscall.
<p>
This mode disables any system call that provides ambient authority. So it disables system calls that use global namespaces, including the file namespace (e.g. open()), the PID namespace (e.g. kill()) and the network address namespace (e.g. connect()).
<p>
This is not just a syscall filter, though. Some system calls optionally use a global namespace. For example, sendmsg() and sendto() optionally take a socket address. For openat(), the directory FD can be omitted. Capability mode disables those cases.
<p>
Furthermore, capability mode disallows the use of ".." (parent directory) in filenames for openat() and the other *at() calls. This changes directory FDs to be limited-authority objects that convey access to a specific directory and not the whole filesystem. (It is interesting that this appears to be a property of the process, via capability mode, rather than of the directory FD itself.)
<p>
Capability mode is inherited across fork and exec. (A sketch of entering capability mode appears after this list.)
<li>
<em>Finer-grained permissions for file descriptors.</em> Each FD gets a large set of permission bits. A less-permissive copy of an FD can be created with cap_new(). For example, you can have read-only directory FDs, or non-seekable FDs for files.
<li>
<em><a href="http://lackingrhoticity.blogspot.com/2010/10/process-descriptors-in-freebsd-capsicum.html">Process descriptors.</a></em> Capsicum doesn't allow kill() inside the sandbox because kill() uses a global namespace (the PID namespace). So Capsicum introduces process descriptors (a new FD type) as a replacement for process IDs, and adds pdfork(), pdwait() and pdkill() as replacements for fork(), wait() and kill().
</ul>
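<p>
To make capability mode concrete, here is a minimal sketch (the header name follows the original Capsicum patches and may differ; the file names are made up):
<pre>
#include <sys/capability.h>   /* Capsicum's cap_enter() */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
  /* Acquire a directory FD while we still have ambient authority. */
  int dir_fd = open("/var/tmp", O_RDONLY);
  if (dir_fd < 0)
    err(1, "open");
  if (cap_enter() < 0)
    err(1, "cap_enter");
  /* open() would now fail, but openat() relative to a directory FD
     we already hold still works (and ".." is disallowed). */
  int fd = openat(dir_fd, "data.txt", O_RDONLY);
  if (fd < 0)
    err(1, "openat");
  close(fd);
  close(dir_fd);
  return 0;
}
</pre>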
<p>
Plus there are a couple of smaller features:
<ul>
<li><em>Message-based sockets.</em> The Capsicum guys implemented Linux's SOCK_SEQPACKET interface for FreeBSD.
<li><em>An fexecve() system call</em> which takes a file descriptor for an executable. This replaces execve(), which is disabled in capability mode because execve() takes a filename.
<p>
Capsicum's fexecve() ignores the implicit filename that is embedded in the executable's PT_INTERP field, so it is only good for loading the dynamic linker directly or for loading other statically linked executables.
</ul>
<p>
Currently, the only programs that run under Capsicum are those that have been ported specially:
<ul>
<li>The Capsicum guys ported Chromium, and it works much the same way as on Linux. On both systems, Chromium's renderer process runs sandboxed, but the browser process does not. On both systems, Chromium needs to be able to turn on sandboxing after the process has started up, because it relies on legacy libraries that use open() during startup.
<li>Some Unix utilities, including gzip and dhclient, have been extended to use sandboxing internally (privilege separation). Like Chromium, gzip can open files and then switch to capability mode.
</ul>
<p>
However, it should be possible to run legacy Unix programs under Capsicum by porting <a href="http://plash.beasts.org">Plash</a>.
<p>
At first glance, it looks like Plash would have to do the same tricks under FreeBSD-Capsicum as it does under Linux to run legacy programs. Under Linux, Plash uses a <a href="http://plash.beasts.org/wiki/PlashGlibc">modified version of glibc</a> in order to intercept its system calls and convert them to system calls that work in the sandbox. That's because the Linux kernel doesn't provide any help with intercepting the system calls. The situation is similar under FreeBSD -- Capsicum does not add any extensions for bouncing syscalls back to a user space handler.
<p>
However, there are two aspects of FreeBSD that should make Plash easier to implement there than on Linux:
<ul>
<li>FreeBSD's libc is friendlier towards overriding its functions. On both systems, it is possible to override (for example) open() via an LD_PRELOAD library that defines its own "open" symbol. But with glibc on Linux, this doesn't work for libc's internal calls to open(), such as from fopen(). For a small gain in efficiency, these calls don't go through PLT entries and so cannot be intercepted.
<p>FreeBSD's libc doesn't use this optimisation and so it allows the internal calls to be intercepted too.
<li>FreeBSD's dynamic linker and libc are not tightly coupled, so it is possible to change the dynamic linker to open its libraries via IPC calls without having to rebuild libc in lockstep.
<p>In contrast, Linux glibc's ld.so and libc.so are built together, share some data structures (such as TLS), and cannot be replaced independently.
</ul>
<h2>Process descriptors in FreeBSD-Capsicum (2010-10-23)</h2>
<a href="http://www.cl.cam.ac.uk/research/security/capsicum/">Capsicum</a> is a set of new features for FreeBSD that adds better support for sandboxing, adding a capability mode in which the capabilities are Unix file descriptors (FDs). The features Capsicum adds are orthogonal, which is nice. One of the new features is <a href="http://www.cl.cam.ac.uk/research/security/capsicum/man/pdfork.2.pdf">process descriptors</a>.
<p>
Capsicum adds a replacement for fork() called pdfork(), which returns a process descriptor (a new type of FD) rather than a PID. Similarly, there are replacements for wait() and kill() -- pdwait() and pdkill() -- which take FDs as arguments instead of PIDs.
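<p>
A sketch of the parent/child pattern this enables (the header name and the use of poll() here are my assumptions, not taken from the post):
<pre>
#include <sys/procdesc.h>   /* assumed header for pdfork() */
#include <err.h>
#include <poll.h>
#include <unistd.h>

int main(void) {
  int pd;
  pid_t pid = pdfork(&pd, 0);
  if (pid < 0)
    err(1, "pdfork");
  if (pid == 0) {
    /* Child: do some work, then exit. */
    _exit(0);
  }
  /* Parent: wait for the child by polling the process descriptor
     instead of using wait() or SIGCHLD. */
  struct pollfd pfd = { .fd = pd, .events = POLLHUP };
  if (poll(&pfd, 1, -1) < 0)
    err(1, "poll");
  close(pd);  /* with kill-on-close, dropping the last reference
                 to a still-running child would terminate it */
  return 0;
}
</pre>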
<p>
The reason for the new interface is that kill() is not safe to allow in Capsicum's sandbox, because it provides ambient authority: it looks up its PID argument in a global namespace.
<p>
But even if you ignore sandboxing issues, this new interface is a significant improvement on POSIX process management:
<ol>
<li> It allows the ability to wait on a process to be delegated to another process. In contrast, with wait()/waitpid(), a process's exit status can only be read by the process's parent.
<li> Process descriptors can be used with poll(). This avoids the awkwardness of having to use SIGCHLD, which doesn't work well if multiple libraries within the same process want to wait() for child processes.
<li> It gets rid of the race condition associated with kill(). Sending a signal to a PID is dodgy because the original process with this PID could have exited, and the kernel could have recycled the PID for an unrelated process, especially on a system where processes are spawned and exit frequently.
<p>
kill() is only really safe when used by a parent process on its child, and only when the parent makes sure to use it before wait() has returned the child's exit status. pdkill() gets rid of this problem.
<li> In future, process descriptors can be extended to provide access to the process's internal state for debugging purposes, e.g. for reading registers and memory, or modifying memory mappings or the FD table. This would be an improvement on Linux's ptrace() interface.
</ol>
<p>
However, there is one aspect of Capsicum's process descriptors that I think was a mistake: Dropping the process descriptor for a process causes the kernel to kill it. (By this I mean that if there are no more references to the process descriptor, because they have been close()'d or because the processes holding them have exited, the kernel will terminate the process.)
<p>
The usual principle of garbage collection in programming languages is that GC should not affect the observable behaviour of the program (except for resource usage). Capsicum's kill-on-close() behaviour violates that principle.
<p>
kill-on-close() is based on the assumption that the launcher of a subprocess always wants to check the subprocess's exit status. But this is not always the case. Exit status is just one communications channel, but not one we always care about. It's common to launch a process to communicate with it via IPC without wanting to have to wait() for it. (Double fork()ing is a way to do this in POSIX without leaving a zombie process.) Many processes are fire-and-forget.
<p>
The situation is analogous for threads. pthreads provides pthread_detach() and PTHREAD_CREATE_DETACHED as a way to launch a thread without having to pthread_join() it.
<p>
In Python, when you create a thread, you get a Thread object back, but if the Thread object is GC'd the thread won't be killed. In fact, finalisation of a thread object is implemented using pthread_detach() on Unix.
<p>
I know of one language implementation that (as I recall) will GC threads, <a href="http://en.wikipedia.org/wiki/Oz_(programming_language)">Mozart/Oz</a>, but this only happens when a thread can have no further observable effects. In Oz, if a thread blocks waiting for the resolution of a logic variable that can never be resolved (because no other thread holds a reference to it), then the thread can be GC'd. So deadlocked threads can be GC'd just fine. Similarly, if a thread holds no communications channels to the outside world, it can be GC'd safely.
<p>
Admittedly, Unix already violates this GC principle by the finalisation of pipe FDs and socket FDs, because dropping one endpoint of the socket/pipe pair is visible as an EOF condition on the other endpoint. However, this is usually used to free up resources or to unblock a process that would otherwise fail to make progress. Dropping a process descriptor would halt progress -- the opposite. Socket/pipe EOF can be used to do this too, but it is less common, and I'm not sure we should encourage this.
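<p>
The pipe case is easy to demonstrate: dropping the last reference to the write end of a pipe is observable as an EOF condition on the read end. A minimal sketch:
<pre>#include <stdio.h>
#include <unistd.h>

int main() {
  int fds[2];
  pipe(fds);
  if (fork() == 0) {
    /* Child: exiting implicitly finalises its copies of both pipe
       FDs. */
    _exit(0);
  }
  close(fds[1]);  /* Parent drops its own write end. */
  char buf[1];
  /* read() returns 0 (EOF) once no references to the write end exist
     anywhere in the system -- finalisation with an observable
     effect. */
  ssize_t got = read(fds[0], buf, sizeof(buf));
  printf("read() returned %zd\n", got);
  return 0;
}</pre>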
<p>
However, aside from this complaint, process descriptors are a good idea. It would be good to see them implemented in Linux.Mark Seabornhttp://www.blogger.com/profile/08046205947658697263noreply@blogger.com1tag:blogger.com,1999:blog-3888382143307542639.post-92021391007679671312010-08-11T12:49:00.005+01:002010-08-11T13:01:32.633+01:00My workflow with git-cl + RietveldGit's model of changes (which is shared by Mercurial, Bazaar and Monotone) makes it awkward to revise earlier patches. This can make things difficult when you are sending out multiple, dependent changes for code review.
<p>
Suppose I create changes A and B. B depends functionally on A, i.e. tests will not pass for B without A also being applied. There might or might not be a textual dependency (B might or might not modify lines of code modified by A).
<p>
Because code review is slow (high latency), I need to be able to send out changes A and B for review and still be able to continue working on further changes. But I also need to be able to revisit A to make changes to it based on review feedback, and then make sure B works with the revised A.
<p>
What I do is create separate branches for A and B, where B branches off A. To revise change A, I "git checkout" its branch and add further commits. Later I can update B by checking it out and rebasing it onto the current tip of A. Uploading A or B to the review system, or committing A or B upstream (to SVN), involves squashing the branch's commits into one commit. (This squashing means the branches contain micro-history that reviewers don't see and which is not kept after changes are pushed upstream.)
<p>
The review system in question is Rietveld, the code review web app used for Chromium and Native Client development. Rietveld does not have any special support for patch series -- it is only designed to handle one patch at a time, so it does not know about dependencies between changes. The tool for uploading changes from Git to Rietveld and later committing them to SVN is "git-cl" (part of depot_tools).
<p>
git-cl is intended to be used with one branch per change-under-review. However, it does not have much support for handling changes which depend on each other.
<p>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhomJ0zJvgpvtjO9vWQ7Aocw5ZzBRGHNSxnFMNVtTfXN85sKOdvEDgX-g4lyxqp_myK4GcSo_xVPPpNxoeUtFzTK07_oNFYfuccN1PaTYEEZeiDTRX6n0s5KIyu9JdYvWUl2GNVt69-UaY1/s1600/gitk-screenshot.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 186px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhomJ0zJvgpvtjO9vWQ7Aocw5ZzBRGHNSxnFMNVtTfXN85sKOdvEDgX-g4lyxqp_myK4GcSo_xVPPpNxoeUtFzTK07_oNFYfuccN1PaTYEEZeiDTRX6n0s5KIyu9JdYvWUl2GNVt69-UaY1/s400/gitk-screenshot.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5504120642458780562" /></a>
<p>
This workflow has a lot of problems:
<ul>
<li> When using git-cl on its own, I have to manually keep track that B is to be rebased onto A. When uploading B to Rietveld, I must do "git cl upload A" (telling git-cl to diff against branch A rather than against the upstream branch). When updating B, I must first do "git rebase A". When diffing B, I have to do "git diff A". (I have written a tool to do this. It's not very good, but it's better than doing it manually.)
<li> Rebasing B often produces conflicts if A has been squash-committed to SVN. That's because if branch A contained multiple patches, Git doesn't know how to skip over patches from A that are in branch B.
<li> Rebasing loses history. Undoing a rebase is not easy.
<li> In the case where B doesn't depend on A, rebasing branch B so that it doesn't include the contents of branch A is a pain. (Sometimes I will stack B on top of A even when it doesn't depend on A, so that I can test the changes together. An alternative is to create a temporary branch and "git merge" A and B into it, but creating further branches adds to the complexity.)
<li> If there is a conflict, I don't find out about it until I check out and update the affected branch.
<li> This gets even more painful if I want to maintain changes that are not yet ready for committing or posting for review, and apply them alongside changes that are ready for review.
</ul>
<p>
These are all reasons why I would not recommend this workflow to someone who is not already very familiar with Git.
<p>
The social solution to this problem would be for code reviews to happen faster, which would reduce the need to stack up changes. If all code reviews reached a conclusion within 24 hours, that would be an improvement. But I don't think that is going to happen.
<p>
The technical solution would be better patch management tools. I am increasingly thinking that Darcs' set-of-patches model would work better for this than Git's DAG-of-commits model. If I could set individual patches to be temporarily applied or unapplied to the working copy, and reorder and group patches, I think it would be easier to revisit changes that I have posted for review.Mark Seabornhttp://www.blogger.com/profile/08046205947658697263noreply@blogger.com3tag:blogger.com,1999:blog-3888382143307542639.post-45589451853237010772010-08-06T21:12:00.002+01:002010-08-06T21:41:55.989+01:00CVS's problems resurface in GitAlthough modern version control systems have improved a lot on CVS, I get the feeling that there is a fundamental version control problem that the modern VCSes (Git, Mercurial, Bazaar, and I'll include Subversion too!) haven't solved. The curious thing is that CVS had sort of made some steps towards addressing it.
<p>
In CVS, history is stored per file. If you commit a change that crosses multiple files, CVS updates each file's history separately. This causes a bunch of problems:
<ul>
<li> CVS does not represent changesets or snapshots as first class objects. As a result, many operations involve visiting every file's history.
<p> Reconstructing a changeset involves searching all files' histories to match up the individual file changes. (This was just about possible, though I hear there are tricky corner cases. Later CVS added a commit ID field that presumably helped with this.)
<p>
Creating a tag at the latest revision involves adding a tag to every file's history. Reconstructing a tag, or a time-based snapshot, involves visiting every file's history again.
<li> CVS does not represent file renamings, so the standard history tools like "cvs log" and "cvs annotate" are not able to follow a file's history from before it was renamed.
</ul>
<p>
In the DAG-based decentralised VCSes (Git, Mercurial, Monotone, Bazaar), history is stored per repository. The fundamental data structure for history is a Directed Acyclic Graph of commit objects. Each commit points to a snapshot of the entire file tree plus zero or more parent commits. This addresses CVS's problems:
<ul>
<li> Extracting changesets is easy because they are the same thing as commit objects.
<li> Creating a tag is cheap and easy. Recording any change creates a commit object (a snapshot-with-history), so creating a tag is as simple as pointing to an already-existing commit object.
</ul>
<p>
However, often it is not practical to put all the code that you're interested in into a single Git repository! (I pick on Git here because, of the DAG-based systems, it is the one I am most familiar with.) While it <em>can</em> be practical to do this with Subversion or CVS, it is less practical with the DAG-based decentralised VCSes:
<ul>
<li> In the DAG-based systems, branching is done at the level of a repository. You cannot branch and merge subdirectories of a repository independently: you cannot create a commit that only partially merges two parent commits.
<li> Checking out a Git repository involves downloading not only the entire current revision, but also the entire history. This creates pressure against putting two partially-related projects together in the same repository, especially if one of the projects is huge.
<li> Existing projects might already use separate repositories. It is usually not practical to combine those repositories into a single repository, because that would create a repo that is incompatible with the original repos. That would make it difficult to merge upstream changes. Patch sharing would become awkward because the filenames in patches would need fixing.
</ul>
<p>
This all means that when you start projects, you have to decide how to split your code among repositories. Changing these decisions later is not at all straightforward.
<p>
The result of this is that CVS's problems have not <em>really</em> been solved: they have just been pushed up a level. The problems that occurred at the level of individual files now occur at the level of repositories:
<ul>
<li> The DAG-based systems don't represent changesets that cross repositories. They don't have a type of object for representing a snapshot across repositories.
<li> Creating a tag across repositories would involve visiting every repository to add a tag to it.
<li> There is no support for moving files between repositories while tracking the history of the file.
</ul>
<p>
The funny thing is that, because CVS hit this problem all the time, the CVS tools ended up better at dealing with multiple histories than Git's are.
<p>
To compare the two, imagine that instead of putting your project in a single Git repository, you put each one of the project's files in a separate Git repository. This would result in a history representation that is roughly equivalent to CVS's, i.e. every file has its own separate history graph.
<ul>
<li> To check in changes to multiple files, you have to "cd" to each file's repository directory, and "git commit" and "git push" the file change.
<li> To update to a new upstream version, or to switch branch, you have to "cd" to each file's repository directory again to do "git pull/fetch/rebase/checkout" or whatever.
<li> Correlating history across files must be done manually. You could run "git log" or "gitk" on two repositories and match up the timelines or commit messages by hand. I don't know of any tools for doing this.
</ul>
<p>
In contrast, for CVS, "cvs commit" works across multiple files and (if I remember rightly) even across multiple working directories. "cvs update" works across multiple files.
<p>
While "cvs log" doesn't work across multiple files, there is a tool called "CVS Monitor" which reconstructs history and changesets across files.
<p>
Experience with CVS suggests that Git could be changed to handle the multiple-repository case better. "git commit", "git checkout" etc. could be changed to operate across multiple Git working copies. Maybe "git log" and "gitk" could gain options to interleave histories by timestamp.
<p>
Of course, that would lead to cross-repo support that is <em>only as good</em> as CVS's cross-file support. We might be able to apply a textual tag name across multiple Git repos with a single command, just as a tag name can be applied across files with "cvs tag". But that doesn't give us an immutable tag <em>object</em> that spans repos.
<p>
My point is that the fundamental data structure used in the DAG-based systems doesn't solve CVS's problem, it just postpones it to a larger level of granularity. Some possible solutions to the problem are DEPS files (as used by Chromium), Git submodules, or Darcs-style set-of-patches repos. These all introduce new data structures. Do any of these solve the original problem? I am undecided -- this question will have to wait for another post. :-)Mark Seabornhttp://www.blogger.com/profile/08046205947658697263noreply@blogger.com11tag:blogger.com,1999:blog-3888382143307542639.post-64754014550740612652010-05-05T22:42:00.002+01:002010-05-05T23:55:27.829+01:00The trouble with BuildbotThe trouble with <a href="http://buildbot.net">Buildbot</a> is that it encourages you to put rules into a Buildbot-specific build configuration that is separate from the normal configuration files that you might use to build a project (configure scripts, makefiles, etc.).
<p>
This is not a big problem if your Buildbot configuration is simple and just consists of, say, "svn up", "./configure", "make", "make test", and never changes.
<p>
But it is a problem if your Buildbot configuration becomes non-trivial and ever has to be updated, because the Buildbot configuration cannot be tested outside of Buildbot.
<p>
The last time I had to maintain a Buildbot setup, it was necessary to try out configuration changes directly on the Buildbot master. This doesn't work out well if multiple people are responsible for maintaining the setup! Whoever makes a change has to remember to check it in to version control after they've got it working, which of course doesn't always happen. It's a bit ironic that Buildbot is supposed to support automated testing but doesn't follow best practices for testing itself.
<p>
There is a simple way around this though: Instead of putting those separate steps -- "./configure", "make", "make test" -- into the Buildbot config, put them into a script, check the script into version control, and have the Buildbot config run that script. Then the Buildbot config just consists of doing "svn up" and running the script. It is then possible to test changes to the script before checking it in. I've written scripts like this that go as far as debootstrapping a fresh Ubuntu chroot to run tests in, which ensures your package dependency list is up to date.
<p>
Unfortunately, Buildbot's logging facilities don't encourage having a minimal Buildbot config.
<p>
If you use a complicated Buildbot configuration with many Buildbot steps, Buildbot can display each step separately in its HTML-formatted logs. This means:
<ul>
<li> you can see progress;
<li> you can see which steps have failed;
<li> you'd be able to see how long the steps take if Buildbot actually displayed that.
</ul>
<p>
Whereas if you have one big build script in a single Buildbot step, all the output goes into one big, flat, plain text log file.
<p>
I think the solution is to decouple the structured-logging functionality from the glorified-cron functionality that Buildbot provides. My implementation of this is <a href="http://svn.gna.org/viewcvs/plash/trunk/build-tools/build_log.py?rev=874&view=auto">build_log.py</a> (used in Plash's build scripts), which I'll write more about later.Mark Seabornhttp://www.blogger.com/profile/08046205947658697263noreply@blogger.com0tag:blogger.com,1999:blog-3888382143307542639.post-6494059654581191432010-05-01T12:59:00.002+01:002010-05-01T13:03:21.132+01:00Breakpoints in gdb using int3Here is a useful trick I discovered recently while debugging some changes to the <a href="http://code.google.com/p/seccompsandbox/">seccomp sandbox</a>. To trigger a breakpoint on x86, just do:
<pre>
__asm__("int3");
</pre>
<p>
Then it is possible to inspect registers, memory, the call stack, etc. in gdb, without having to get gdb to set a breakpoint. The instruction triggers a SIGTRAP, so if the process is not running under gdb and has no signal handler for SIGTRAP, the process will die.
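<p>
If you want the program to survive hitting the int3 when it is <em>not</em> running under gdb, one option is to install a handler for SIGTRAP. Here is a sketch of that idea: outside gdb, the handler makes int3 a no-op, while under gdb the debugger intercepts the SIGTRAP first (by default gdb stops on SIGTRAP and does not pass it on to the program), so the breakpoint behaves as before:
<pre>#include <signal.h>
#include <stdio.h>

static void trap_handler(int sig) {
  /* Nothing to do: returning resumes execution at the instruction
     after the int3, because int3 is a trap, not a fault. */
  (void) sig;
}

int main() {
  signal(SIGTRAP, trap_handler);
  printf("before breakpoint\n");
  __asm__("int3");
  printf("after breakpoint\n");
  return 0;
}</pre>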
<p>
This technique seems to be <a href="http://sourceware.org/ml/gdb/2009-02/msg00001.html">fairly well known</a>, although it's not mentioned in the gdb documentation. int3 is the instruction that gdb uses internally for setting breakpoints.
<p>
Sometimes it's easier to insert an int3 and rebuild than get gdb to set a breakpoint. For example, setting a gdb breakpoint on a line won't work in the middle of a chunk of inline assembly.
<p>
My expectations of gdb are pretty low these days. When I try to use it to debug something low level, it often doesn't work, which is why I have been motivated to hack together my own <a href="http://lackingrhoticity.blogspot.com/2009/05/really-simple-tracing-debugger-part-2.html">debugging tools</a> in the past. For example, if I run gdb on the glibc dynamic linker (ld.so) on Ubuntu Hardy or Karmic, it gives:
<pre>
$ gdb /lib/ld-linux-x86-64.so.2
...
(gdb) run
Starting program: /lib/ld-linux-x86-64.so.2
Cannot access memory at address 0x21ec88
(gdb)
</pre>
<p>
So it's nice to find a case where I can get some useful information out of gdb.Mark Seabornhttp://www.blogger.com/profile/08046205947658697263noreply@blogger.com0