Tuesday, 10 March 2015

The DRAM rowhammer bug is exploitable

I've been researching the DRAM rowhammer issue and its security implications for a while. We've finally published our findings on the Project Zero blog: Exploiting the DRAM rowhammer bug to gain kernel privileges.

Wednesday, 21 January 2015

Conditionalising C/C++ code: "#ifdef FOO" vs. "#if FOO"

Is it better to use #ifdef PLATFORM or #if PLATFORM when writing code that needs to be conditionalised according to OS, CPU architecture, etc.?

Chromium's codebase uses the former. For example, it uses #ifdef OS_WIN or #if defined(OS_WIN), where OS_WIN is #defined on Windows (by build/build_config.h) and undefined elsewhere.

In contrast, NaCl's codebase uses the latter. It uses #if NACL_WINDOWS or if (NACL_WINDOWS), where NACL_WINDOWS is always #defined: as 1 on Windows and 0 elsewhere.

This latter approach has the benefit of catching some mistakes:

  • If you mistype the macro name in #if NACL_WINDOWS, you can get a compiler warning.

    However, this only works if you enable GCC/Clang's -Wundef warning. Microsoft's compiler (MSVC) has a similar warning, C4668, but it's effectively unusable because it produces warnings about system header files.

  • You can sometimes write if (PLATFORM) instead of #if PLATFORM. The if() version has the advantage that the code in the if() block will be compile-tested on all platforms, not just those where PLATFORM == 1. This can help catch mistakes earlier. (Both styles are sketched in the example after this list.)

    This doesn't work if the code block uses functions that are only defined on PLATFORM, though.
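
To make the comparison concrete, here is a sketch of both styles in one file. This is illustrative only -- not actual Chromium or NaCl code -- and it assumes the build system provides the macros:

    /* Chromium style: OS_WIN is either #defined or absent.
       A typo such as "#ifdef OS_WINN" fails silently. */
    #ifdef OS_WIN
    static const char *const os_name = "Windows";
    #else
    static const char *const os_name = "other";
    #endif

    /* NaCl style: NACL_WINDOWS is always defined, as 0 or 1.
       A mistyped "#if NACL_WINDOWSS" is caught by -Wundef. */
    #if NACL_WINDOWS
    #define STACK_ALIGNMENT 16
    #else
    #define STACK_ALIGNMENT 8
    #endif

    /* NaCl style also allows a plain if(): both branches get
       compile-tested on every platform, not just on Windows. */
    static int default_stack_size(void) {
      if (NACL_WINDOWS)
        return 2 * 1024 * 1024;
      return 8 * 1024 * 1024;
    }

Compiling this with "gcc -Wundef -c" will warn whenever an undefined macro is evaluated inside an "#if".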

See also: The Great -Wundef purge

Wednesday, 23 July 2014

Implementing fork() on the Mill CPU

The Mill is a new CPU architecture that claims to provide high performance with much better performance-per-watt than conventional CPUs that use out-of-order execution.

The Mill achieves this by making various architectural simplifications. One of those is to remove the TLB from the fast path of memory accesses. Rather than having the TLB between the CPU core and the cache, the Mill's TLB is between the cache and main memory.

OS models

This means the Mill is best suited for running Single Address Space operating systems (SASOS). The intent is that different processes will live at different addresses within a shared 64-bit address space. A process runs with permissions to access restricted ranges of this address space. Switching between processes is therefore just a matter of switching those permissions, which is fast on the Mill. It doesn't involve an expensive flush of the TLB (as on a conventional OS), and it doesn't involve flushing the Mill's virtually-indexed, virtually-tagged (VIVT) cache.

This runs into a problem if we want to run Unix programs that use fork(), though. Use of fork() assumes that multiple processes will want to use the same virtual addresses.

The Mill developers have said they have a scheme for handling fork(), but they haven't said what it is, so I'm going to speculate. :-)

Context switching

If you create forked processes that share a range of virtual address space, a Mill OS can just flush the TLB and cache for those ranges whenever it needs to context switch between the two processes.

Basically, a Mill OS can act like a Separate Address Space OS when handling forked processes.

Switching costs

How expensive that would be depends on how the TLB and cache work.

  • In conventional CPUs, the TLB is per-core. The Mill's TLB might be shared between cores, though, if the Mill has higher-level caches that are both virtually-tagged and shared between cores.

    If that's the case, forked processes wouldn't be able to run concurrently on multiple cores.

  • Flushing the TLB and/or cache when context switching between forked processes would be slow.

    However, the OS would only have to flush the address ranges that have diverged between the forked processes -- that is, ranges that have been written to (and hence copied, via copy-on-write) or re-mmap()'d.

This would be fine for most uses of fork() on Unix, where execve() is called shortly after fork(). Forked processes usually exist only briefly, so there's little need to context-switch between them.
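
As a reminder of that pattern, here's what fork()-then-exec looks like in C. This is a minimal sketch, and the Mill-specific comments are speculation in keeping with the rest of this post:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
      pid_t pid = fork();
      if (pid == -1) {
        perror("fork");
        return 1;
      }
      if (pid == 0) {
        /* Child: initially shares copy-on-write pages with the
           parent, but execv() replaces the address space almost
           immediately, so little divergence accumulates. */
        char *const argv[] = { "/bin/true", NULL };
        execv(argv[0], argv);
        _exit(127);  /* Reached only if execv() fails. */
      }
      /* Parent: the window in which a Mill OS might have to
         context-switch between the two forked address spaces is
         just the fork()-to-exec interval above. */
      int status;
      waitpid(pid, &status, 0);
      return 0;
    }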


Forked processes are sometimes used as an optimisation, though. Android and Linux Chrome fork many child processes from a parent "zygote" process. This is a way to speed up process creation and reduce memory usage by sharing pages. This technique would become a pessimisation on the Mill. Android and Chrome would have to find a different way to do this.

Mill portals

Implementing fork() on the Mill doesn't seem very difficult, then.

However, there is one Mill feature that fork() might interfere with.

Mill provides "portals", a mechanism for one protection domain (a process in the single 64-bit address space) to call into another. A Mill OS would have to ensure that a forked process can't be invoked directly via a portal if that forked process has been context-switched out. The portal would have to be redirected to a routine that waits for the forked process to be context-switched back in before invoking it. Alternatively, a Mill OS could just disallow invoking forked processes via portals entirely.

This doesn't seem like a big problem. Portals are a Mill-specific feature for optimising inter-process calls, whereas fork() is a "legacy" Unix feature. We want fork() for running existing Unix programs, which won't need to define Mill portals.

There might be an issue if a portal invocation needs to return to a forked process. For example, Unix syscalls might be implemented as portal invocations. Maybe these invocations would need to go via a proxy that knows how to context-switch the calling process back in.

Wednesday, 26 March 2014

How to do history-sensitive merges in Git

Merging in Git is usually not history-sensitive. By this I mean: if you're merging branches A and B together, Git looks at the content at the tips of branches A and B, and the content of the common ancestor commit(s) of A and B, but it doesn't look at any of the commits in between. Git just does a 3-way merge. This can make merging more painful than it needs to be.

History-sensitive merging is a merge algorithm that does look at the commit history. It can automatically resolve conflicts in cases where non-history-sensitive merging won't. It can also present conflicts more readably: It can reduce the size of a conflict and show which upstream change conflicted with your local change.

This is a short "how to" for doing a history-sensitive merge in Git.

My case study is: Merging LLVM 3.4 into PNaCl's branch of LLVM, which I did recently.

The problem

PNaCl has a branch of LLVM with various patches applied. (As an aside, we should really upstream those patches, to reduce the difficulty of merging from upstream.) Before the merge, this branch was based on LLVM 3.3 (commit $FROM). We wanted to upgrade it to being based on LLVM 3.4 (commit $TO).

Simply doing

git merge $TO

produced about 160 conflicts, as counted by "git grep '<<<<<<<' | wc -l".

A large number of those conflicts occurred because pnacl-llvm contained various changes cherry-picked from upstream LLVM. Often those changes touched code that was later refactored upstream. That will produce a conflict if we do a 3-way merge.

For example, pnacl-llvm cherry-picked various changes that add test files with "CHECK:" lines. Upstream later refactored these to "CHECK-LABEL:". If we do "git merge $TO", Git will see that the two branches added a file with the same name but different contents. Bam: that's a conflict. Git's 3-way merge doesn't look at the history, so it doesn't notice that both branches contain identical commits that add the file with a "CHECK:" line, and just one of the two branches later modifies that file.

A solution

We can do better just by telling Git to merge each upstream LLVM change one by one:

set -eu  # Stop on error
commits="$(git rev-list --reverse $FROM..$TO)"
for commit in $commits; do
  echo Merging $commit
  git merge --no-edit $commit
done

When this reaches a cherry-picked change, it should merge it with no conflict, and later merge the CHECK-LABEL refactoring with no conflict.

When this does hit a conflict -- i.e. when an upstream change conflicts with a local change -- it shows which upstream change conflicted. This provides more context for resolving the conflict.

This script is restartable. After you've resolved a conflict and git commit'd the resolution, it's fine to re-run the script. It will skip over the already-merged upstream changes.

This approach generates many small merges, usually with small conflicts, which can be resolved incrementally and committed to a branch. That is often easier than resolving a large set of conflicts resulting from one big merge, where it's also difficult to save or reuse a work-in-progress conflict resolution.

The one-by-one merge approach has some disadvantages. If an upstream change conflicts with your local branch and was later reverted upstream, you'll be prompted to resolve the conflict twice: once for the change and once for its revert. If you notice that a change is committed, reverted and then re-applied upstream, it can be quicker to manually git merge the re-applied version.

Going faster

The script above goes quite slowly because it merges each change in turn. An optimisation is: explicitly merge only those upstream changes that modify files that our local branch also modifies (since all other upstream changes will be trivially merged).

set -eu  # Stop on error
files="$(git diff --name-only $FROM origin/master)"
# Relatively slow
git rev-list --reverse $FROM..$TO -- $files > tmp_commits_to_merge
echo $TO >> tmp_commits_to_merge  # Don't miss this commit

for commit in $(cat tmp_commits_to_merge); do
  echo Merging $commit
  git merge --no-edit $commit
done

Cleaning up

This approach produces a lot of Git merge commits. If you don't want to keep this history, you can squash the merges down into a single merge commit:

tree=$(git log --pretty=format:%T --no-walk HEAD)
commit=$(git commit-tree -m 'Commit message for merge...' \
    -p $(git merge-base HEAD origin/master) \
    -p $(git merge-base HEAD upstream/master) \
    $tree)
git push . $commit:my-merge

I believe the term "history-sensitive merge" comes from Darcs, a version control system which implements history-sensitive merging.

Wednesday, 7 August 2013

Handling crashes on Mac OS X: ordering of Mach exceptions versus POSIX signals

Mac OS X is a curious operating system because its kernel is derived from two kernel codebases -- the Mach kernel and a BSD kernel -- that have been glued together.

From these two ancestors, OS X inherits two different mechanisms for processes to handle hardware exceptions (a.k.a. faults):

  • POSIX signals:
    • In-process only.
    • Registered per-process.
    • The handler is always called on the thread that produced the fault.
  • Mach exceptions:
    • Allow both in-process and out-of-process handling.
    • Can be registered per-thread and per-process.
    • The handler is invoked via Mach RPC.

On OS X, it's possible for a process to have both POSIX signal handlers and Mach exception handlers registered.

It's not immediately obvious which of the two handlers will take priority and get invoked first, but experimentation shows that it's the Mach handler. If a process faults with a memory access error, the Mach exception handler for EXC_BAD_ACCESS gets invoked first, and if this handler returns KERN_FAILURE, the POSIX signal handler for SIGBUS will then be invoked.

However, that's not the full story.

OS X has a built-in Crash Reporter service which will create a crash dump if an application crashes and pop up a dialog box. Crash Reporter works via Mach exception handling and runs out-of-process. An application will normally have Crash Reporter's crash handler registered as one of its default Mach exception handlers. But what stops this handler from interfering with normal use of POSIX signals, if Mach exception handlers take priority over POSIX signals?

The answer is that OS X's kernel has three steps for handling a hardware exception:

  1. First-chance Mach exception handler: The kernel tries to invoke the Mach exception handler registered for the type of fault that occurred, e.g. EXC_BAD_ACCESS. If there's no handler registered, it skips this step. If the handler returns KERN_SUCCESS, the kernel resumes the thread that faulted.

  2. POSIX signal handler: If the Mach exception handler was absent or returned KERN_FAILURE, the kernel tries to invoke the POSIX signal handler for the type of fault that occurred, e.g. SIGBUS.

  3. Second-chance Mach exception handler: If invoking the SIGBUS handler fails (either because no SIGBUS handler is registered, or because writing the signal frame failed, or because the signal is blocked by the thread's signal mask), then the kernel starts to terminate the process, and it tries to invoke the Mach exception handler registered for EXC_CRASH.

    Crash Reporter's handler is registered via EXC_CRASH.

    Since EXC_CRASH is generated on the code path for process termination, EXC_CRASH cannot be handled in-process. At this point, the process has been partially shut down. Its threads are not running any more, but some of its state has not been destroyed yet and so can be read by a crash handler.

    If you try to handle EXC_CRASH in-process, the process will just hang. That's presumably because the Mach RPC for invoking the EXC_CRASH handler gets sent, and the thread that would receive it has been suspended, but the RPC doesn't fail because the process's Mach ports have not been freed yet.

    abnormal_exit_notify() in osfmk/kern/exception.c is what implements sending EXC_CRASH.

    Core dumps are handled by the same code path for abnormal process termination.

EXC_CRASH seems to have been added in OS X 10.5 ("Leopard"):

"On Tiger and earlier, the CrashReporter would generate a crash report for a Mach exception even if it was later delivered as a signal and handled by the process. Since Leopard, though, a crash report is only generated if the signal is not handled. So, handling signals is sufficient to suppress crash reports these days." (Ken Thomases, September 2009)

Chromium has some code to do just that. Its DisableOSCrashDumps() function will register a signal handler in order to suppress Crash Reporter's crash reports. Its signal handler calls _exit() to prevent Crash Reporter's EXC_CRASH handler from kicking in. Another way to do this would have been to set the process's exception port for EXC_CRASH to MACH_PORT_NULL.
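
A minimal sketch of that trick (illustrative, not Chromium's actual DisableOSCrashDumps() code): install handlers for the fault signals that exit before the kernel can reach the abnormal-termination path:

    #include <signal.h>
    #include <unistd.h>

    static void crash_handler(int sig) {
      /* Exiting here means the process never reaches the
         abnormal-termination path, so EXC_CRASH is never raised
         and Crash Reporter's handler never runs.  _exit() is
         async-signal-safe. */
      (void) sig;
      _exit(1);
    }

    static void disable_os_crash_dumps(void) {
      struct sigaction sa;
      sigemptyset(&sa.sa_mask);
      sa.sa_flags = 0;
      sa.sa_handler = crash_handler;
      sigaction(SIGSEGV, &sa, NULL);
      sigaction(SIGBUS, &sa, NULL);
      sigaction(SIGILL, &sa, NULL);
    }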

Saturday, 29 June 2013

Simplifying LLVM IR for PNaCl

Lately I've been working on Portable Native Client ("PNaCl" for short).

Native Client (NaCl) is a sandboxing system that allows safe execution of native code in a web browser -- typically C/C++ code compiled to x86-32, x86-64 or ARM native code (with some support for MIPS too). The problem with NaCl has always been that you have to compile your program multiple times, once for each target architecture. PNaCl aims to solve that, so that your program can be compiled once to an architecture-neutral executable which is then translated to x86 or ARM inside the web browser.

PNaCl is based on LLVM, and PNaCl's executable format is based on LLVM's IR ("Intermediate Representation") language and its binary encoding, LLVM bitcode. LLVM's IR language is quite complex, so we're pruning out, or expanding out, a lot of its features in order to ensure that PNaCl will be maintainable in the long term and to reduce executable sizes.

As an example, consider this fragment of C code:

    struct foo {
      int x;
      int y;
    };

    void func(struct foo *ptr) {
      ptr->y = 123;
    }

Normally in LLVM this compiles to the following LLVM assembly code:

    %struct.foo = type { i32, i32 }

    define void @func(%struct.foo* nocapture %ptr) #0 {
      %y = getelementptr inbounds %struct.foo* %ptr, i32 0, i32 1
      store i32 123, i32* %y, align 4, !tbaa !0
      ret void
    }

    attributes #0 = { nounwind }
    !0 = metadata !{metadata !"int", metadata !1}

With PNaCl, we strip out the types, names and metadata so that this becomes:

    define internal void @67(i32) {
      %2 = add i32 %0, 4
      %3 = inttoptr i32 %2 to i32*
      store i32 123, i32* %3, align 1
      ret void
    }

The definition of the struct type goes away. The "getelementptr" instruction is how LLVM handles pointer arithmetic for indexing into structs and arrays -- this gets expanded out and is replaced with pointer arithmetic on integers. The "%struct.foo*" pointer type is replaced with the i32 type. The "inttoptr" instruction remains as a vestige of LLVM's type system: every "load" or "store" instruction in a PNaCl executable is preceded by an "inttoptr" so that the program remains well-typed by LLVM's rules.
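
In C terms, the simplified code is doing roughly the following (a hand-written illustration of the same lowering, not compiler output):

    #include <stdint.h>

    /* Pointers in the PNaCl form are plain 32-bit integers, so the
       field access "ptr->y = 123" becomes integer arithmetic plus a
       cast back to a pointer type just before the store. */
    void func(uint32_t ptr) {
      uint32_t y_addr = ptr + 4;               /* expanded getelementptr */
      *(int32_t *) (uintptr_t) y_addr = 123;   /* inttoptr + store */
    }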

This is actually rather similar to what Emscripten and asm.js do. Like PNaCl, Emscripten is a system for running C/C++ programs in the web browser, but it compiles programs down to Javascript rather than using LLVM bitcode as an interchange format. Both PNaCl and Emscripten use Clang as a C/C++ front end, and run LLVM optimisation passes on the resulting LLVM IR code. In compiling to Javascript, Emscripten performs similar simplification steps for lowering IR, and asm.js codifies the resulting Javascript subset.

Saturday, 29 September 2012

Native Client's NTDLL patch on x86-64 Windows

Last year, I found a security hole in Native Client on 64-bit Windows that could be used to escape from the Native Client sandbox. Fortunately I found the hole before Native Client was enabled by default in Chrome. I recently wrote up the document below to explain the bug and how we fixed it.

Native Client currently relies on a small, process-local, in-memory patch to NTDLL on 64-bit versions of Windows in order to prevent a problem that would create a hole in Native Client's x86-64 sandbox.

The problem

The problem arises because Windows does not have an equivalent of Unix's sigaltstack() system call.

On Unix, a POSIX signal handler may be registered using signal() or sigaction(). When a process faults (e.g. as a result of a memory access error or an illegal instruction), the kernel passes control to the signal handler that is registered for the process. Normally, the signal handler is run on the stack of the thread that faulted. However, if an "alternate signal stack" has been registered using sigaltstack(), the signal handler will run on that stack instead.

On Windows, "vectored exception handlers" are similar to Unix signal handlers, except that a vectored exception handler always gets run on the thread's current stack. This might have been OK, except that Windows has a different division of responsibility between kernel and userland compared with Unix. Windows does more in userland, and this creates problems for NaCl.

On Unix, sigaction() is a syscall that user code calls that records a user-code signal handler function in a kernel data structure. If no signal handler is registered by user code, then when a fault occurs in a process, the kernel never passes control back to userland code in that process.

On Windows, AddVectoredExceptionHandler() is provided in userland by a DLL rather than being provided by the kernel. AddVectoredExceptionHandler() adds the handler function to a list in userland memory. When a fault occurs in the process, the Windows kernel passes control to a function called KiUserExceptionDispatcher in NTDLL, in userland. KiUserExceptionDispatcher checks whether a handler function was registered via AddVectoredExceptionHandler() (or via various other mechanisms), and if so, it calls the handler; otherwise, it terminates the process. The kernel therefore calls KiUserExceptionDispatcher regardless of whether any vectored exception handlers were registered.

This is a problem for NaCl because when we are running sandboxed code, %rsp points into untrusted address space. If sandboxed code faults on a bad memory access or a HLT instruction (which it is allowed to do), Windows calls KiUserExceptionDispatcher, which then runs on the untrusted stack. Since NaCl allows multi-threading, other sandboxed threads can modify the untrusted stack that KiUserExceptionDispatcher is running on.

Furthermore, KiUserExceptionDispatcher is a complex function which calls other NTDLL functions internally. These functions return by using the RET instruction to pop a return address off the stack, which means that control flow is determined by values stored on the untrusted stack. An attacker running as sandboxed code in another thread can modify the memory location containing a return address and so cause the NTDLL routine to jump to an address of the attacker's choice. This allows the attacker to cause a jump to a non-bundle-aligned address, which allows the attacker to escape from the NaCl sandbox.

The fix: patching NTDLL

Our fix is to patch the KiUserExceptionDispatcher entry point inside NTDLL so that this routine will safely terminate the process.

Our patching is process-local: it does not affect the NTDLL.DLL that other processes use. We only modify the in-memory copy of NTDLL.DLL that is loaded into the NaCl process.

Patching is necessary because the address of KiUserExceptionDispatcher is fixed in the Windows kernel at boot time. We don't have any way to tell the kernel to use a routine at a different address. The kernel randomizes the address of NTDLL.DLL at boot time, but thereafter NTDLL.DLL is loaded at the same address in all processes and the address of KiUserExceptionDispatcher is fixed.

What do we do to terminate the process safely? While we could overwrite the start of KiUserExceptionDispatcher with a HLT instruction, that would simply cause the kernel to re-run KiUserExceptionDispatcher, because the kernel does not track whether a thread is currently handling an exception. The kernel would repeatedly push a stack frame containing saved register state and re-call KiUserExceptionDispatcher until the thread runs out of stack. Although this would eventually terminate the process, it would be slower than we would like.

So instead, the original version of our patch overwrites the start of KiUserExceptionDispatcher with:

  mov $0, %esp
  hlt

We set the stack pointer to zero first so that the kernel can't write another stack frame. If a fault occurs when the kernel is unable to write a stack frame to memory, we observe that the kernel does not attempt to call KiUserExceptionDispatcher and instead it just terminates the process.
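
Concretely, applying such a process-local patch might look roughly like this. This is a sketch, not NaCl's actual code; it assumes NTDLL exports the KiUserExceptionDispatcher symbol:

    #include <windows.h>
    #include <string.h>

    /* 0xBC + a 32-bit zero immediate is "mov $0, %esp" (writing %esp
       also zeroes the upper half of %rsp on x86-64); 0xF4 is "hlt". */
    static const unsigned char kPatch[] = {
      0xBC, 0x00, 0x00, 0x00, 0x00,  /* mov $0, %esp */
      0xF4,                          /* hlt */
    };

    static void PatchKiUserExceptionDispatcher(void) {
      void *addr = (void *) GetProcAddress(
          GetModuleHandleA("ntdll.dll"), "KiUserExceptionDispatcher");
      DWORD old_prot;
      /* This modifies only our process's in-memory copy of NTDLL:
         the page becomes private to this process via copy-on-write. */
      VirtualProtect(addr, sizeof(kPatch), PAGE_EXECUTE_READWRITE,
                     &old_prot);
      memcpy(addr, kPatch, sizeof(kPatch));
      VirtualProtect(addr, sizeof(kPatch), old_prot, &old_prot);
      FlushInstructionCache(GetCurrentProcess(), addr, sizeof(kPatch));
    }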

Breakpad crash reporting

We extended the patch to support crash reporting for crashes in trusted code. (In Chrome, this uses the Breakpad crash reporting system.)

The extended version of the patched KiUserExceptionDispatcher looks at a thread-local variable to check whether the fault occurred inside trusted code or sandboxed code. If trusted code faulted, we know we are safely running on a trusted stack, and we pass control back to the original unpatched version of KiUserExceptionDispatcher (which eventually causes the Breakpad crash handler to run and report the crash). Otherwise, we assume %rsp points to an unsafe untrusted stack, and we terminate the process using the same "mov $0, %esp; hlt" sequence as before.

x86-32 Windows

The problem does not arise on 32-bit versions of Windows because of the way NaCl's x86-32 sandbox uses segment registers.

On 32-bit Windows, when sandboxed code faults, the kernel notices that the %ss register contains a non-default value, and it terminates the process without attempting to write a stack frame or call KiUserExceptionDispatcher.

Limitations

Our NTDLL patch has a couple of limitations:

  • We are not using a stable interface. KiUserExceptionDispatcher is a Windows-internal interface between the Windows kernel and NTDLL, and it might change in future versions of Windows. We are relying on undocumented behaviour.
  • We don't control what data the Windows kernel writes to the untrusted stack. This data is momentarily visible to other sandboxed threads before the process is terminated, and it would be easily visible in embeddings of NaCl that support shared memory and multiple processes. We don't think the kernel writes any sensitive data in this stack frame, and it seems unlikely that it would do so in the future, but it's still possible.

Possible alternatives

There are a couple of alternatives to patching NTDLL:

  • Using the Windows debugging API: It is possible to catch a fault in sandboxed code before KiUserExceptionDispatcher is run using the Windows debugging API. We use this API for implementing untrusted hardware exception handling.

    However, using this API comes at a cost: if a process has a debugger attached, Windows will temporarily suspend the whole process each time it creates or exits a thread. This would slow down thread creation and teardown.

    In practice, we wouldn't want to completely replace the NTDLL patch with use of the Windows debugging API. For defence-in-depth, we would want to use both.

  • Changing the sandbox model: We could change the sandbox model so that %rsp does not point into untrusted address space. One option is to use a different register as the untrusted stack pointer, and disallow use of CALL, PUSH and POP. Another option would be to have %rsp point to a thread-private stack (outside of the sandbox's usual address space). Both options would be quite invasive changes to the sandbox model.

Conclusions

Windows runs an internal function on the userland stack whenever a hardware exception occurs. This behaviour does not appear to be documented anywhere, which suggests that Windows treats this use of the userland stack as an implementation detail rather than part of the interface. For most applications, this use of the userland stack does not matter. Native Client is unusual in that this stack usage creates a hole in NaCl's sandbox, which we have to work around.

In addition to being undocumented, Windows' use of the userland stack can be surprising to people from a Unix background, given that Unix signal handlers behave differently. However, when one considers the complexity of Windows' exception handling interfaces -- which include Structured Exception Handlers, Vectored Exception Handlers and UnhandledExceptionFilters -- it makes more sense that this complex logic would live in userland.

When it comes to the kernel/userland boundary, what can be an implementation detail for one person can be an important interface detail for another. The lesson is that these details are often critical for the security of a sandbox.

References

The original sandbox hole: issue 1633 (filed in April 2011 and fixed in June 2011)

Breakpad crash reporting support for 64-bit Windows: issue 2095

Untrusted hardware exception handling for 64-bit Windows: issue 2536

Documentation for AddVectoredExceptionHandler() on MSDN