Monday, 27 April 2015

L3 cache mapping on Sandy Bridge CPUs

In 2013, some researchers reverse-engineered how Intel Sandy Bridge CPUs map physical addresses to cache sets in the L3 cache (the last-level cache). They were interested in the cache mapping because it can be used to defeat kernel ASLR. I'm interested because the cache mapping can be used to test whether cached memory accesses can do row hammering (which can cause exploitable bit flips in some DRAM devices).

The researchers published the details in the paper "Practical Timing Side Channel Attacks Against Kernel Space ASLR" (Ralf Hund, Carsten Willems and Thorsten Holz).

They only published the mapping for 4-core CPUs, but I have figured out the mapping for 2-core CPUs as well.

Some background:

  • On Sandy Bridge CPUs, the L3 cache is divided into slices. Physical addresses are hashed to determine which slice of the L3 cache they will be stored in.

  • The L3 cache is distributed and ring-based. There is one slice per core, but all the cores in a CPU can access all the cache slices via a ring bus which connects all the cores and their caches together.

  • When a core accesses a memory location, the location will be slightly slower to access if it maps to a different core's cache slice, because it would take one or two hops around the ring bus to access it. The protocol used on the ring bus is based on QPI (Intel's QuickPath Interconnect). (QPI is a protocol used for connecting multiple CPUs together on high-end multi-socket systems.)

  • Each cache slice contains 2048 cache sets.

  • On lower-end CPUs, cache sets are 12-way associative, so a cache slice is 1.5MB in size (2048 sets * 12 ways * 64 bytes per cache line = 1.5MB).

  • On higher-end CPUs, cache sets are 16-way associative, so a cache slice is 2MB in size (2048 sets * 16 ways * 64 bytes per cache line = 2MB).

Cache mapping

The researchers (Hund et al) figured out that the L3 cache uses the bits of a physical address as follows:

  • Bits 0-5: These give the 6-bit byte offset within a 64-byte cache line.
  • Bits 6-16: These give the 11-bit number of the cache set within a cache slice.
  • Bits 17-31: These are hashed to determine which cache slice to use (see below).
  • Bits 32+: Not used.

The hash function used to choose the cache slice is as follows:

  • On a 4-core CPU, there are 4 cache slices, so the slice number is 2-bit. The two bits of the slice number are h1 and h2, where:

    • h1 is the XOR of physical address bits 18, 19, 21, 23, 25, 27, 29, 30, 31.
    • h2 is the XOR of physical address bits 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31.
  • On a 2-core CPU, there are 2 cache slices, so the slice number is 1-bit. The slice number is the XOR of physical address bits 17, 18, 20, 22, 24, 25, 26, 27, 28, 30.

    Notice that this is equal to the XOR of h1 and h2. (Bits 19, 21, 23, 29 and 31 cancel out when XORing h1 and h2.) This is the part I figured out.

Verifying the cache mapping

I figured out the 2-core hash function by making an educated guess and verifying it.

I verified the guess by writing a C++ program which:

  • Picks N physical memory addresses that should map to the same cache set, according to the cache mapping that we guessed.

    We use Linux's /proc/PID/pagemap interface to determine which physical addresses we have access to.

  • Measures the time it takes to access the N addresses. Specifically, the program accesses the first N-1 addresses and then times accessing the Nth address.

The program does that for multiple values of N.

If we guessed the cache mapping correctly, then, on a CPU with a 12-way cache, we should see the memory access time spike upwards at N=13. This is because, at N=13, the memory locations we're accessing no longer fit within the 12-way cache set, and we'll get L3 cache misses. The memory access time will go up from the L3 cache's latency to DRAM's latency.

That increase is indeed what we see:

(Note: This also assumes that the cache uses an LRU or Pseudo-LRU eviction policy, which is what Sandy Bridge uses. However, the cache eviction policy changed in Ivy Bridge, which is the successor to Sandy Bridge -- see below.)

On the other hand, if we guessed the cache mapping incorrectly, the memory access time will go upwards gradually at higher values of N. Specifically, on a 2-cache-slice CPU, if we get the address-to-slice hash function wrong, we'll see the access time reach DRAM latency at N=13*2, on average. That's because our N physical addresses will be distributed across 2 slices, so on average it will take 13*2 addresses before the 2 cache sets in the slices overflow and produce cache misses.

That's what we get if we wrongly guess that the address-to-slice hash function is h1 (which was my first guess):

Further verification

What happens if our address-to-slice hash function is mostly right, but not completely right? What happens if one of the bits we're XORing is wrong? (For example, maybe bit 17 should really be left out or bit 19 should really be XOR'd too.) Would we notice? Yes, we can test for that.

For each bit from 17 to 31, we can try inverting whether that bit is included in the set of bits that our address-to-slice hash function XORs together.

Here's the graph we get for the resulting hash functions:

Our original hash function (in bold) shows the latency going up soonest, at N=13, so it is likely to be the correct hash function.

Test machine

I ran my tests on a machine with a 2-core, 4-hyperthread Sandy Bridge CPU that reports the following in /proc/cpuinfo on Linux:

cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping        : 7
cache size      : 3072 KB
cpu cores       : 2

Ivy Bridge

This L3 cache mapping appears to apply to Ivy Bridge CPUs too. I ran the same tests on a machine with an Ivy Bridge CPU (also 2-core, 4-hyperthread) and initially got the same shape graphs.

However, the results weren't consistently reproducible on this machine. Later runs showed higher memory access times for N<=12.

This is consistent with reports that Ivy Bridge's L3 cache uses DIP (Dynamic Insertion Policy) for its cache eviction policy, in order to protect against cache thrashing. DIP switches dynamically between an LRU policy (better for smaller working sets which fit in the cache) and a BIP policy (better for larger working sets which don't fit in the cache). For N>12, my test probably generates enough cache misses to cause the cache to switch to BIP mode. Note that this means the order in which we test values of N can affect the results we get.

This machine's /proc/cpuinfo reports:

cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
stepping        : 9
cache size      : 3072 KB
cpu cores       : 2

See Henry Wong's blog post, "Intel Ivy Bridge Cache Replacement Policy", which compares Sandy Bridge and Ivy Bridge.

Also see the paper "Adaptive Insertion Policies for High Performance Caching" (Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr., Joel Emer), which describes DIP (Dynamic Insertion Policy).


Thanks to Yossef Oren for pointing me to the paper by Hund et al, which is referenced by the paper he coauthored, "The Spy in the Sandbox -- Practical Cache Attacks in Javascript" (Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan, Angelos D. Keromytis).


Mark Seaborn said...

Note that another term for this distributed ring-based cache appears to be "Non-Uniform Cache Architecture" (NUCA), presumably by analogy with Non-Uniform Memory Access/Architecture (NUMA).

The term is mentioned in this 2011 blog post: "The Debate over Shared and Private Caches" (Utah Arch blog).

I've not seen the term NUCA used in connection with Sandy/Ivy Bridge, but that's presumably just because the details of these CPUs' cache architectures and their research origins don't get discussed very often.

Mark Seaborn said...

It looks like bit 32 should be included in the set of physical address bits that are XOR'd together. Also, there was a bias in the test program I linked to. For more info, see this post.