Monday 8 January 2018

Observing interrupts from userland on x86

In 2016, I noticed a quirk of the x86 architecture that leads to an interesting side channel. On x86, it is possible for a userland process to detect when it has been interrupted by an interrupt handler, without resorting to timing. This is because the usual mechanism for handling interrupts (without using virtualisation) doesn't always preserve all userland registers across an interrupt handler.

If a process sets a segment selector register such as %fs or %gs to 1, the register will be reset to 0 the next time the process is interrupted. Specifically, the x86 IRET instruction, which kernels use for returning from an interrupt handler, resets the register to 0 when returning to userland.

I have not seen this quirk explicitly documented anywhere, so I thought it was worthwhile documenting it via this blog post.

The following C program demonstrates the effect:

#include <stdint.h>
#include <stdio.h>

void set_gs(uint16_t value) {
  __asm__ volatile("mov %0, %%gs" : : "r"(value));
}

uint16_t get_gs(void) {
  uint16_t value;
  __asm__ volatile("mov %%gs, %0" : "=r"(value));
  return value;
}

int main(void) {
  uint16_t orig_gs = get_gs();
  set_gs(1);
  unsigned int count = 0;
  /* Loop until %gs gets reset by an interrupt handler. */
  while (get_gs() == 1)
    ++count;
  /* Restore register so as not to break TLS on x86-32 Linux.  This is
     not necessary on x86-64 Linux, which uses %fs for TLS. */
  set_gs(orig_gs);
  printf("%%gs was reset after %u iterations\n", count);
  return 0;
}

This works on x86-32 or x86-64. I tested it on Linux. It will print a non-deterministic number of iterations. For example:

%gs was reset after 1807364 iterations

Why this happens

  1. x86 segment registers are a bit weird, because each one has two parts:

    • A program-visible 16-bit "segment selector" value, which can be read and written by the MOV instruction.
    • A hidden part. When you write to a segment register using the MOV instruction, the CPU also fills out the hidden part. The hidden part includes a field called DPL (Descriptor Privilege Level).
  2. The specification for the IRET instruction contains a rule which says that:

    • If we are switching to a less-privileged privilege level, such as from the kernel (ring 0) to userland (ring 3),
    • and if a segment register's hidden DPL field contains a value specifying a more-privileged privilege level than what we're switching to,
    • then the segment register should be reset to 0 (as if the segment selector value 0 was written to the register).
  3. The segment selector values 0, 1, 2 and 3 are special -- they are "null segment selector" values.

  4. When you write a null segment selector value to a segment register using MOV, the CPU apparently writes 0 into the segment register's hidden DPL field. (0 is the most-privileged privilege level.)

    I say "apparently" because this does not appear to be explicitly specified in the Intel or AMD docs -- and since this part of the register is hidden, it is hard to check what it contains.

So, if you set %gs = 1, the program-visible part of %gs will contain the value 1 (i.e. reading %gs will return 1), but %gs's DPL field will have been set to 0. That will cause IRET to set %gs = 0 when returning from the kernel to userland.
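
The four steps above can be condensed into a toy model in plain C. This is only a sketch of the reasoning, not a description of real hardware: the names seg_reg, model_mov and model_iret are made up for illustration, the hidden part is reduced to a single DPL field, and the model bakes in the assumption from step 4 that loading a null selector sets the hidden DPL to 0. It also ignores the "data or non-conforming code" type check from the IRET rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one data-segment register (hypothetical names). */
typedef struct {
  uint16_t selector;   /* program-visible part */
  uint8_t hidden_dpl;  /* hidden part, reduced to the DPL field */
} seg_reg;

/* Selector values 0, 1, 2 and 3 are null: index 0, table indicator 0,
   any RPL in the low two bits. */
bool is_null_selector(uint16_t sel) {
  return (sel & ~3u) == 0;
}

/* Model of MOV to a segment register.  For a non-null selector the
   DPL would come from a descriptor table; here the caller passes it in
   as descriptor_dpl.  For a null selector we ASSUME the hidden DPL
   becomes 0, as inferred in step 4. */
void model_mov(seg_reg *reg, uint16_t sel, uint8_t descriptor_dpl) {
  reg->selector = sel;
  reg->hidden_dpl = is_null_selector(sel) ? 0 : descriptor_dpl;
}

/* Model of the IRET rule: only when returning to a less-privileged
   level (numerically larger CPL), a register whose hidden DPL is more
   privileged than the new CPL is reset to selector 0. */
void model_iret(seg_reg *reg, uint8_t old_cpl, uint8_t new_cpl) {
  if (new_cpl > old_cpl && reg->hidden_dpl < new_cpl) {
    reg->selector = 0;
    reg->hidden_dpl = 0;
  }
}
```

In this model, model_mov(&gs, 1, 3) followed by model_iret(&gs, 0, 3) leaves gs.selector at 0, whereas a return from ring 0 to ring 0 leaves the value 1 in place, and an ordinary ring 3 data selector survives the return to userland.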

Documentation for IRET

Intel's documentation for the IRET instruction describes this behaviour via the following piece of pseudocode:

RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each of segment register (ES, FS, GS, and DS)
  DO
    IF segment register points to data or non-conforming code segment
    and CPL > segment descriptor DPL (* Stored in hidden part of segment register *)
      THEN (* Segment register invalid *)
        SegmentSelector ← 0; (* NULL segment selector *)
    FI;
  OD;

(From the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, dated June 2013.)

After writing the explanation above, I checked the current version of Intel's documentation (dated December 2017). I found that the pseudocode has changed slightly. It now contains an explicit "SegmentSelector == NULL" check:

RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each SegReg in (ES, FS, GS, and DS)
  DO
    tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
    IF (SegmentSelector == NULL) OR (tempDesc(DPL) < CPL AND tempDesc(Type) is (data or non-conforming code)))
      THEN (* Segment register invalid *)
        SegmentSelector ← 0; (*Segment selector becomes null*)
    FI;
  OD;

(From the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, December 2017.)

The "SegmentSelector == NULL" check in the new version is ambiguous because there are multiple selector values that are considered to be null. Does this mean "SegmentSelector == 0" or "SegmentSelector in (0,1,2,3)"? If it's the latter, this rule would apply when IRET returns to kernel mode, not just to user mode.

AMD's documentation for IRET contains similar pseudocode:

IF (changing CPL)
{
   FOR (seg = ES, DS, FS, GS)
       IF ((seg.attr.dpl < CPL) && ((seg.attr.type = ’data’)
          || (seg.attr.type = ’non-conforming-code’)))
       {
           seg = NULL  // can’t use lower dpl data segment at higher cpl
       }
}

(From the AMD64 Architecture Programmer's Manual, Volume 3, Revision 3.25, December 2017.)

Note that not all versions of QEMU implement this faithfully in their emulation of x86.

Possible uses

It is possible to detect interrupts without using this method by using a high-resolution timer (such as x86's RDTSC instruction). However:

  • High-resolution timers are sometimes disabled, in some cases out of concern about side channel attacks. In particular, it is possible to disable the RDTSC instruction.
  • Even if a process uses a timer to detect delays in its execution, this doesn't tell it the cause of a delay for certain. In contrast, checking %fs/%gs will indicate whether a delay was accompanied by an IRET.

So, this technique lets us classify delays.
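
As a rough sketch of what classifying delays could look like, the following combines RDTSC with the %gs check. It assumes x86-64 Linux (where userland does not otherwise use %gs), GCC or Clang (for __rdtsc() from x86intrin.h), and that RDTSC has not been disabled. The delay_stats struct, the classify_delays function and the threshold parameter are hypothetical names chosen for this example.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(), provided by GCC and Clang on x86 */

void set_gs(uint16_t value) {
  __asm__ volatile("mov %0, %%gs" : : "r"(value));
}

uint16_t get_gs(void) {
  uint16_t value;
  __asm__ volatile("mov %%gs, %0" : "=r"(value));
  return value;
}

typedef struct {
  unsigned long gaps;             /* timing gaps above the threshold */
  unsigned long gaps_with_reset;  /* ...where %gs had also been reset */
  unsigned long resets_no_gap;    /* %gs reset without a visible gap */
} delay_stats;

/* Spin for the given number of iterations and classify each one.
   An interrupt that lands between the RDTSC and the %gs read is
   attributed to the wrong iteration; this is a sketch, not a
   precise measurement tool. */
delay_stats classify_delays(unsigned long iterations,
                            uint64_t threshold_cycles) {
  delay_stats stats = {0, 0, 0};
  uint16_t orig_gs = get_gs();
  set_gs(1);
  uint64_t prev = __rdtsc();
  for (unsigned long i = 0; i < iterations; i++) {
    uint64_t now = __rdtsc();
    int gap = (now - prev) > threshold_cycles;
    int reset = get_gs() != 1;
    if (gap) {
      stats.gaps++;
      if (reset)
        stats.gaps_with_reset++;
    } else if (reset) {
      stats.resets_no_gap++;
    }
    if (reset)
      set_gs(1);  /* re-arm the detector */
    prev = now;
  }
  set_gs(orig_gs);
  return stats;
}
```

A gap counted in gaps but not in gaps_with_reset is a delay that was not accompanied by an IRET to userland, which is the distinction a timer alone cannot make.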

This could be useful for microbenchmarking. If we're repeatedly running an operation and measuring how long it takes, we can throw out the runs that got interrupted, and thereby reduce the noise in our timing data -- an alternative to using statistical techniques for removing outliers. However, this technique won't catch interrupts that are handled while executing syscalls, because an IRET that returns to kernel mode won't reset the segment registers.
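
A minimal sketch of such a microbenchmark filter follows, under the same assumptions as the demonstration program plus a few more: x86-64 Linux, GCC or Clang, a working RDTSC, and an operation that makes no syscalls and does not touch %gs. The names bench_result, bench_uninterrupted and sample_op are invented for this example, and sample_op is only a stand-in workload.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(), provided by GCC and Clang on x86 */

void set_gs(uint16_t value) {
  __asm__ volatile("mov %0, %%gs" : : "r"(value));
}

uint16_t get_gs(void) {
  uint16_t value;
  __asm__ volatile("mov %%gs, %0" : "=r"(value));
  return value;
}

typedef struct {
  unsigned long kept;       /* runs with no detected interrupt */
  unsigned long discarded;  /* runs during which %gs was reset */
  uint64_t kept_cycles;     /* total TSC ticks across kept runs */
} bench_result;

/* Stand-in workload: must not make syscalls or use %gs. */
void sample_op(void) {
  volatile unsigned int sink = 0;
  for (unsigned int i = 0; i < 1000; i++)
    sink += i;
}

/* Time op() repeatedly, keeping only the runs where %gs still holds
   the value 1 afterwards, i.e. no IRET to userland was observed. */
bench_result bench_uninterrupted(void (*op)(void), unsigned long runs) {
  bench_result result = {0, 0, 0};
  uint16_t orig_gs = get_gs();
  for (unsigned long i = 0; i < runs; i++) {
    set_gs(1);
    uint64_t start = __rdtsc();
    op();
    uint64_t elapsed = __rdtsc() - start;
    if (get_gs() == 1) {
      result.kept++;
      result.kept_cycles += elapsed;
    } else {
      result.discarded++;
    }
  }
  set_gs(orig_gs);
  return result;
}
```

Calling bench_uninterrupted(sample_op, 1000) and dividing kept_cycles by kept (when kept is non-zero) gives a mean over the uninterrupted runs only. How many runs end up discarded depends on the machine and its interrupt load.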

This could be useful if we're trying to read from a side channel by timing memory accesses. We can throw out accesses that got interrupted. Again, this is a way of reducing noise. However, if interrupts occur infrequently, it's possible that they don't cause enough noise for the extra work of using this technique to be worthwhile.

Mitigation via virtualisation

This issue does not occur if the userland process is running in a hardware-virtualised VM and the process is interrupted by an interrupt which is handled outside the VM. I tested this using KVM on Linux. In this case, the interrupt will cause a VMEXIT to occur; the hypervisor will handle the interrupt; and the hypervisor will return to the virtualised userland process with a VM entry, referred to here as VMENTER (the VMLAUNCH/VMRESUME instructions on Intel, or VMRUN on AMD), rather than with IRET.

It appears that the pair of VMEXIT/VMENTER operations will -- unlike IRET -- save and restore all of the segment register state, including the hidden state.