What is a retpoline and how does it work?

In order to mitigate against kernel or cross-process memory disclosure (the Spectre attack), the Linux kernel [1] will be compiled with a new option, -mindirect-branch=thunk-extern, introduced to gcc to perform indirect calls through a so-called retpoline.

This appears to be a newly invented term as a Google search turns up only very recent use (generally all in 2018).

What is a retpoline and how does it prevent the recent kernel information disclosure attacks?


[1] It's not Linux-specific, however - similar or identical constructs seem to be used as part of the mitigation strategies on other OSes.

Cataclinal answered 4/1, 2018 at 5:52 Comment(5)
An interesting support article from Google.Hydrofoil
oh, so it's pronounced /ˌtræmpəˈlin/ (American) or /ˈtræmpəˌliːn/ (British)Piccard
You might mention that this is the Linux kernel, though gcc points that way! I did not recognise lkml.org/lkml/2018/1/3/780 as on the Linux Kernel Mailing List site, not even once I looked there (and was served a snapshot as it was offline).Gerta
@Gerta - added a Linux kernel tagAntimacassar
@Gerta - good point, I updated the question text. Note that I saw it first in the Linux kernel because of its relatively open development process, but no doubt the same or similar techniques are being used as mitigations across the spectrum of open and closed source OSes. So I don't see this as Linux-specific, but the link certainly is.Cataclinal

The article mentioned by sgbj in the comments, written by Google's Paul Turner, explains the following in much more detail, but I'll give it a shot:

As far as I can piece this together from the limited information at the moment, a retpoline is a return trampoline that uses an infinite loop that is never executed to prevent the CPU from speculating on the target of an indirect jump.

The basic approach can be seen in Andi Kleen's kernel branch addressing this issue:

It introduces the new __x86.indirect_thunk call that loads the call target whose memory address (which I'll call ADDR) is stored on top of the stack and executes the jump using the RET instruction. The thunk itself is then called using the NOSPEC_JMP/CALL macros, which were used to replace many (if not all) indirect calls and jumps. The macro simply places the call target on the stack and sets the return address correctly, if necessary (note the non-linear control flow):

.macro NOSPEC_CALL target
    jmp     1221f            /* jumps to the end of the macro */
1222:
    push    \target          /* pushes ADDR to the stack */
    jmp __x86.indirect_thunk /* executes the indirect jump */
1221:
    call    1222b            /* pushes the return address to the stack */
.endm

The placement of the call at the end is necessary so that, when the indirect call is finished, control flow continues after the use of the NOSPEC_CALL macro, allowing it to be used in place of a regular call.

The thunk itself looks as follows:

    call retpoline_call_target
2:
    lfence /* stop speculation */
    jmp 2b
retpoline_call_target:
    lea 8(%rsp), %rsp 
    ret

The control flow can get a bit confusing here, so let me clarify:

  • call pushes the current instruction pointer (label 2) to the stack.
  • lea adds 8 to the stack pointer, effectively discarding the most recently pushed quadword, which is the return address just pushed (to label 2). After this, the top of the stack holds the real branch target ADDR again.
  • ret pops ADDR and jumps to *ADDR, leaving the stack pointer exactly where a regular call to the target would have left it (the return address pushed by the NOSPEC_CALL macro is now on top).

In the end, this whole behaviour is practically equivalent to jumping directly to *ADDR. The benefit we get is that the predictor used for return instructions (the Return Stack Buffer, RSB), when executing the call instruction, assumes that the corresponding ret will jump back to label 2.
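
To tie these steps together, here is a trace (in execution order) of how the stack and the RSB evolve across one NOSPEC_CALL, where R denotes the address of the instruction following the macro. The annotations are my own sketch, not part of the patch:

    jmp     1221f                  /* stack: []            RSB: []       */
    call    1222b                  /* stack: [R]           RSB: [R]      */
    push    \target                /* stack: [ADDR, R]     RSB: [R]      */
    jmp     __x86.indirect_thunk   /* unchanged                          */
    call    retpoline_call_target  /* stack: [2:, ADDR, R] RSB: [2:, R]  */
    lea     8(%rsp), %rsp          /* stack: [ADDR, R]     RSB: [2:, R]  */
    ret                            /* jumps to *ADDR, predicted: label 2 */
    ...                            /* the called function runs, and its  */
    ret                            /* own ret goes to R, predicted: R    */

The thunk's ret consumes the (deliberately wrong) RSB entry created by call retpoline_call_target, while the entry created by call 1222b is later consumed, correctly, by the called function's own ret, so the return predictor stays balanced (see the comments below).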

The part after label 2 never actually gets executed; it is simply an infinite loop that would, in theory, fill the instruction pipeline with JMP instructions. Using LFENCE, PAUSE, or more generally any instruction that stalls the instruction pipeline keeps the CPU from wasting power and time on this speculative execution. This is because, if the call to retpoline_call_target were to return normally, the LFENCE would be the next instruction to be executed. It is also what the branch predictor will predict, based on the original return address (label 2).

To quote from Intel's architecture manual:

Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

Note however that the specification never mentions that LFENCE and PAUSE cause the pipeline to stall, so I'm reading a bit between the lines here.
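
For comparison, the variant used by the gcc-generated thunks and by the kernel's own retpoline (as far as I can tell) keeps the branch target in a register instead of pushing it, and uses PAUSE together with LFENCE as the speculation trap. Roughly, for a target in %rax (the local label names here are illustrative; gcc emits one such thunk per register, named __x86_indirect_thunk_<reg>):

__x86_indirect_thunk_rax:          /* branch target is in %rax              */
    call    .Lset_up_target        /* pushes .Lcapture_spec, primes the RSB */
.Lcapture_spec:
    pause                          /* speculation trap: anything the CPU    */
    lfence                         /* speculates here just spins harmlessly */
    jmp     .Lcapture_spec
.Lset_up_target:
    mov     %rax, (%rsp)           /* replace the pushed return address     */
                                   /* with the real target...               */
    ret                            /* ...and "return" to it                 */

With =thunk-extern the compiler only emits the calls to these thunks; the thunk bodies themselves are supplied by the kernel.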

Now back to your original question: The kernel memory information disclosure is possible because of the combination of two ideas:

  • Even though speculative execution should be free of side effects when the speculation turns out to be wrong, it still affects the cache hierarchy. This means that when a memory load is executed speculatively, it may still cause a cache line to be evicted. This change in the cache hierarchy can be identified by carefully measuring the access time to memory that is mapped onto the same cache set (a minimal timing probe is sketched after this list).
    You can even leak some bits of arbitrary memory when the source address of the memory read was itself read from kernel memory.

  • The indirect branch predictor of Intel CPUs only uses the lowermost 12 bits of the source instruction's address, so it is easy to poison all 2^12 possible prediction histories with user-controlled memory addresses. When the poisoned prediction is then used for an indirect jump inside the kernel, the attacker-chosen code is speculatively executed with kernel privileges. Using the cache-timing side channel, you can thus leak arbitrary kernel memory.
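
As promised above, here is a minimal sketch of the timing measurement behind the first bullet (my own illustration, not code from any published exploit): a System V x86-64 function that returns, in %eax, the latency in TSC cycles of loading one byte from the address in %rdi. Probing an address before and after the speculative victim load tells you whether the corresponding cache line was touched.

    .globl  probe_access_time
probe_access_time:                 /* returns access latency of (%rdi) in %eax */
    mfence                         /* drain pending stores                     */
    lfence                         /* finish earlier loads first               */
    rdtsc                          /* EDX:EAX = start timestamp                */
    lfence                         /* don't start the probe load early         */
    mov     %eax, %r8d             /* save the low 32 bits of the start        */
    mov     (%rdi), %cl            /* the load being timed                     */
    lfence                         /* wait until the load has completed        */
    rdtsc                          /* EDX:EAX = end timestamp                  */
    sub     %r8d, %eax             /* elapsed cycles (low 32 bits suffice      */
    ret                            /* for intervals this short)                */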

UPDATE: On the kernel mailing list, there is an ongoing discussion that leads me to believe retpolines don't fully mitigate the branch prediction issues, as when the Return Stack Buffer (RSB) runs empty, more recent Intel architectures (Skylake+) fall back to the vulnerable Branch Target Buffer (BTB):

Retpoline as a mitigation strategy swaps indirect branches for returns, to avoid using predictions which come from the BTB, as they can be poisoned by an attacker. The problem with Skylake+ is that an RSB underflow falls back to using a BTB prediction, which allows the attacker to take control of speculation.

Schoolboy answered 4/1, 2018 at 16:25 Comment(8)
I don't think the LFENCE instruction is important, Google's implementation uses a PAUSE instruction instead. support.google.com/faqs/answer/7625886 Note that the documentation you've quoted says "will not execute", not "will not be speculatively executed".Cappello
From that Google FAQ page: "The pause instructions in our speculative loops above are not required for correctness. But it does mean that non-productive speculative execution occupies less functional units on the processor." So it doesn't support your conclusion that LFENCE is the key here.Cappello
@RossRidge I agree partially, to me this looks like two possible implementations of an infinite loop that hint the CPU to not speculatively execute the code following the PAUSE/LFENCE. However if the LFENCE was executed speculatively and not rolled back because the speculation was correct, this would contradict the claim that it will only be executed once the memory loads have finished. (Otherwise, the whole set of instructions that have been executed speculatively would have to be rolled back and executed again to fulfill the specifications)Schoolboy
@RossRidge I think I now understand your argument, I moved the emphasis from the LFENCE/PAUSE to the infinite loop filling the instruction pipeline with JMPs.Schoolboy
How does the ultimate branch target (the one the calling code wants to jump to) get on the stack? The code snippet above will ultimately ret to a stack location which must have been populated before this thunk was called.Cataclinal
@Cataclinal That's achieved by the new NOSPEC_CALL and NOSPEC_JMP macros in jump-asm.h. Most of the other work in this branch seems to be replacing indirect calls and jumps by these macros. I'll include the details in my answerSchoolboy
This has the advantage of push / ret that it doesn't unbalance the return-address predictor stack. There's one mispredict (going to the lfence before the actual return address is used), but using a call + modifying rsp balanced out that ret.Afire
oops, advantage over push / ret (in my last comment). re: your edit: RSB underflow should be impossible because the retpoline includes a call. If kernel pre-emption did a context switch there, we'd resume execution with the RSB primed from the call into the scheduler. But maybe an interrupt handler could end with enough rets to empty the RSB.Afire

A retpoline is designed to protect against the branch target injection (CVE-2017-5715) exploit. This is an attack where an indirect branch instruction in the kernel is used to force the speculative execution of an arbitrary chunk of code. The code chosen is a "gadget" that is somehow useful to the attacker. For example, code can be chosen so that it will leak kernel data through how it affects the cache. The retpoline prevents this exploit by simply replacing all indirect branch instructions with a return instruction.

I think what's key about the retpoline is just the "ret" part: it replaces the indirect branch with a return instruction so that the CPU uses the return stack predictor instead of the exploitable branch predictor. If a simple push and a return instruction were used instead (as in the sketch below), the code that would be speculatively executed would be the code the function will eventually return to anyway, not some gadget useful to the attacker. The main benefit of the trampoline part seems to be to maintain the return stack, so that when the function actually does return to its caller, this is predicted correctly.
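
The sketch mentioned above: replacing an indirect jmp *%rax with a bare push/ret. This is my illustration of the alternative the paragraph describes, not code from any actual patch:

    push    %rax                   /* put the real branch target on the stack */
    ret                            /* architecturally this jumps to *%rax,    */
                                   /* but the return stack predictor predicts */
                                   /* the return address of the most recent   */
                                   /* call, i.e. code the caller will run     */
                                   /* anyway, not an attacker-chosen gadget   */

The cost, as the comments on the previous answer point out, is that this ret consumes an RSB entry that belonged to an earlier call, so the function's real return gets mispredicted later; the retpoline's extra call is what keeps the return stack balanced.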

The basic idea behind branch target injection is simple. It takes advantage of the fact that the CPU doesn't record the full address of the source and destination of branches in its branch target buffers. So the attacker can fill the buffer using jumps in its own address space that will result in prediction hits when a particular indirect jump is executed in the kernel address space.

Note that retpoline doesn't prevent kernel information disclosure directly, it only prevents indirect branch instructions from being used to speculatively execute a gadget that would disclose information. If the attacker can find some other means to speculatively execute the gadget then the retpoline doesn't prevent the attack.

The paper Spectre Attacks: Exploiting Speculative Execution by Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom gives the following overview of how indirect branches can be exploited:

Exploiting Indirect Branches. Drawing from return oriented programming (ROP), in this method the attacker chooses a gadget from the address space of the victim and influences the victim to execute the gadget speculatively. Unlike ROP, the attacker does not rely on a vulnerability in the victim code. Instead, the attacker trains the Branch Target Buffer (BTB) to mispredict a branch from an indirect branch instruction to the address of the gadget, resulting in a speculative execution of the gadget. While the speculatively executed instructions are abandoned, their effects on the cache are not reverted. These effects can be used by the gadget to leak sensitive information. We show how, with a careful selection of a gadget, this method can be used to read arbitrary memory from the victim.

To mistrain the BTB, the attacker finds the virtual address of the gadget in the victim’s address space, then performs indirect branches to this address. This training is done from the attacker’s address space, and it does not matter what resides at the gadget address in the attacker’s address space; all that is required is that the branch used for training branches to the same destination virtual address. (In fact, as long as the attacker handles exceptions, the attack can work even if there is no code mapped at the virtual address of the gadget in the attacker’s address space.) There is also no need for a complete match of the source address of the branch used for training and the address of the targeted branch. Thus, the attacker has significant flexibility in setting up the training.

A blog entry titled Reading privileged memory with a side-channel by the Project Zero team at Google provides another example of how branch target injection can be used to create a working exploit.

Cappello answered 4/1, 2018 at 21:55 Comment(0)

This question was asked a while ago, and deserves a newer answer.

Executive Summary:

“Retpoline” sequences are a software construct which allow indirect branches to be isolated from speculative execution. This may be applied to protect sensitive binaries (such as operating system or hypervisor implementations) from branch target injection attacks against their indirect branches.

The word "retpoline" is a portmanteau of the words "return" and "trampoline", much like the improvement "relpoline" was coined from "relative call" and "trampoline". It is a trampoline construct constructed using return operations which also figuratively ensures that any associated speculative execution will “bounce” endlessly.

In order to mitigate against kernel or cross-process memory disclosure (the Spectre attack), the Linux kernel [1] will be compiled with a new option, -mindirect-branch=thunk-extern, introduced to gcc to perform indirect calls through a so-called retpoline.

[1] It's not Linux-specific, however - similar or identical constructs seem to be used as part of the mitigation strategies on other OSes.

The use of this compiler option only protects against Spectre V2 in affected processors that have the microcode update required for CVE-2017-5715. It will 'work' on any code (not just a kernel), but only code containing "secrets" is worth attacking.
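
To make the effect of the option concrete: at every indirect call or jump site, gcc stops emitting the indirect branch itself and instead routes it through a per-register thunk; with =thunk-extern the thunk body is not generated by the compiler but is expected to be linked in (the kernel supplies its own). A rough before/after sketch for a call through %rax:

    call    *%rax                    /* without the option: plain indirect call   */

    call    __x86_indirect_thunk_rax /* with -mindirect-branch=thunk-extern: the  */
                                     /* jump to *%rax now happens via RET inside  */
                                     /* the externally provided retpoline thunk   */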

This appears to be a newly invented term as a Google search turns up only very recent use (generally all in 2018).

The LLVM compiler has had a -mretpoline switch since before Jan 4, 2018. That date is when the vulnerability was first publicly reported. GCC made their patches available Jan 7, 2018.

The CVE date suggests that the vulnerability was 'discovered' in 2017, but it affects some of the processors manufactured in the past two decades (thus it was likely discovered long ago).

What is a retpoline and how does it prevent the recent kernel information disclosure attacks?

First, a few definitions:

  • Trampoline - Sometimes referred to as indirect jump vectors, trampolines are memory locations holding addresses pointing to interrupt service routines, I/O routines, etc. Execution jumps into the trampoline and then immediately jumps out, or bounces, hence the term trampoline. GCC has traditionally supported nested functions by creating an executable trampoline at run time when the address of a nested function is taken. This is a small piece of code which normally resides on the stack, in the stack frame of the containing function. The trampoline loads the static chain register and then jumps to the real address of the nested function.

  • Thunk - A thunk is a subroutine used to inject an additional calculation into another subroutine. Thunks are primarily used to delay a calculation until its result is needed, or to insert operations at the beginning or end of the other subroutine.

  • Memoization - A memoized function "remembers" the results corresponding to some set of specific inputs. Subsequent calls with remembered inputs return the remembered result rather than recalculating it, thus eliminating the primary cost of a call with given parameters from all but the first call made to the function with those parameters.

Very roughly, a retpoline is a trampoline with a return as a thunk, to 'spoil' memoization in the indirect branch predictor.

The retpoline includes a PAUSE instruction for Intel, but an LFENCE instruction is necessary for AMD, since on that processor the PAUSE instruction is not a serializing instruction, so the pause/jmp loop would use excess power while being speculated over, waiting for the return to mispredict to the correct target.

Arstechnica has a simple explanation of the problem:

"Each processor has an architectural behavior (the documented behavior that describes how the instructions work and that programmers depend on to write their programs) and a microarchitectural behavior (the way an actual implementation of the architecture behaves). These can diverge in subtle ways. For example, architecturally, a program that loads a value from a particular address in memory will wait until the address is known before trying to perform the load. Microarchitecturally, however, the processor might try to speculatively guess at the address so that it can start loading the value from memory (which is slow) even before it's absolutely certain of which address it should use.

If the processor guesses wrong, it will ignore the guessed-at value and perform the load again, this time with the correct address. The architecturally defined behavior is thus preserved. But that faulty guess will disturb other parts of the processor—in particular the contents of the cache. These microarchitectural disturbances can be detected and measured by timing how long it takes to access data that should (or shouldn't) be in the cache, allowing a malicious program to make inferences about the values stored in memory.".

From Intel's paper: "Retpoline: A Branch Target Injection Mitigation" (.PDF):

"A retpoline sequence prevents the processor’s speculative execution from using the "indirect branch predictor" (one way of predicting program flow) to speculate to an address controlled by an exploit (satisfying element 4 of the five elements of branch target injection (Spectre variant 2) exploit composition listed above).".

Note, element 4 is: "The exploit must successfully influence this indirect branch to speculatively mispredict and execute a gadget. This gadget, chosen by the exploit, leaks the secret data via a side channel, typically by cache-timing.".

Culinarian answered 22/11, 2018 at 16:57 Comment(0)
