How are branch mispredictions handled before a hardware interrupt

A hardware interrupt arrives on a particular vector (not masked); the CPU checks the IF flag and pushes RFLAGS, CS and RIP onto the stack. Meanwhile there are still instructions completing in the back end, and one of those instructions' branch predictions turns out to be wrong. Usually the pipeline would be flushed and the front end would start fetching from the correct address, but in this scenario an interrupt is in progress.

When an interrupt occurs, what happens to instructions in the pipeline?

I have read this and clearly one solution is to immediately flush everything from the pipeline so that this doesn't occur, and then generate the instructions to push RFLAGS, CS and RIP to the kernel stack location from the TSS. However, the question arises: how does the CPU know the (CS:)RIP associated with the most recent architectural state in order to push it on the stack, given that the front-end RIP would now be ahead? This is similar to the question of how the taken-branch execution unit on port 0 knows the (CS:)RIP of what should have been fetched when the taken prediction turns out to be wrong -- is the address encoded into the instruction as well as the prediction? The same issue arises with a trap / exception: the CPU needs to push the address of the current instruction (fault) or the next instruction (trap) to the kernel stack, but how does it work out the address of this instruction when it is halfway down the pipeline? This leads me to believe that the address must be encoded into the instruction and worked out using the length information, possibly all done at the predecode stage.

Alannaalano answered 29/1, 2019 at 14:11

The CPU will presumably discard the contents of the ROB, rolling back to the latest retirement state before servicing the interrupt.

An in-flight branch miss doesn't change this. Depending on the CPU (older / simpler), it might have already been in the process of rolling back to retirement state and flushing because of a branch miss, when the interrupt arrived.

As @Hadi says, the CPU could choose at that point to retire the branch (with the interrupt pushing a CS:RIP pointing to the correct branch target), instead of leaving it to be re-executed after returning from the interrupt.

But that only works if the branch instruction was already ready to retire: there were no instructions older than the branch still not executed. Since it's important to discover branch misses as early as possible, I assume branch recovery starts when it discovers a mispredict during execution, not waiting until it reaches retirement. (This is unlike other kinds of faults: e.g. Meltdown and L1TF are based on a faulting load not triggering #PF fault handling until it reaches retirement so the CPU is sure there really is a fault on the true path of execution. You don't want to start an expensive pipeline flush until you're sure it wasn't in the shadow of a mispredict or earlier fault.)

But since branch misses don't take an exception, redirecting the front-end can start early before we're sure that the branch instruction is part of the right path in the first place.

e.g. cmp byte [cache_miss_load], 123 / je mispredicts but won't be discovered for a long time. Then in the shadow of that mispredict, a cmp eax, 1 / je on the "wrong" path runs and a mispredict is discovered for it. With fast recovery, uops past that are flushed and fetch/decode/exec from the "right" path can start before the earlier mispredict is even discovered.
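As a rough illustration of that fast-recovery behaviour, here is a toy Python model (an entirely invented structure, not a description of real hardware): a younger branch's mispredict is discovered while an older branch is still unresolved, and only the uops younger than the discovered mispredict are flushed.

```python
# Toy model of "fast recovery": when a branch mispredict is discovered,
# only uops YOUNGER than that branch are flushed; older in-flight uops
# (including an older, still-unresolved branch) keep executing.
# All names here are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Uop:
    seq: int                  # program order (ROB allocation order)
    is_branch: bool = False
    mispredicted: bool = False

@dataclass
class Pipeline:
    rob: list = field(default_factory=list)

    def issue(self, uop):
        self.rob.append(uop)

    def resolve_branch(self, seq):
        """Branch with this seq number finishes executing (possibly out
        of order).  On a mispredict, flush only the younger uops."""
        br = next(u for u in self.rob if u.seq == seq)
        if br.mispredicted:
            self.rob = [u for u in self.rob if u.seq <= seq]
            return True       # front-end can be re-steered now
        return False

pipe = Pipeline()
for s in range(6):
    pipe.issue(Uop(s, is_branch=(s in (1, 3)), mispredicted=(s == 3)))

# The younger branch (seq 3) resolves FIRST, before the older branch (seq 1):
pipe.resolve_branch(3)
print([u.seq for u in pipe.rob])   # -> [0, 1, 2, 3]: seq 4 and 5 flushed,
                                   # older branch seq 1 still in flight
```

The point of the model is only the selective flush: discovering the younger mispredict does not wait for, or disturb, the older unresolved branch.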


To keep IRQ latency low, CPUs don't tend to give in-flight instructions extra time to retire. Also, any retired stores that still have their data in the store buffer (not yet committed to L1d) have to commit before any stores by the interrupt handler can commit. But interrupts are serializing (I think), and any MMIO or port-IO in a handler will probably involve a memory barrier or strongly-ordered store, so letting more instructions retire can hurt IRQ latency if they involve stores. (Once a store retires, it definitely needs to happen even while its data is still in the store buffer).
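That store-ordering constraint can be sketched as a hypothetical simplified model (the class and method names are invented): retired stores sitting in the store buffer drain before any store from the interrupt handler commits.

```python
# Toy ordering model: stores that have retired but whose data is still
# in the store buffer must commit to memory before any store from the
# interrupt handler can commit.  Invented structure for illustration.

class StoreBuffer:
    def __init__(self):
        self.pending = []     # retired but not yet committed to L1d
        self.committed = []   # globally visible, in order

    def retire_store(self, s):
        self.pending.append(s)

    def drain(self):
        self.committed += self.pending
        self.pending = []

    def enter_interrupt_handler(self):
        self.drain()          # older stores must commit first

sb = StoreBuffer()
sb.retire_store("store A")
sb.retire_store("store B")
sb.enter_interrupt_handler()
sb.committed.append("handler MMIO store")
print(sb.committed)   # -> ['store A', 'store B', 'handler MMIO store']
```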


The out-of-order back-end always knows how to roll back to a known-good retirement state; the entire contents of the ROB are always considered speculative because any load or store could fault, and so can many other instructions¹. Speculation past branches isn't super-special.

Branches are only special in having extra tracking for fast recovery (the Branch Order Buffer in Nehalem and newer) because they're expected to mispredict with non-negligible frequency during normal operation. See What exactly happens when a skylake CPU mispredicts a branch? for some details. Especially David Kanter's quote:

Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.


(This answer is intentionally very Intel-centric because you tagged it , not . I assume AMD does something similar, and probably most out-of-order uarches for other ISAs are broadly similar. Except that memory-order mis-speculation isn't a thing on CPUs with a weaker memory model where CPUs are allowed to visibly reorder loads.)


Footnote 1: So can div, or any FPU instruction if FP exceptions are unmasked. And a denormal FP result could require a microcode assist to handle, even with FP exceptions masked like they are by default.

On Intel CPUs, a memory-order mis-speculation can also result in a pipeline nuke (load speculatively done early, before earlier loads complete, but the cache lost its copy of the line before the x86 memory model said the load could take its value).

Sanguineous answered 29/1, 2019 at 21:50
Intel manual V3 Section 11.10 mentions that retired stores are drained when raising an interrupt or exception. AMD manual V2 Section 7.5 mentions that interrupts and exceptions are fully serializing events. I don't think this is guaranteed on Intel processors, though (is there any mention of this in the manual?). On both AMD and Intel processors, IRET is fully serializing. – Cacao
@HadiBrais: it's not something I've looked into, thanks for checking. I wondered if some CPUs might not even start running instructions from the interrupt handler until after stores commit, so it's interesting to find out that AMD is definitely like that. And maybe Intel as well, but that wording isn't 100% specific. And yeah, I knew about iret, but if we didn't serialize until after the IRQ handler's work is done, it wouldn't be as much of a problem for interrupt latency (just a throughput cost). That's why I mentioned IRQ handler stores having to wait for the buffer to flush. – Sanguineous
Nice comment about Meltdown, also about the store buffers; I hadn't thought about those and whether they'd be flushed or not. @HadiBrais I wasn't previously aware of the terminology "serialising instructions", but I assume for IRET it just ensures that the mode cannot be switched back to user mode while there are still kernel-mode stores occurring. I'm still reading. It is a lot to take in, especially when you try to picture how the circuitry logic works for all of the different conditions as well. – Alannaalano
@LewisKelsey: on x86, "serializing" means fully flushing the pipeline including the store buffer before even executing the next instruction. Like cpuid. See MFENCE/SFENCE/etc "serialize memory but not instruction execution"? for serializing memory vs. instructions. See also How many memory barriers instructions does an x86 CPU have? for some mention of barriers vs. fully serializing. – Sanguineous

In general, each entry in the ReOrder Buffer (ROB) has a field that stores enough information about the instruction address to reconstruct the whole instruction address unambiguously; storing the full address for every instruction in the ROB could be too costly. Instructions that have not yet been allocated (i.e., not yet passed the allocation stage of the pipeline) need to carry this information with them at least until they reach the allocation stage.
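One way to picture such a compact encoding (purely a toy sketch, not a documented scheme) is to store a shared high part of the address once, keep only per-entry low bits in each ROB entry, and reconstruct the full RIP on demand:

```python
# Toy illustration of a compact per-entry address in the ROB: a small
# table of shared upper address bits plus per-entry low bits.
# HIGH_SHIFT and the table layout are invented for this example.

HIGH_SHIFT = 12  # assume nearby instructions share RIP[63:12]

class Rob:
    def __init__(self):
        self.high_parts = []   # small table of shared upper bits
        self.entries = []      # per-uop: (index_into_high_parts, low_bits)

    def allocate(self, rip):
        high, low = rip >> HIGH_SHIFT, rip & ((1 << HIGH_SHIFT) - 1)
        if not self.high_parts or self.high_parts[-1] != high:
            self.high_parts.append(high)      # new region entered
        self.entries.append((len(self.high_parts) - 1, low))

    def rip_of(self, idx):
        """Reconstruct the full address, e.g. to push CS:RIP for an
        interrupt, or to re-steer fetch after a mispredict."""
        hi_idx, low = self.entries[idx]
        return (self.high_parts[hi_idx] << HIGH_SHIFT) | low

rob = Rob()
for rip in (0x401000, 0x401005, 0x40100B, 0x402010):
    rob.allocate(rip)
print(hex(rob.rip_of(2)))   # -> 0x40100b, full RIP recovered
```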

If an interrupt and a branch misprediction occur at the same time, the processor may, for example, choose to service the interrupt. In this case, all the instructions that are on the mispredicted path need to be flushed. The processor may also choose to flush other instructions that are on the correct path but have not yet retired. All of these instructions are in the ROB and their instruction addresses are known. For each speculated branch, there is a tag that identifies all instructions on that speculated path, and all instructions on this path are tagged with it. If there is another, later speculated branch, another tag is used, but it is also ordered with respect to the previous tag. Using these tags, the processor can determine exactly which instructions to flush when any of the speculated branches turns out to be incorrect. This is determined after the corresponding branch instruction completes execution in the branch execution unit; branches may complete execution out of order. When the correct address of a mispredicted branch is calculated, it's forwarded to the fetch unit and the branch prediction unit (BPU): the fetch unit uses it to fetch instructions from the correct path, and the BPU uses it to update its prediction state.
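The tag mechanism can be sketched as a toy model (invented names, simplified to sets of tags): each uop records which not-yet-resolved branches it is speculative under, and a discovered mispredict flushes exactly the uops carrying that branch's tag.

```python
# Toy sketch of tag-based selective flush.  Each uop carries the set of
# unresolved branches it is speculative under; a mispredict of one of
# those branches flushes exactly the tagged uops.  Invented structure,
# not a documented design.

class TaggedRob:
    def __init__(self):
        self.entries = []        # (uop_name, frozenset_of_branch_tags)
        self.live_tags = set()   # branches fetched but not yet resolved

    def fetch(self, name, new_branch_tag=None):
        self.entries.append((name, frozenset(self.live_tags)))
        if new_branch_tag is not None:   # later uops are speculative under it
            self.live_tags.add(new_branch_tag)

    def mispredict(self, tag):
        # Flush every uop on the mispredicted path; the branch itself
        # stays (it executed, just with a wrong prediction).
        self.entries = [(n, t) for (n, t) in self.entries if tag not in t]
        self.live_tags.discard(tag)

rob = TaggedRob()
rob.fetch("add")
rob.fetch("je_1", new_branch_tag="B1")   # first predicted branch
rob.fetch("mov")                          # speculative under B1
rob.fetch("je_2", new_branch_tag="B2")   # second branch, itself under B1
rob.fetch("sub")                          # speculative under B1 and B2

rob.mispredict("B2")                      # flush only uops tagged B2
print([n for n, _ in rob.entries])        # -> ['add', 'je_1', 'mov', 'je_2']
```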

The processor can choose to retire the mispredicted branch instruction itself and flush all other later instructions. All rename registers are reclaimed and those physical registers that are mapped to architectural registers at the point the branch is retired are retained. At this point, the processor executes instructions to save the current state and then begins fetching instructions of the interrupt handler.

Cacao answered 29/1, 2019 at 20:19
These tags for speculation past a specific branch only exist on CPUs new enough to have a Branch Order Buffer to allow efficient rollback just to a mispredict (on discovery of the branch miss). My understanding is that earlier CPUs flushed to the last known-good retirement state when a mispredicted branch reached retirement, instead of doing fast recovery while uops / instructions before the mispredicted branch were still executing. – Sanguineous
