What happens with nested branches and speculative execution?

Asked 6/12, 2019 at 8:42 Answered 10/12, 2019 at 6:22

Solved cpu-architecture nested-if branch-prediction speculative-execution

Alright, so I know that if a particular conditional branch has a condition that takes time to compute (memory access, for instance), the CPU assumes a condition result and speculatively executes along that path. However, what would happen if, along that path, yet another slow conditional branch pops up (assuming, of course, that the first condition hasn't been resolved yet and the CPU can't just commit the changes)? Does the CPU just speculate inside the speculation? What happens if the last condition is mispredicted but the first wasn't? Does it just rollback all the way?

I'm talking about something like this:

if (value_in_memory == y){
   // computations
   if (another_val_memory == x){
      //computations
   }
}

Thrush answered 6/12, 2019 at 8:42 Comment(2)

Writing an answer which I might finish, but the main point is that everything is always treated as speculative because any load or store might fault, or ALU division might trap with a divide exception, etc. So 2 branches in flight at once isn't actually special. With fast-recovery, branch mis-speculation can be caught sooner, and rollback to before the branch can be done while other speculation is still in flight. See What exactly happens when a skylake CPU mispredicts a branch?. – Piperine 6/12, 2019 at 9:26

@PeterCordes So even "regular" instructions are executed speculatively before being commited, and the only distinction between them is a human-made distinction, not computer-made? I presume, then, that the CPU stores multiple, possible rollback points? For instance if I have load instructions that may lead to page faults or simply use stale values, inside a conditional branch, the CPU identifies such instructions and scenarios and saves a state for each of them? I feel like I misunderstood because this may lead to a lot of storing register states and complicated dependencies... – Thrush 6/12, 2019 at 11:40

Speculative execution is the regular state of execution, not a special mode that an out of order CPU enters when it sees a branch and then leaves when the branch is no longer in flight.

This is easier to see if you consider that it's not just branches that can fault, but many instructions, including those that access memory, have restrictions on their input values, etc. So any substantial out of order execution implies constant speculation, and CPUs are built around that idea.

So "nested branches" doesn't end up being special in that sense.

Now, modern CPUs have a variety of methods for quick branch misprediction recovery, faster than recovery from other types of faults¹. For example they may snapshot the state of the register mapping at some branches, to allow recovery to start before the branch is at the head of the reorder buffer. Since it is not always feasible to snapshot at all branches, there might be complicated heuristics involved to decide where to take snapshots.

I mention this last part because it is one way in which nested branches might matter: when there are lots of branches in flight, you might hit some microarchitectural limits related to the tracking of these branches for recovery purposes. For more details, you can look through patents for "branch order buffer" (for Intel techniques, but there are no doubt others).

¹ The basic recovery method is keep executing until the faulting instruction is the next to retire, and then throw away all younger instructions. In the context of branch mispredictions, this means you could actually suffer two or more mispredictions only the oldest of which actually takes effect: e.g., a younger branch mispredicts, and while executing up to that branch (at which point recovery can occur), another mispredict occurs, so the younger one ends up getting discarded.

Delaware answered 6/12, 2019 at 22:29 Comment(9)

I think the BoB has to snapshot the whole RAT (register allocation table) to enable fast-recovery, not just which physical registers hold the current architectural state. What exactly happens when a skylake CPU mispredicts a branch?. The nice thing about fast recovery is that out-of-order exec of independent work before the branch can continue while rollback is happening (to the correct path detected by executing the branch). That would be a lot of state, though, IDK if a 40-entry BoB is really 40x the side of the RAT. Presumably there's some trickery. – Piperine 7/12, 2019 at 4:12

@Peter - the RAT mostly only holds logical to phys mapping though, doesn't it? About the BOB, I don't think the number of entries corresponds to the number of possible snapshots (number of so-called shadow RATs), there could be many fewer snapshots, which is what I mean about heuristic allocation. – Delaware 7/12, 2019 at 5:0

I was thinking the RAT needed to hold the logical -> phys mappings over the whole history of not-yet-retired instructions, and which physical regs are allocated to which uop. After rolling back to a shadow RAT, retirement from there needs to be able to keep updating the retirement state to match instructions being retired. Can the RAT really be as small as 1 entry per architectural reg? (This is normally something that Just Works so I haven't thought about it in this level of detail for a while.) – Piperine 7/12, 2019 at 5:10

@PeterCordes - yeah, it's probably partly semantics. Something needs to track that, since physical registers need to be freed at some point, but it doesn't have to be the RAT whose basic job is simply to keep track of the current mappings to support renaming. Since the RAT is a highly ported thing on the critical single-cycle renaming loop, it makes sense to keep it as compact as possible. The fully mapping can be tracked in something like a PRRT (post retirement reclaim table), and this thing can be slower because retiring an instruction can actually take several ... – Delaware 7/12, 2019 at 20:18

... cycles (well, logically, there is a single cycle during which the instruction retires, but then additional work may occur before all the resources are fully released). Still, your point stands - whether that tracking happens in the RAT itself, or another place that may or may not be part of the RAT, that info has to be restored, so I'll edit the answer. – Delaware 7/12, 2019 at 20:21

Reading Intel's patent from 2000 (albeit old, I'm assuming it to be applicable to modern CPU's since big changes were probably only made starting 2018 when Spectre and Meltdown were discovered) they do mention that they use multiple shadow RATs to store multiple states in case of multiple branches. From what I could gather, they just fill the RATs as branches show up and stop doing so when all RATs are filled, making sure that one RAT holds the oldest (uncommited) branch state, but continuing speculative execution (without saving furhter snapshots). Thank you both for your answers! – Thrush 8/12, 2019 at 16:38

@C.Pinto - yeah, I also dug though the patents and one point and came to a similar conclusion. The one thing I'm not sure about is "they just fill the RATs as branches show up and stop doing so when all RATs are filled", since it would seem to perform very poorly in some common cases: e.g., a sometimes unpredictable branch in a loop (with a very predictable loop branch). If you waste all your snapshots on the predictable branches, you'll do a worse job on the unpredictable ones. So I thought I recall reading something that suggested that the ... – Delaware 9/12, 2019 at 17:26

... snapshot could be taken or not depending on how likely it was to be useful, "saving" snapshots for later branches that might be less predictable. I could be wrong and even if it does appear in a patent it doesn't mean it works like that. – Delaware 9/12, 2019 at 17:27

@Delaware Is it valid to say that instructions in ROB can speculatively execution no matter the control depencies/branch prediction? – Delao 27/8, 2021 at 18:34

(Maybe not a complete answer, but I had some of this written when @BeeOnRope posted an answer. Posting this anyway for some more links and technical details in case anyone's curious.)

Everything is always speculative until it reaches retirement and becomes non-speculative, definitely happened, part of the architectural state.

e.g. any load might fault with a bad address, any div might trap on divide by zero. See also Out-of-order execution vs. speculative execution That and What exactly happens when a skylake CPU mispredicts a branch? mention that branch mispredicts are handled specially, because they're expected to be frequent. Fast-recovery can start before a mis-predicted branch reaches retirement, unlike the behaviour for a faulting load for example. (That's part of why Meltdown is exploitable.)

So even "regular" instructions are executed speculatively before being commited, and the only distinction between them is a human-made distinction, not computer-made? I presume, then, that the CPU stores multiple, possible rollback points? For instance if I have load instructions that may lead to page faults or simply use stale values, inside a conditional branch, the CPU identifies such instructions and scenarios and saves a state for each of them? I feel like I misunderstood because this may lead to a lot of storing register states and complicated dependencies.

The retirement state is always consistent so you can always roll back to there and discard all in-flight work, e.g. if an external interrupt arrives you want to handle it without waiting for a chain of a dozen cache miss loads to all execute. When an interrupt occurs, what happens to instructions in the pipeline?

This tracking basically happens for free or is something you need to do anyway to be able to detect which instruction faulted, not just that there was a problem somewhere. (This is called "precise exceptions")

The real distinction humans can usefully make is speculation that has a real chance of being wrong during execution of non-error cases. If your code gets a bad pointer, it doesn't really matter how it performs; it's going to page-fault and that's going to be very slow compared to local OoO exec details.

You're talking about a modern out-of-order (OoO) execution (not just fetch) CPU, like modern Intel or AMD x86, high-end ARM, MIPS r10000, etc.

The front-end is in-order (with speculation down predicted paths), and so is commit (aka retirement) from the out-of-order back-end into non-speculative retirement state. (aka known-good architectural state).

The CPU uses two major structures to track instructions (or on x86, uops = parts of instructions) in the back-end. The last stage of the front-end (after fetch / decode) allocates/renames instructions and adds them into both of these structures at once.

RS = Reservation Station = scheduler: not-yet-executed instructions, waiting for an execution unit. The RS tracks dependencies and sends the oldest-ready uops to execution units that are ready.
ROB = ReOrder Buffer: not-yet-retired instructions. Instructions enter and leave in-order so it can just be a circular buffer.

Includes a flag to mark each entry as executed or not, set once the RS has sent it to an execution unit which reports success. The oldest instructions in the ROB that all have their done-executing bit set can "retire".

Also includes a flag which indicates "fault if this reaches retirement". This avoids spending time handling page faults from load instruction on the wrong path of execution (that might well have pointers into an unmapped page), for example. Either in the shadow of a branch mispredict, or just after another instruction (in program order) that should have faulted first but OoO exec got to it later.

(I'm also leaving out register-renaming onto a large physical register file. That's the "rename" part. Allocate includes choosing which execution port an instruction will use, and reserving a load or store buffer entry for memory instructions.)

(There's also a store-buffer; stores don't write directly to L1d cache, they write to the store buffer. This makes it possible to speculatively execute stores and still roll back without them becoming visible to other cores. It also decouples cache-miss stores from execution. Once a store instruction retires, the store-buffer entry "graduates" and is eligible to commit to L1d cache, once MESI gets exclusive access to the cache line, and once memory-ordering rules are satisfied.)

Execution units detect whether an instruction should fault, or was mis-speculated and should roll back, but don't necessarily act on that until the instruction reaches retirement.

In-order retirement is the step that recovers program-order after OoO exec, including the case of exceptions of mis-speculation.

Terminology: Intel calls it "issue" when instructions are sent from the front-end into the ROB + RS. Other computer architecture people often call that "dispatch".

Sending uops from the RS to execution units is called "dispatch" by Intel, "issue" by other people.

Piperine answered 10/12, 2019 at 6:22 Comment(0)

Recommended topics

Hot tags