How far does the branch prediction go?
I understand there is a branch predictor in modern CPU designs trying to guess which branch to go.

Assuming there is a jump instruction that will transfer control flow to either basic block A or basic block B.

If the predictor decides to go to A, but when the jump instruction actually executes it turns out B was the right choice instead of A, how far has execution proceeded into basic block A by that point?

Are all the instructions in basic block A executed? Or is only the first instruction executed?

How can we find out the actual result and know more about the branch prediction strategies?

Blas answered 6/8, 2019 at 4:21 Comment(1)
The CPU doesn't really know about basic blocks.Hedva
The CPU assumes the branch prediction was correct and continues on that path unless/until it discovers it wasn't. (Hardware can't detect "basic blocks": it doesn't know when it reaches an address that some other instruction branches to. And you wouldn't want to stop anyway: branch prediction good enough to be usable in an out-of-order CPU is typically correct about 95 to 99% of the time.)

Discovering a mispredict (or confirming a correct prediction) happens when the branch instruction itself is decoded (unconditional direct branch) or executed (conditional and/or indirect).

In case of a mispredict, the CPU has to re-steer the front-end (fetch/decode) to the correct path. On an in-order CPU, no instructions after a branch can execute until the branch itself executes, so it's always just a matter of restarting fetch/decode.

(In-order superscalar could actually execute an instruction after a branch, but an in-order pipeline makes it relatively easy to squash before it reaches write-back and actually changes architectural state. A store is probably the trickiest because you need to discard that store-buffer entry; its visible effect would be on memory, not write-back to registers. But anyway, that along with decoupling execution from cache-miss stores, and other reasons, is why even in-order pipelined CPUs have store buffers.)


Or, for an out-of-order CPU with speculative execution, which allows instructions from the wrong path to actually execute while a conditional or indirect branch is waiting for its input: recovery means flushing the back-end and restarting issue from the correct path of execution.

(With fast-recovery and a branch-order-buffer, this can happen even if some of the instructions before the branch haven't finished executing yet. e.g. in a loop with a simple loop condition and a longer dependency chain in the loop body, so execution of the loop-condition dependency chain can run ahead and discover a mispredict in the last iteration when the loop branch falls through, without waiting until that instruction is ready to retire. i.e. without waiting for the loop body to execute that far.)

Multiple branches can be in flight at once. And of course any load or store can fault. An OoO exec CPU basically treats everything as speculative until retirement.

The limiting factor on number of un-executed branches in flight might be the size of the branch-order-buffer (BOB), in a case where that fills up before the ROB (reorder buffer) that has to track every not-yet-retired instruction.

I expect that a branch can leave that BOB after executing (confirming the prediction or detecting a mispredict and initiating recovery), without having to retire. If something earlier detected a mis-speculation, the branch was on the wrong path anyway so its inputs were potentially wrong. If the same branch instruction gets into the pipeline along the newly-determined correct path of execution, it'll issue from the front-end again and get allocated a new BOB entry.

So there's no need to keep a BOB entry allocated until retirement; freeing it earlier can let it be re-allocated to a younger branch that hasn't yet executed. (Potentially there's some old cache-miss load that isn't a data dependency for a bunch of later branches, letting the ROB fill with instructions that can mostly finish executing before that load eventually arrives, with the dust settling on all the dependency chains and speculation.)

A typical(?) BOB size is 48 entries in Skylake. Older or lower-end CPUs might have a significantly smaller BOB to go with their smaller out-of-order speculation window.

On a CPU without fast-recovery (no BOB), other out-of-order execution resources would be the upper limit on depth of branch speculation, like the reorder buffer (unretired ops) and/or the scheduler (unexecuted ops).


How can we ... know more about the branch prediction strategies?

https://danluu.com/branch-prediction/ is pretty good. See also the first chapter of Agner Fog's microarch guide (for x86) where he covers what real Intel and AMD CPUs do, as well as some background. https://agner.org/optimize/

Related: modern IT-TAGE predictors (Haswell, and Zen 2 and later), and related pipeline questions.
Rarotonga answered 6/8, 2019 at 4:52 Comment(3)
The CPU would continue executing instructions until it hits the first structural limit (such as ROB size), or until it determines that one of the instructions on the predicted path has faulted (there is no point in executing later instructions), or until it decides to handle an interrupt as soon as possible (in this case, we don't want to allocate more instructions in the ROB). There could be a limit on the number of in-flight branches, but then there would be multiple basic blocks in the predicted path, and the question is about a single basic block.Overwinter
@HadiBrais: yes, existing Intel CPUs with fast-recovery via a BoB (Nehalem and later, IIRC) do have a limit on in-flight branches that's smaller than the RS size. Anyway yes there are limits to the out-of-order window size, but as you say none of them have anything to do with basic blocks exactly, and only one of them has anything to do with in-flight branches.Rarotonga
An in-order superscalar could execute instructions after a branch in parallel with branch execution. With a skewed pipeline that delays branch evaluation and other computation instructions to hide load-to-use latency, a load after a branch could execute before the branch is resolved. Squashing writeback of results of instructions in a mispredicted path avoids state corruption.Equi
