How far does the branch prediction go?

The CPU assumes branch prediction was correct and continues unless/until it discovers it wasn't. (HW can't detect "basic blocks": it doesn't know when it reaches an address that some other instruction branches to. And you wouldn't want to stop anyway. Modern branch prediction good enough to be usable in an out-of-order CPU is usually correct like 95 to 99% of the time.)

Discovering a mispredict (or confirming a correct prediction) happens when the branch instruction itself is decoded (unconditional direct branch) or executed (conditional and/or indirect).

In case of a mispredict, the CPU has to re-steer the front-end (fetch/decode) to the correct path. On an in-order CPU, no instructions after a branch can execute until the branch itself executes, so it's always just a matter of restarting fetch/decode.

(In-order superscalar could actually execute an instruction after a branch, but an in-order pipeline makes it relatively easy to squash before it reaches write-back and actually changes architectural state. A store is probably the trickiest because you need to discard that store-buffer entry; its visible effect would be on memory, not write-back to registers. But anyway, that along with decoupling execution from cache-miss stores, and other reasons, is why even in-order pipelined CPUs have store buffers.)

Or for an out-of-order CPU with speculative execution that allows instruction from the wrong path to actually execute while a conditional or indirect branch is waiting for its input, it has to flush the back-end and restart issue from the correct path of execution.

(With fast-recovery and a branch-order-buffer, this can happen even if some of the instructions before the branch haven't finished executing yet. e.g. in a loop with a simple loop condition and a longer dependency chain in the loop body, so execution of the loop-condition dependency chain can run ahead and discover a mispredict in the last iteration when the loop branch falls through, without waiting until that instruction is ready to retire. i.e. without waiting for the loop body to execute that far.)

Multiple branches can be in flight at once. And of course any load or store can fault. An OoO exec CPU basically treats everything as speculative until retirement.

The limiting factor on number of un-executed branches in flight might be the size of the branch-order-buffer (BOB), in a case where that fills up before the ROB (reorder buffer) that has to track every not-yet-retired instruction.

I expect that a branch can leave that BOB after executing (confirming the prediction or detecting a mispredict and initiating recovery), without having to retire. If something earlier detected a mis-speculation, the branch was on the wrong path anyway so its inputs were potentially wrong. If the same branch instruction gets into the pipeline along the newly-determined correct path of execution, it'll issue from the front-end again and get allocated a new BOB entry.

So there's no need to keep a BOB entry allocated until retirement; freeing it earlier can let it be re-allocated to a younger branch that hasn't yet executed. (Potentially there's some old cache-miss load that isn't a data dependency for a bunch of later branches, letting the ROB fill with instructions that can mostly finish executing before that load eventually arrives, with the dust settling on all the dependency chains and speculation.)

A typical(?) BOB size is 48 entries in Skylake. Older or lower-end CPUs might have a significantly smaller BOB to go with their smaller out-of-order speculation window.

On a CPU without fast-recovery (no BOB), other out-of-order execution resources would be the upper limit on depth of branch speculation, like the reorder buffer (unretired ops) and/or the scheduler (unexecuted ops).

How can we ... know more about the branch prediction strategies?

https://danluu.com/branch-prediction/ is pretty good. See also the first chapter of Agner Fog's microarch guide (for x86) where he covers what real Intel and AMD CPUs do, as well as some background. https://agner.org/optimize/

Related re: modern IT-TAGE predictors (Haswell and Zen 2 and later):

Related re: pipeline stuff:

Recommended topics

Hot tags