There are no special cases for short branch distances in any x86 CPUs. Even unconditional jmp
to the next instruction (architecturally a nop) needs correct branch prediction to be handled efficiently; if you put enough of those in a row you run out of BTB entries and performance falls off a cliff. Slow jmp-instruction
Fetch/decode is only a minor problem; yes a very short branch within the same cache line will still hit in L1i and probably uop cache. But it's unlikely that the decoders would special-case a predicted-taken forward jump and make use of pre-decode instruction-boundary finding from one block that included both the branch and the target.
When the instruction is being decoded to uops and fed into the front-end, register values aren't available; those are only available in the out-of-order execution back-end.
The major problem is that when the instructions after .LBB1_67:
execute, the architectural state is different depending on whether the branch was taken or not.
And so is the micro-architectural state (RAT = Register Allocation Table).
Either:
r9
depends on the sbb
/setl
result (mov r9d, r8d
didn't run)
r9
depends on the sbb
/setb
result (mov r9d, r8d
did run)
Conditional branches are called "control dependencies" in computer-architecture terminology. Branch-prediction + speculative execution avoids turning control dependencies into data dependencies. If the je
was predicted not taken, the setl
result (the old value of r9
) is overwritten by mov
and is no longer available anywhere.
There's no way to recover from this after detecting a misprediction in the je
(actually should have been taken), especially in the general case. Current x86 CPUs don't try to look for the fall-through path rejoining the taken path or figuring out anything about what it does.
If cl
wasn't ready for a long time, so a mispredict wasn't discovered for a long time, many instructions after the or dl, r9b
could have executed using the wrong inputs. In the general case the only way to reliably + efficiently recover is to discard all work done on instructions from the "wrong" path. Detecting that vpcmpeqb xmm0, [rbx - 16]
for example still runs either way is hard, and not looked for. (Modern Intel, since Sandybridge, has a Branch Order Buffer (BOB) that snapshots the RAT on branches, allowing efficient rollback to the branch miss as soon as execution detects it while still allowing out-of-order execution on earlier instructions to continue during the rollback. Before that a branch miss had to roll back to the retirement state.)
Some CPUs for some non-x86 ISAs (e.g. PowerPC I think) have experimented with turning forward branches that skip exactly 1 instruction into predication (data dependency) instead of speculating past them. e.g. Dynamic Hammock Predication
for Non-predicated Instruction Set Architectures discusses this idea, and even deciding whether to predicate or not on a per-branch basis. If your branch-prediction history says this branch predicts poorly, predicating it instead could be good. (A Hammock branch is one that jumps forward over one or a couple instructions. Detecting the exactly 1 instruction case is trivial on an ISA with fixed-width instruction words, like a RISC, but hard on x86.)
In this case, x86 has a cmovcc
instruction, an ALU select operation that produces one of the two inputs depending on a flag condition. cmove r9d, r8d
instead of cmp
/je
would make this immune to branch mispredictions, but at the cost of introducing a data dependency on cl
and r8d
for instructions that use r9d
. Intel CPU don't try to do this for you.
(On Broadwell and later Intel, cmov is only 1 uop, down from 2. cmp/jcc is 1 uop, and the mov
itself is also 1 uop, so in the not-taken case cmov
is also fewer uops for the front-end. And in the taken case, a taken branch can introduce bubbles in the pipeline even if predicted correctly, depending on how high throughput the code is: Whether queues between stages can absorb it.)
See gcc optimization flag -O3 makes code slower than -O2 for a case where CMOV is slower than a branch because introducing a data dependency is bad.