Small branches in modern CPUs
How do modern CPUs like Kaby Lake handle small branches? (In the code below it is the jump to label .LBB1_67.) From what I know the branch should not be harmful, because the jump distance is smaller than the 16-byte block size of the decode window.

Or is it possible that due to some macro op fusion the branch will be completely elided?

        sbb     rdx, qword ptr [rbx - 8]
        setb    r8b
        setl    r9b
        mov     rdi, qword ptr [rbx]
        mov     rsi, qword ptr [rbx + 8]
        vmovdqu xmm0, xmmword ptr [rbx + 16]
        cmp     cl, 18
        je      .LBB1_67
        mov     r9d, r8d
.LBB1_67:                               #   in Loop: Header=BB1_63 Depth=1
        vpcmpeqb        xmm0, xmm0, xmmword ptr [rbx - 16]
        vpmovmskb       ecx, xmm0
        cmp     ecx, 65535
        sete    cl
        cmp     rdi, qword ptr [rbx - 32]
        sbb     rsi, qword ptr [rbx - 24]
        setb    dl
        and     dl, cl
        or      dl, r9b
Oak answered 2/3, 2019 at 22:58
There are no special cases for short branch distances in any x86 CPUs. Even unconditional jmp to the next instruction (architecturally a nop) needs correct branch prediction to be handled efficiently; if you put enough of those in a row you run out of BTB entries and performance falls off a cliff. Slow jmp-instruction

Fetch/decode is only a minor problem; yes a very short branch within the same cache line will still hit in L1i and probably uop cache. But it's unlikely that the decoders would special-case a predicted-taken forward jump and make use of pre-decode instruction-boundary finding from one block that included both the branch and the target.

When the instruction is being decoded to uops and fed into the front-end, register values aren't available; those are only available in the out-of-order execution back-end.

The major problem is that when the instructions after .LBB1_67: execute, the architectural state is different depending on whether the branch was taken or not. And so is the micro-architectural state (RAT = Register Allocation Table).

Either:

  • r9 depends on the sbb/setl result (mov r9d, r8d didn't run)
  • r9 depends on the sbb/setb result (mov r9d, r8d did run)

Conditional branches are called "control dependencies" in computer-architecture terminology. Branch-prediction + speculative execution avoids turning control dependencies into data dependencies. If the je was predicted not taken, the setl result (the old value of r9) is overwritten by mov and is no longer available anywhere.
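A minimal C sketch of the asm's two outcomes (the names cl_val, setb_val, setl_val are hypothetical stand-ins for cl, r8b, and the setl result in r9b; this is the logic, not the actual generated code):

```c
#include <stdint.h>

/* Sketch of the branchy select in the question's asm:
 *   cmp cl, 18 / je .LBB1_67 / mov r9d, r8d / .LBB1_67:
 * The je is taken when cl == 18, skipping the mov, so r9 keeps the
 * setl result; otherwise the mov overwrites it with the setb result. */
uint8_t select_flag(uint8_t cl_val, uint8_t setb_val, uint8_t setl_val) {
    uint8_t r9 = setl_val;    /* setl r9b ran earlier */
    if (cl_val != 18)         /* fall-through path of je .LBB1_67 */
        r9 = setb_val;        /* mov r9d, r8d destroys the setl result */
    return r9;                /* later consumed by or dl, r9b */
}
```

After the mov runs on the fall-through path, the old (setl) value of r9 exists nowhere, which is exactly why a mispredicted-not-taken je cannot be patched up after the fact.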

There's no way to recover from this after detecting that the je was mispredicted (it actually should have been taken), at least not in the general case. Current x86 CPUs don't try to detect the fall-through path rejoining the taken path, or to figure out anything about what it does.

If cl wasn't ready for a long time, so a mispredict wasn't discovered for a long time, many instructions after the or dl, r9b could have executed using the wrong inputs. In the general case the only way to reliably + efficiently recover is to discard all work done on instructions from the "wrong" path. Detecting that vpcmpeqb xmm0, [rbx - 16] for example still runs either way is hard, and not looked for. (Modern Intel, since Sandybridge, has a Branch Order Buffer (BOB) that snapshots the RAT on branches, allowing efficient rollback to the branch miss as soon as execution detects it while still allowing out-of-order execution on earlier instructions to continue during the rollback. Before that a branch miss had to roll back to the retirement state.)


Some CPUs for some non-x86 ISAs (e.g. PowerPC I think) have experimented with turning forward branches that skip exactly 1 instruction into predication (data dependency) instead of speculating past them. e.g. Dynamic Hammock Predication for Non-predicated Instruction Set Architectures discusses this idea, and even deciding whether to predicate or not on a per-branch basis. If your branch-prediction history says this branch predicts poorly, predicating it instead could be good. (A Hammock branch is one that jumps forward over one or a couple instructions. Detecting the exactly 1 instruction case is trivial on an ISA with fixed-width instruction words, like a RISC, but hard on x86.)

In this case, x86 has a cmovcc instruction, an ALU select operation that produces one of its two inputs depending on a flag condition. cmovne r9d, r8d (still after the cmp) instead of je over the mov would make this immune to branch mispredictions, but at the cost of introducing a data dependency on cl and r8d for instructions that use r9d. Intel CPUs don't try to do this for you.

(On Broadwell and later Intel, cmov is only 1 uop, down from 2. cmp/jcc is 1 uop, and the mov itself is also 1 uop, so in the not-taken case cmov is also fewer uops for the front-end. And in the taken case, a taken branch can introduce bubbles in the pipeline even if predicted correctly, depending on how high throughput the code is: Whether queues between stages can absorb it.)
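As a sketch, the same selection written branchlessly in C (same hypothetical names as above; compilers typically emit a cmov for a ternary like this, though they are not obligated to):

```c
#include <stdint.h>

/* Branchless version: no control dependency to mispredict, but the
 * result now has a data dependency on all three inputs, so consumers
 * of r9 must wait for cl_val even when it would have been irrelevant. */
uint8_t select_flag_cmov(uint8_t cl_val, uint8_t setb_val, uint8_t setl_val) {
    return (cl_val == 18) ? setl_val : setb_val;   /* cmov pattern */
}
```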

See gcc optimization flag -O3 makes code slower than -O2 for a case where CMOV is slower than a branch because introducing a data dependency is bad.
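A sketch of the kind of loop where that happens (a hypothetical example in the same spirit, not the linked question's exact code): with mostly-predictable data, the branch predicts well and the compare stays off the critical path, while a cmov form would chain each iteration's compare into the loop-carried dependency through the accumulator.

```c
#include <stddef.h>

/* With predictable data, the branchy form lets OoO exec speculate past
 * the compare, so the loop-carried chain through s is just an add.
 * A cmov form, s += (a[i] > 0) ? a[i] : 0;, puts the compare result on
 * that chain every iteration. */
long sum_positive(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)        /* well-predicted when data is mostly one way */
            s += a[i];
    }
    return s;
}
```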

Psychodiagnosis answered 2/3, 2019 at 23:54

According to Henry's blog post on return address prediction (blog.stuffedcow.net/2018/04/ras-microbenchmarks), the BOB structure exists since SnB. — Remount

A quick test on Haswell shows that the performance impact of an easily predictable conditional jump over a single instruction is less than 0.0001%. — Remount

@HadiBrais: I assume you'd find the same thing for an easily predictable branch with a much longer distance, though. — Psychodiagnosis

@HadiBrais - right, the impact of a branch can easily be zero, but that's not a general statement: it can easily be more, too, depending on the surrounding code. Taken branches are a known bottleneck for the front-end, and it's easy to find examples where a taken branch slows down code by 100% or more versus the rearranged untaken version. — Manipulate

@PeterCordes: thanks for the tips! Didn't know about the Killer BOB. Are there rules of thumb to know when a branch doesn't need to be predicted (e.g. the register rX is ready C cycles before a cmp/je)? — Oak

@Stringer: Branches always need to be predicted before they're even decoded, let alone executed. Fetch grabs machine code in 16-byte or 32-byte blocks, and it has to know which block to grab next after the current one to avoid front-end bubbles (similarly for the uop cache). But if your code isn't bottlenecked on front-end throughput, the miss penalty can be mostly hidden by OoO exec, e.g. in a loop where the loop-exit mispredict can be "seen" well ahead of the main loop bottleneck. Avoid stalling pipeline by calculating conditional early — Psychodiagnosis
