Small branches in modern CPUs
How do modern CPUs like Kaby Lake handle small branches? (In the code below it is the jump to label .LBB1_67.) From what I know the branch should not be harmful, because the jump distance is smaller than the 16-byte block size of the decode window.

Or is it possible that due to some macro op fusion the branch will be completely elided?

        sbb     rdx, qword ptr [rbx - 8]
        setb    r8b
        setl    r9b
        mov     rdi, qword ptr [rbx]
        mov     rsi, qword ptr [rbx + 8]
        vmovdqu xmm0, xmmword ptr [rbx + 16]
        cmp     cl, 18
        je      .LBB1_67
        mov     r9d, r8d
.LBB1_67:                               #   in Loop: Header=BB1_63 Depth=1
        vpcmpeqb        xmm0, xmm0, xmmword ptr [rbx - 16]
        vpmovmskb       ecx, xmm0
        cmp     ecx, 65535
        sete    cl
        cmp     rdi, qword ptr [rbx - 32]
        sbb     rsi, qword ptr [rbx - 24]
        setb    dl
        and     dl, cl
        or      dl, r9b
Oak answered 2/3, 2019 at 22:58
There are no special cases for short branch distances in any x86 CPUs. Even unconditional jmp to the next instruction (architecturally a nop) needs correct branch prediction to be handled efficiently; if you put enough of those in a row you run out of BTB entries and performance falls off a cliff. Slow jmp-instruction

Fetch/decode is only a minor problem; yes a very short branch within the same cache line will still hit in L1i and probably uop cache. But it's unlikely that the decoders would special-case a predicted-taken forward jump and make use of pre-decode instruction-boundary finding from one block that included both the branch and the target.

When the instruction is being decoded to uops and fed into the front-end, register values aren't available; those are only available in the out-of-order execution back-end.

The major problem is that when the instructions after .LBB1_67: execute, the architectural state is different depending on whether the branch was taken or not. And so is the micro-architectural state (RAT = Register Allocation Table).

Either:

  • r9 depends on the sbb/setl result (mov r9d, r8d didn't run)
  • r9 depends on the sbb/setb result (mov r9d, r8d did run)

Conditional branches are called "control dependencies" in computer-architecture terminology. Branch-prediction + speculative execution avoids turning control dependencies into data dependencies. If the je was predicted not taken, the setl result (the old value of r9) is overwritten by mov and is no longer available anywhere.
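A minimal C sketch of the asm's two outcomes (the names cl_val, setb_val, setl_val are hypothetical stand-ins for cl, r8b, and the setl result in r9b; this is the logic, not the actual generated code):

```c
#include <stdint.h>

/* Sketch of the branchy select in the question's asm:
 *   cmp cl, 18 / je .LBB1_67 / mov r9d, r8d / .LBB1_67:
 * The je is taken when cl == 18, skipping the mov, so r9 keeps the
 * setl result; otherwise the mov overwrites it with the setb result. */
uint8_t select_flag(uint8_t cl_val, uint8_t setb_val, uint8_t setl_val) {
    uint8_t r9 = setl_val;    /* setl r9b ran earlier */
    if (cl_val != 18)         /* fall-through path of je .LBB1_67 */
        r9 = setb_val;        /* mov r9d, r8d destroys the setl result */
    return r9;                /* later consumed by or dl, r9b */
}
```

After the mov runs on the fall-through path, the old (setl) value of r9 exists nowhere, which is exactly why a mispredicted-not-taken je cannot be patched up after the fact.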

There's no way to recover from this after detecting that the je was mispredicted (it actually should have been taken), at least not in the general case. Current x86 CPUs don't try to detect the fall-through path rejoining the taken path, or to figure out anything about what it does.

If cl wasn't ready for a long time, so a mispredict wasn't discovered for a long time, many instructions after the or dl, r9b could have executed using the wrong inputs. In the general case the only way to reliably + efficiently recover is to discard all work done on instructions from the "wrong" path. Detecting that vpcmpeqb xmm0, [rbx - 16] for example still runs either way is hard, and not looked for. (Modern Intel, since Sandybridge, has a Branch Order Buffer (BOB) that snapshots the RAT on branches, allowing efficient rollback to the branch miss as soon as execution detects it while still allowing out-of-order execution on earlier instructions to continue during the rollback. Before that a branch miss had to roll back to the retirement state.)


Some CPUs for some non-x86 ISAs (e.g. PowerPC I think) have experimented with turning forward branches that skip exactly 1 instruction into predication (data dependency) instead of speculating past them. e.g. Dynamic Hammock Predication for Non-predicated Instruction Set Architectures discusses this idea, and even deciding whether to predicate or not on a per-branch basis. If your branch-prediction history says this branch predicts poorly, predicating it instead could be good. (A Hammock branch is one that jumps forward over one or a couple instructions. Detecting the exactly 1 instruction case is trivial on an ISA with fixed-width instruction words, like a RISC, but hard on x86.)

In this case, x86 has a cmovcc instruction, an ALU select operation that produces one of its two inputs depending on a flag condition. cmovne r9d, r8d (still after the cmp) instead of je over the mov would make this immune to branch mispredictions, but at the cost of introducing a data dependency on cl and r8d for instructions that use r9d. Intel CPUs don't try to do this for you.

(On Broadwell and later Intel, cmov is only 1 uop, down from 2. cmp/jcc is 1 uop, and the mov itself is also 1 uop, so in the not-taken case cmov is also fewer uops for the front-end. And in the taken case, a taken branch can introduce bubbles in the pipeline even if predicted correctly, depending on how high throughput the code is: Whether queues between stages can absorb it.)
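As a sketch, the same selection written branchlessly in C (same hypothetical names as above; compilers typically emit a cmov for a ternary like this, though they are not obligated to):

```c
#include <stdint.h>

/* Branchless version: no control dependency to mispredict, but the
 * result now has a data dependency on all three inputs, so consumers
 * of r9 must wait for cl_val even when it would have been irrelevant. */
uint8_t select_flag_cmov(uint8_t cl_val, uint8_t setb_val, uint8_t setl_val) {
    return (cl_val == 18) ? setl_val : setb_val;   /* cmov pattern */
}
```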

See gcc optimization flag -O3 makes code slower than -O2 for a case where CMOV is slower than a branch because introducing a data dependency is bad.
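A sketch of the kind of loop where that happens (a hypothetical example in the same spirit, not the linked question's exact code): with mostly-predictable data, the branch predicts well and the compare stays off the critical path, while a cmov form would chain each iteration's compare into the loop-carried dependency through the accumulator.

```c
#include <stddef.h>

/* With predictable data, the branchy form lets OoO exec speculate past
 * the compare, so the loop-carried chain through s is just an add.
 * A cmov form, s += (a[i] > 0) ? a[i] : 0;, puts the compare result on
 * that chain every iteration. */
long sum_positive(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)        /* well-predicted when data is mostly one way */
            s += a[i];
    }
    return s;
}
```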

Psychodiagnosis answered 2/3, 2019 at 23:54

According to Henry's blog post on return address prediction (blog.stuffedcow.net/2018/04/ras-microbenchmarks), the BOB structure exists since SnB. — Remount

A quick test on Haswell shows that the performance impact of an easily predictable conditional jump over a single instruction is less than 0.0001%. — Remount

@HadiBrais: I assume you'd find the same thing for an easily predictable branch with a much longer distance, though. — Psychodiagnosis

@HadiBrais - right, the impact of a branch can easily be zero, but that's not a general statement: it can easily be more, too, depending on the surrounding code. Taken branches are a known bottleneck for the front-end, and it's easy to find examples where a taken branch slows down code by 100% or more versus the rearranged untaken version. — Manipulate

@PeterCordes: thanks for the tips! Didn't know about the Killer BOB. Are there rules of thumb to know when a branch doesn't need to be predicted (e.g. the register rX is ready C cycles before a cmp/je)? — Oak

@Stringer: Branches always need to be predicted before they're even decoded, let alone executed. Fetch grabs machine code in 16-byte or 32-byte blocks, and it has to know which block to grab next after the current one to avoid front-end bubbles (similarly for the uop cache). But if your code isn't bottlenecked on front-end throughput, the miss penalty can be mostly hidden by OoO exec, e.g. in a loop where the loop-exit mispredict can be "seen" well ahead of the main loop bottleneck. Avoid stalling pipeline by calculating conditional early — Psychodiagnosis
