"Right branch of if" which I assumes is the branch "if" is going to take if the condition is met.
No, it's talking about correct branch prediction, as opposed to the "wrong branch of if (branch misprediction)" entry lower down in the table. This is a thing whether there's an else or not in the C source.
A conditional branch needs to be predicted by the front-end to avoid stalling, but guessing wrong means the CPU has to rewind and discard the mis-speculated instructions. This applies whether there's an else part or not. See Mysticial's famous answer on Why is processing a sorted array faster than processing an unsorted array?
An else just requires an extra unconditional branch at the end of one block of instructions to jump over the other. (In your case, that's the jmp .L3.) The other option is to put one block somewhere else entirely (e.g. after the ret at the bottom of the function) so there are no taken branches on the fast path, while the other path has a taken jcc and a taken jmp back to rejoin the main path after the if/else.
As you can see, the "right" branch which calls printf for "a > 5" lives closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?
Yes, branch layout for I-cache locality is a thing, mostly or totally separate from branch prediction.
As you say, I-cache locality is a thing, so putting the rarely-executed block down at the end of the function (after the ret) is something a compiler can do if it knows (or if you tell it with [[likely]] or [[unlikely]]) that one side of an if/else is essentially "cold".
Compilers have various heuristics for guessing whether an if condition is normally true or not, so they can still make a guess even without data from training runs (for profile-guided optimization). They can lay out the machine code with either the if or the else part first.
Also, the not-taken fall-through path of a branch is often slightly cheaper than the taken side, in terms of front-end throughput. This is assuming correct prediction. Modern CPUs fetch machine code in large contiguous blocks, like 16 or 32 bytes at a time, but on a taken branch they might not get to use all of those bytes, and have to do another fetch before they can see more instructions.
A few ISAs have had machine-code branch hints that compilers can use to tell the hardware branch predictor which way a branch is likely to go, but other than that, there's no mechanism for profile-guided optimization, [[likely]] / [[unlikely]], or GNU C __builtin_expect to influence branch prediction. The exception is static prediction via code layout, on CPUs that fall back to static prediction; that doesn't include Intel since Sandy Bridge: Why did Intel change the static branch prediction mechanism over these years?
BTW, GCC didn't do a great job here. Both sides of the if/else call printf and store to a, just with different values. And it's already using a push/pop for stack alignment, so it could just save/restore RBX to have somewhere to keep the compare result across the printf call. It could even have made branchless code, using cmov to select between two format-string pointers.
hand_written_main:
push rbx # realign the stack and save RBX
mov ebx, 2
mov edi, OFFSET FLAT:.LC0 # we can set regs unconditionally, no else needed
cmp dword ptr [RIP+a], 5 # splitting to mov+cmp might be more efficient for macro-fusion of cmp/jcc and maybe micro-fusion
jg .L2 # taken when a > 5: keep the defaults (ebx=2 and .LC0)
mov edi, OFFSET FLAT:.LC1 # or RIP-rel LEA in a PIE
mov ebx, 3
.L2:
xor eax, eax # number of XMM args in regs for variadic functions
call printf
mov DWORD PTR a[rip], ebx
pop rbx
ret
Interesting to note that this executes more instructions for one side of the branch, since mov ebx, imm32 and mov edi, imm32 run unconditionally, and then we run them again if the branch isn't taken. This is like int ebx = 2; / if(a<=5) ebx=3; instead of the =2 in an else. It's a tradeoff of keeping the branching simpler (no jmp anywhere) vs. running a couple of extra instructions. Modern x86 CPUs are pretty wide, and the extra instructions are independent of everything else, so there's instruction-level parallelism to soak them up.
And the fun branchless way, packing the strings together 8 bytes apart so we can generate the address of one or the other with a single addressing mode.
hand_written_main:
push rbx # realign the stack and save RBX
xor ebx, ebx
cmp dword ptr [RIP+a], 5
setle bl # bl = (a<=5)
lea edi, [gt5msg + rbx*8] # In a PIE, we'd need [rdi + rbx*8] with a separate RIP-relative LEA
add bl, 2 # 2 + (a<=5) = 2 or 3
xor eax, eax # number of XMM args in regs for variadic functions
call printf
mov DWORD PTR a[rip], ebx
pop rbx
ret
.section .rodata
.p2align 3
gt5msg: .asciz "a > 5!" # <= 8 bytes long, in fact 7 so there's an extra 1 byte of padding.
.p2align 3
le5msg: .asciz "a <= 5!" # starts exactly 8 bytes after the previous message