Is the if-branch faster than the else branch?
I came across this very nice infographic which gives a rough estimate of the CPU cycles used for certain operations. While studying it I noticed an entry "Right branch of if", which I assume is the branch "if" is going to take if the condition is met (EDIT: as pointed out in the comments, "Right" actually means "a correctly predicted branch"). It made me wonder if there's any (however minor) speed difference between the if-branch and the else branch.

Compare for example the following very condensed code:

Demo

#include <cstdio>

volatile int a = 2;

int main()
{
    if (a > 5) {
        printf("a > 5!");
        a = 2;
    } else {
        printf("a <= 5!");
        a = 3;
    }
}

Which generates this assembly for 64-bit x86:

.LC0:
        .string "a > 5!"
.LC1:
        .string "a <= 5!"
main:
        push    rcx
        mov     eax, DWORD PTR a[rip]
        cmp     eax, 5
        jle     .L2
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 2
        jmp     .L3
.L2:
        mov     edi, OFFSET FLAT:.LC1
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 3
.L3:
        xor     eax, eax
        pop     rdx
        ret
a:
        .long   2

As you can see, the branch which calls printf for "a > 5" lies closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?

Ottillia answered 8/10, 2022 at 13:29 Comment(4)
A few lines down in the same infographic you can see "Wrong branch of if (branch misprediction)". Nothing to do with whether the condition is met or not, but rather whether the condition's result is predicted correctly or not.Westney
the article does not formally define what they mean by "Right branch", but in the paragraph about branching you find "if the CPU guesses correctly where the execution will go (that’s before actually calculating condition of if), then cost of branching operation is about 1-2 CPU cycles." And that's the 1-2 cycles denoted by "Right branch" in the graph.Break
Is the if-branch faster than the else branch? No, the predicted branch is faster than the mispredicted branch.Situla
You assume “right” branch means the true branch. The article implies the “right” branch is the correctly predicted branchParkin
To put it shortly: no on mainstream processors, but yes on some old/embedded processors.

Modern mainstream processors are very good at predicting conditions (assuming they are not executed for the first time, or the test can be done early). When a branch is correctly predicted, there is almost no cost (besides the use of branch units on some specific execution ports). The processor can speculatively fetch and execute the instructions of the predicted conditional block. When the branch is mispredicted, the processor must perform a kind of rollback operation that is pretty expensive (typically 5-15 cycles).

Some embedded and older processors use static prediction algorithms. For example, they can assume that the branch is never taken. On such processors, executing the if block is typically faster than the else one, assuming the compiler does not reorder the blocks (which is quite frequent when optimizations are enabled). Built-in hints can be provided by developers to help the compiler generate code that executes more efficiently on processors using static prediction. Profile-guided optimization can be used to automatically find conditions that are often true/false and reorder branches accordingly for the sake of performance.

Mainstream (server, desktop and high-end mobile) processors mainly use dynamic prediction algorithms. For example, they can track and memorize whether a branch was taken so as to estimate the probability of it being taken in the future. A processor typically tracks only a limited set of conditional branches. Thus, when code has too many (nested) conditional branch instructions, a static prediction algorithm can be used as a fallback.

It is worth mentioning that the prediction information can be reset/flushed in some cases, for example when a process is pre-empted (as pointed out by @HenriqueBucher). This means prediction can be far less effective when there are many context switches. Note that speculation can be partially controlled by some specific instructions so as to mitigate vulnerabilities like Spectre.

A jump to a far, unpredicted location can be expensive since the processor needs to fetch instructions that may not be in the instruction cache. In your case it almost certainly does not matter on mainstream x86-64 processors, since a recent x86-64 processor is expected to load all the instructions of such a small program into the cache pretty quickly. For example, a Skylake processor can fetch 16 bytes/cycle from the instruction cache, while a Zen 2 processor reaches 32 bytes/cycle. Both load 64-byte cache lines into the instruction cache.

There is not a lot of public information about the branch prediction algorithms of recent Intel processors. That of AMD Zen 2+ processors is pretty well documented: they use an efficient TAGE predictor combined with a perceptron to predict the outcome of a condition based on past statistics. You can find detailed information about its behaviour here.

Newsreel answered 8/10, 2022 at 14:54 Comment(2)
One point: dynamic prediction is reset as soon as the process is preempted. At that point the branch prediction table has to be rebuilt, so static prediction rules still count at thread restart.Rye
@HenriqueBucher Good point indeed. I added this in the answer. Thank you.Fortunna
No, there's no real difference between the if and else branch, and that's not what the graphic you linked means by the "right" branch. By "right", it means that the CPU's branch prediction correctly guessed which branch would be taken before the comparison was done.

Modern CPUs are pipelined. They start working on the next instruction before they're done with the current one. This can significantly improve performance, since it means that the CPU can keep all of its different sub-components (the instruction decoder, different parts of the ALU, the memory controller, etc.) busy all the time instead of leaving them idle while another component of the CPU does work. E.g. it can be performing an addition, fetching an operand from memory, and decoding an instruction all at the same time.

This pipelining depends on knowing what instruction will be executed next though, and conditional branch instructions throw a wrench in it. In your example, the CPU doesn't know if the next instruction that will need to be executed after jle .L2 is mov edi, OFFSET FLAT:.LC0 or mov edi, OFFSET FLAT:.LC1 until after the cmp instruction is done, so it guesses. It starts working on one of the branches, and when the cmp finally finishes it looks to see if it guessed right. If it did, great, the partial work it's done on the following instructions is valid and it keeps going like normal. If it guessed wrong though, all of the work it did on instructions after the jle has to be thrown away, and it starts working on the other branch, which costs some time. An occasional incorrect guess won't make a noticeable performance difference, but if it guesses wrong very often it can start to add up and make a big difference.

See also the highest rated answer on StackOverflow.


Note that on some older or embedded CPUs the branch prediction algorithm is essentially just "always guess the first one", so on such CPUs the else path would be slower since branch prediction would never guess it by default. On those CPUs GCC's __builtin_expect or C++20's [[likely]] attributes can be useful to tell the compiler which branch is more likely, so that it will generate assembly code that makes the more likely path "the first one", even if it's the else.

Jacquard answered 8/10, 2022 at 14:42 Comment(6)
This is not true. In older CPUs it made a ton of difference, to the point that the __builtin_expect and respective likely macro and the likely attribute in C++20 were created gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html en.cppreference.com/w/cpp/language/attributes/likelyRye
@HenriqueBucher I'm not sure what part you're trying to correct? Things like __builtin_expect are inputs into the guess that the branch prediction algorithm makes. Unless you mean that the else is slower on a CPU where the branch prediction algorithm is essentially "always guess true"?Jacquard
Yes the if branch is computed through speculation, not the else. Modern cpus use dynamic prediction though but older ones are like thatRye
@HenriqueBucher Added a note about that.Jacquard
Things like __builtin_expect are inputs into the guess that the branch prediction algorithm makes - No, not at all. __builtin_expect influences the compiler's decision on branch layout, which site will have a taken vs. not-taken branch (which is a smaller difference than predicted vs. mispredicted, but I-cache locality and pipeline effects are still a thing). Also on the decision to do if-conversion to branchless if it's possible. Compile time branch hints don't have any mechanism (on most ISAs) to even hint run-time prediction. (@HenriqueBucher)Volar
See Is it possible to tell the branch predictor how likely it is to follow the branch?. And for a code-gen example, What is the advantage of GCC's __builtin_expect in if else statements?Volar
"Right branch of if" which I assume is the branch "if" is going to take if the condition is met.

No, it's talking about correct branch prediction, vs. lower down the "wrong branch of if (branch misprediction)" entry in the table. This is a thing whether there's an else or not in the C source.

A conditional branch needs to be predicted by the front-end to not stall, but guessing wrong means it has to rewind and discard mis-speculated instructions. This applies whether there's an else part or not. See Mysticial's famous answer on Why is processing a sorted array faster than processing an unsorted array?

An else just requires an extra unconditional branch in one block of instructions so as not to run the other. (In your case, that's the jmp .L3.) The other option is to put one block somewhere else (e.g. after the ret at the bottom of the function) so there are no taken branches on the fast path, but the other path has a taken jcc and a taken jmp back to rejoin the common path after the if/else.


As you can see, the branch which calls printf for "a > 5" lies closer to the cmp instruction. Looking at the instruction cache, a situation might therefore arise where the instructions to call printf are already loaded in a cache line, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "right" branch before the branch prediction pattern is established?

Yes, branch layout for I-cache locality is a thing, mostly or totally separate from branch prediction.

As you say, I-cache locality is a thing, so putting the rarely-executed block down at the end of the function (after the ret) is something a compiler can do if it knows (or you tell it with [[likely]] or [[unlikely]]) that one side of an if/else is essentially "cold".

Compilers have various heuristics for guessing whether an if condition is normally true or not, so they can still make a guess even without data from training runs (profile-guided optimization). They can lay out the machine code with either the if or the else part first.

Also, the not-taken fall-through path of a branch is often slightly cheaper than the taken side, in terms of front-end throughput. This is assuming correct prediction. Modern CPUs fetch machine code in large contiguous blocks, like 16 or 32 bytes at a time, but on a taken branch they might not get to use all of those bytes, and have to do another fetch before they can see more instructions.

A few ISAs have had machine-code branch hints that compilers can use to tell the hardware branch predictor which way a branch is likely to go, but other than that, there's no mechanism for profile-guided optimization or [[likely]] / [[unlikely]] or GNU C __builtin_expect to influence branch prediction. Other than static prediction via layout on CPUs that fall back to static prediction; that doesn't include Intel since Sandybridge: Why did Intel change the static branch prediction mechanism over these years?.

BTW, GCC didn't do a great job here. Both sides of the if/else call printf and store to a, just with different values. And it's already using a push/pop for stack alignment, so it could just save/restore RBX to have somewhere to store the compare result across the printf. It could even have made branchless code, using cmov to select between two format-string pointers.

hand_written_main:
   push    rbx                     # realign the stack and save RBX
   mov     ebx, 2
   mov     edi, OFFSET FLAT:.LC0   # we can set regs unconditionally, no else needed

   cmp dword ptr [RIP+a], 5       # splitting to mov+cmp might be more efficient for macro-fusion of cmp/jcc and maybe micro-fusion
   jg      .L2                     # keep the "a > 5" defaults when the condition is true
   mov     edi, OFFSET FLAT:.LC1   # or RIP-rel LEA in a PIE
   mov     ebx, 3
.L2:

   xor     eax, eax          # number of XMM args in regs for variadic functions
   call    printf
   mov     DWORD PTR a[rip], ebx
   pop     rbx
   ret

Interesting to note that this has more instructions executed for one side of the branch, since mov ebx, imm32 and mov edi, imm32 execute unconditionally, and then we run them again if the branch isn't taken. This is like int ebx = 2; / if(a<=5) ebx=3; instead of the =2 in an else. It's a tradeoff of keeping the branching simpler (no jmp anywhere) vs. running an extra instruction. Modern x86 CPUs are pretty wide, and the extra instructions are independent of anything so there's instruction-level parallelism here.

And the fun branchless way, packing the strings together 8 bytes apart so we can generate the address of one or the other with a single addressing mode.

hand_written_main:
   push    rbx                     # realign the stack and save RBX

   xor     ebx, ebx
   cmp   dword ptr [RIP+a], 5
   setle   bl                       # bl = (a<=5)
   lea     edi, [gt5msg + rbx*8]    # In a PIE, we'd need [rdi + rbx*8] with a separate RIP-relative LEA
   add     bl, 2                    # 2 + (a<=5) = 2 or 3

   xor     eax, eax          # number of XMM args in regs for variadic functions
   call    printf
   mov     DWORD PTR a[rip], ebx
   pop     rbx
   ret

.section .rodata
.p2align 3
 gt5msg: .asciz "a > 5!"       # <= 8 bytes long, in fact 7 so there's an extra 1 byte of padding.
.p2align 3
 le5msg: .asciz "a <= 5!"     # starts exactly 8 bytes after the previous message
Volar answered 8/10, 2022 at 20:16 Comment(1)
That's a great answer thank you very much! On another note: What book can you recommend for learning x86 assembly? Feels like I'm missing out some things if I don't go after it now.Ottillia

© 2022 - 2024 — McMap. All rights reserved.