Does GCC generate suboptimal code for static branch prediction?

In my university course I was told that, by convention, it is better to place the more probable condition in the if branch rather than in the else branch, which may help the static branch predictor. For instance:

if (check_collision(player, enemy)) { // very unlikely to be true
    doA();
} else {
    doB();
}

may be rewritten as:

if (!check_collision(player, enemy)) {
    doB();
} else {
    doA();
}

I found a blog post Branch Patterns, Using GCC, which explains this phenomenon in more detail:

Forward branches are generated for if statements. The rationale for making them not likely to be taken is that the processor can take advantage of the fact that instructions following the branch instruction may already be placed in the instruction buffer inside the Instruction Unit.

Next to it, it says (emphasis mine):

When writing an if-else statement, always make the "then" block more likely to be executed than the else block, so the processor can take advantage of instructions already placed in the instruction fetch buffer.

Finally, there is an article written by Intel, Branch and Loop Reorganization to Prevent Mispredicts, which summarizes this with two rules:

Static branch prediction is used when there is no data collected by the microprocessor when it encounters a branch, which is typically the first time a branch is encountered. The rules are simple:

  • A forward branch defaults to not taken
  • A backward branch defaults to taken

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common.
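
To make that advice concrete, here is a minimal sketch of what "check the most common cases first" could look like; the helper functions and the frequencies are invented for the example:

// hypothetical helpers; the percentages are made-up frequencies
if (player_is_moving(player)) {          // ~90% of frames: most common case first
    update_position(player);
} else if (player_is_idle(player)) {     // ~9% of frames
    play_idle_animation(player);
} else {                                 // rare fallback, checked last
    handle_special_state(player);
}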

As I understand it, the idea is that a pipelined CPU can keep fetching instructions sequentially from the instruction cache without interrupting that flow by jumping to another address within the code segment. I am aware, though, that this may be largely oversimplified in the case of modern CPU microarchitectures.

However, it looks like GCC doesn't respect these rules. Given the code:

extern void foo();
extern void bar();

int some_func(int n)
{
    if (n) {
        foo();
    }
    else {
        bar();
    }
    return 0;
}

it generates (version 6.3.0 with -O3 -mtune=intel):

some_func:
        lea     rsp, [rsp-8]
        xor     eax, eax
        test    edi, edi
        jne     .L6            ; forward branch, taken when n is nonzero (the "then" case)
        call    bar
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret
.L6:
        call    foo
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret

The only way I found to force the desired behavior is to rewrite the if condition using __builtin_expect as follows:

if (__builtin_expect(n, 1)) { // force n condition to be treated as true

so the assembly code would become:

some_func:
        lea     rsp, [rsp-8]
        xor     eax, eax
        test    edi, edi
        je      .L2             ; forward branch, now taken only in the unlikely case (n == 0)
        call    foo
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret
.L2:
        call    bar
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret
Moschatel answered 26/1, 2017 at 18:49 Comment(13)
https://mcmap.net/q/14994/-how-do-the-likely-unlikely-macros-in-the-linux-kernel-work-and-what-is-their-benefit/905902 The Linux kernel uses macros (calling __builtin_expect) to exploit a priori knowledge about the conditional branches.Chihuahua
Modern Intel CPUs don't use static branch prediction. I also don't think GCC promises anywhere to consider the "true" clause of an if/else statement to be the most likely alternative. You're supposed to use __builtin_expect, like wildplasser mentioned, to tell it which is more likely. Or better yet, profile guided optimization.Desiderata
@RossRidge: Could you point to some reference? The Intel article that I linked also says that Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code, so I guess there might be some overhead at the beginning, due to a missed static branch prediction.Moschatel
See Agner Fog's microarchitecture manual. Section 3.16 "Static Prediction in PM and Core 2": "These processors do not use static prediction. The predictor simply makes a random prediction the first time a branch is seen, depending on what happens to be in the BTB entry that is assigned to the new branch.". agner.org/optimizeDesiderata
If your program is short enough that you can notice the difference from a single bad branch prediction, then thinking about optimization is a waste of time and brain cells.Arnica
@rici: You are right. The example is oversimplified, so it's easier to analyze. For a larger program that contains a lot of branches (especially one that exceeds the limits of the CPU's branch prediction cache), the difference may become significant.Moschatel
Even in a full-scale program it's unlikely to matter. Unless you're using a processor with only static prediction, most jumps are going to be dynamically predicted.Desiderata
@grzegorz: I was reacting to "I guess there might be some overhead at the beginning, due to missed static branch prediction" (emphasis added) and I think the same principle applies. One thing that sometimes invalidates branch prediction caches is misguidedly unrolling loops, thereby increasing the number of branch points. I know that's not directly relevant here, but I offer it as a cautionary tale.Arnica
As with any optimization this low-level, you have to have a fixed target platform and profile the real production code. Any simplification (profiling of a simplified sample) or theoretical reasoning is useless at this level, as the final performance is the result of many interconnected tiny details and rules; fixing one of them may break 5 others, ruining the performance in the end. While it is certainly possible for a human brain to reason about this, and even to figure out the correct optimal way, it would take an incredible budget of reasoning. Not worth it, when you can simply write the code and profile it.Handicapped
And compilers of course do generate sub-optimal machine code. Nobody wants to compile with NP-hard algorithms just to gain 1-10% performance, as generating truly "optimal" code is an NP-hard kind of problem.Handicapped
For some reason, gcc's profile_estimate pass guesses that n has a 54% chance of being 0... (see -fdump-tree-all-all). Normally it has a heuristic that == is more likely false, but it doesn't seem to be used here. You could file it on gcc's bugzilla to ask about it. Note that if you compile with -fprofile-generate, then run your program, then recompile with -fprofile-use, gcc will have access to real statistics and make better decisions.Seward
gcc.gnu.org/bugzilla/show_bug.cgi?id=79489 contains a detailed analysis, I highly recommend reading Martin's comment.Seward
How are the rules obeyed/disobeyed when a chain of branches is executed?Unsling

The short answer: no, it is not.

GCC does a metric ton of non-trivial optimizations, and one of them is guessing branch probabilities from the control flow graph.

According to the GCC manual:

-fno-guess-branch-probability

Do not guess branch probabilities using heuristics.

GCC uses heuristics to guess branch probabilities if they are not provided by profiling feedback (-fprofile-arcs). These heuristics are based on the control flow graph. If some branch probabilities are specified by __builtin_expect, then the heuristics are used to guess branch probabilities for the rest of the control flow graph, taking the __builtin_expect info into account. The interactions between the heuristics and __builtin_expect can be complex, and in some cases, it may be useful to disable the heuristics so that the effects of __builtin_expect are easier to understand.

-freorder-blocks may swap branches as well.

Also, as the OP mentioned, the behavior can be overridden with __builtin_expect.
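
In practice, the builtin is usually wrapped in likely/unlikely macros, roughly as the Linux kernel defines them (a sketch; the macro names are just a common convention, not something provided by GCC itself):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* hint that a collision is rare, so the doB() path stays on the fall-through */
if (unlikely(check_collision(player, enemy))) {
    doA();
} else {
    doB();
}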

Proof

Look at the following listing.

#include <stdio.h>

void doA() { printf("A\n"); }
void doB() { printf("B\n"); }
int check_collision(void* a, void* b)
{ return a == b; }

void some_func (void* player, void* enemy) {
    if (check_collision(player, enemy)) {
        doA();
    } else {
        doB();
    }
}

int main() {
    // warming up gcc statistics
    some_func((void*)0x1, NULL);
    some_func((void*)0x2, NULL);
    some_func((void*)0x3, NULL);
    some_func((void*)0x4, NULL);
    some_func((void*)0x5, NULL);
    some_func(NULL, NULL);
    return 0;
}

It is obvious that check_collision will return 0 most of the time. So the doB() branch is likely, and GCC can guess this:

gcc -O main.c -o opt.a
objdump -d opt.a

The asm of some_func is:

sub    $0x8,%rsp
cmp    %rsi,%rdi
je     6c6 <some_func+0x18>    # taken only when player == enemy: jump forward to the unlikely doA() block
mov    $0x0,%eax
callq  68f <doB>               # likely path: falls straight through to doB()
add    $0x8,%rsp
retq   
mov    $0x0,%eax
callq  67a <doA>               # out-of-line block for the unlikely doA() call
jmp    6c1 <some_func+0x13>

But, of course, we can prevent GCC from being too smart:

gcc -fno-guess-branch-probability main.c -o non-opt.a
objdump -d non-opt.a

And we will get:

push   %rbp
mov    %rsp,%rbp
sub    $0x10,%rsp
mov    %rdi,-0x8(%rbp)
mov    %rsi,-0x10(%rbp)
mov    -0x10(%rbp),%rdx
mov    -0x8(%rbp),%rax
mov    %rdx,%rsi
mov    %rax,%rdi
callq  6a0 <check_collision>
test   %eax,%eax
je     6ef <some_func+0x33>    # jump to the doB() block when check_collision returned 0
mov    $0x0,%eax
callq  67a <doA>               # "then" block kept first, in source order
jmp    6f9 <some_func+0x3d>
mov    $0x0,%eax
callq  68d <doB>
nop
leaveq 
retq  

So GCC will leave branches in source order.

I used gcc 7.1.1 for those tests.
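
As pointed out in the comments, profile-guided optimization replaces these heuristics with real branch statistics. A rough sketch of that workflow (the file and binary names are placeholders):

gcc -O2 -fprofile-generate main.c -o prog   # build an instrumented binary
./prog                                      # run it on representative input; this writes *.gcda profile data
gcc -O2 -fprofile-use main.c -o prog        # rebuild; gcc now lays out branches from the measured probabilities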

Deposit answered 9/10, 2017 at 10:53 Comment(1)
To be fair, you should compile both versions with the same -O flag, and then include -fno-guess-branch-probability in the second. The code without optimization is totally different from the first one with -O, and you can't really conclude it's just the -fno-guess-branch-probability flag that changed the block order, because there are dozens of other flags and optimizations applied to the -O build but not to the unoptimized one.Caiaphas

I Think That You've Found A "Bug"

The funny thing is that optimization for space and no optimization are the only cases in which the "optimal" instruction code is generated: gcc -S [-O0 | -Os] source.c

some_func:
LFB0:
       pushl   %ebp
       movl    %esp, %ebp
       subl    $8, %esp
       cmpl    $0, 8(%ebp)
       je      L2
       call    _foo
       jmp     L3
L2:
       call    _bar
L3:
       movl    $0, %eax
       # Or, for -Os:
       # xorl    %eax, %eax
       leave
       ret

My point is that ...


some_func:
LFB0:
       pushl   %ebp
       movl    %esp, %ebp
       subl    $8, %esp
       cmpl    $0, 8(%ebp)
       je      L2
       call    _foo

... up to and through the call to foo everything is "optimal", in the traditional sense, regardless of the exit strategy.

Optimality is ultimately determined by the processor, of course.

Disendow answered 27/7, 2017 at 22:42 Comment(9)
Really? Another down vote with no explanation? How does that help anybody? This is the exact assembly output (minus directives) and it is optimal for traditional CPUs -- it jumps on the else, not the true condition. The corresponding options to gcc are counter-intuitive. There is no optimal code, in this situation, for modern x86 CPUs.Disendow
The entire function you show definitely isn't optimal for any CPU. It uses mov $0, %eax to zero the return value, instead of xor %eax,%eax. (This must be the -O0 output, not the -Os output.) Also, tail duplication would avoid the jmp L3 on the call foo path. Also, no point setting up a frame pointer. Not downvoting because it's at least interesting to see how gcc lays out branches here.Loyal
We're specifically speaking about branch prediction. So that entails a jmp instruction, at some point. The point is that there is no jmp to call foo... the jmp is to call bar. And the only difference between -O0 and -Os, in this sitch, is how zero is returned to the calling environment. --The code is optimal for (older) x86 CPUs using traditional branch prediction (ie. no jmp when taking the then case).Disendow
I'm talking about the jmp, not the je. Of course you need a je (unless you branchlessly choose between two function pointers), but the code following the je could be call _foo / xor %eax,%eax / leave / ret to avoid a jmp. Tail duplication increases code size but decreases dynamic instruction count, and decreases the amount of taken jumps. For the call _foo case, there are no taken jmp or jcc instructions.Loyal
Branch prediction, in this case, is concerned with the jump before the call. The jmp L3 is, to my knowledge, just a necessary side-effect; to get to the exit code.Disendow
Even unconditional branches need to be predicted before they're decoded to avoid decode bubbles. They can have at least a small negative impact on the front-end. And no, you don't need them. See the asm in the question, which does exactly the kind of tail-duplication I was talking about. Each call is followed by a duplicated block of a couple instructions ending with ret, instead of using a jmp to a shared tail. (And BTW, I haven't downvoted this answer, even though I disagree that it's a bug. It's a somewhat interesting data point.)Loyal
Thank you for taking the time to discuss this Peter. I respect your opinion. So, directing the conversation back towards the point which I'm attempting to make (along with the OP), why does -O3 jump to foo while -Os / -O0 do not? Would a traditional x86 not attempt to prefetch the else (bar) instead of the then (foo), in such case? As is, it seems to contradict the expected behavior.Disendow
As @ivaigult's answer explains gcc guesses that if(n) will be false more often than true, so it lays out the branches on purpose to put call bar on the path with no taken branches. This is the intended behaviour. If you think you know which way a branch will go, you can use __builtin_expect (or a likely / unlikely wrapper macro) to tell gcc about it.Loyal
I guess -Os doesn't enable -fguess-branch-probability, or else it makes a different guess than -O3. (I just checked on godbolt: -Os does include -fguess-branch-probability. godbolt.org/g/WLiZGz) Anyway, gcc's documentation is pretty clear that it doesn't just assume the first part of an if/else is the fast path.Loyal
