Performance difference between branch prediction and branch target prediction?
I'm writing some audio code where basically everything is a tiny loop. Branch prediction failures, as I understand them, are a big enough performance issue that I struggle to keep the code branch-free. But there is only so far that can take me, which got me wondering about the different kinds of branching.

In C++, the conditional branch to a fixed target:

int cond_fixed(bool p) {
    if (p) return 10;
    return 20;
}

And (if I understand this question correctly), the unconditional branch to a variable target:

struct base {
    virtual int foo() = 0;
};

struct a : public base {
    int foo() { return 10; }
};

struct b : public base {
    int foo() { return 20; }
};

int uncond_var(base* p) {
    return p->foo();
}

Are there performance differences? It seems to me that if one of the two methods were obviously faster than the other, the compiler would simply transform the code to match.

For those cases where branch prediction is of very high importance, what details regarding performance are useful to know?

EDIT: The actual operation of p ? 10 : 20 is merely a placeholder. The actual operation following the branch is at least complex enough that doing both is inefficient. Additionally, if I had enough information to sensibly use __builtin_expect, branch prediction would be a non-issue in this case.

Deus answered 13/2, 2014 at 13:20 Comment(4)
Which CPU? Which compiler? Did you check the assembly to see which of the two strategies was picked? – Gamin
Note: the compiler cannot transform the latter uncond_var because it does not know the complete set of possible derived classes of base. In general, closed problems (a finite number of possible inputs) are easier to solve than open ones. – Gamin
@MatthieuM. Compiler GCC, CPU anything from desktop to smartphones, though a modern desktop CPU is my current concern. Also, it seems odd to me that the compiler doesn't know all possible derived classes of base. It has all the source code, so this information exists. And no, I am not familiar enough with assembly to feel productive diving into such details. That's why I turn to this site, to hopefully get a higher-level understanding from someone who knows such details. – Deus
Regarding the CPU: some CPUs may not have predictors (or maybe only one kind), so the question is not meaningful for all CPUs. Desktop CPUs (x86/x86_64) should have both. Regarding the knowledge available to the compiler: in theory it could know all derived classes; in practice this information is only available if you look at the whole program at once. The compiler front-end (language-aware) will not look at the whole program at once, and the optimizer (middle of the chain) might if you specify LTO (Link Time Optimization) or compile a static executable... but it knows nothing about classes. Assume it will not happen. – Gamin
Side note: if you have a code like

if (p) a = 20; else a = 10;

then there isn't any branch; the compiler uses a conditional move (see: Why is a conditional move not vulnerable for Branch Prediction Failure?)

Hocuspocus answered 13/2, 2014 at 13:28 Comment(6)
I was hoping it was clear from the question that this isn't about all the small details one can use to avoid branches. The question isn't about p ? 10 : 20; it is merely an example. I will make an edit to make this explicit. – Deus
That's why I said my post was a "side note". Sorry if several people see it as an answer. – Hocuspocus
@Hocuspocus Maybe it is because you posted it as an answer. – Sansbury
@Hocuspocus I did learn something new, so that is nice. But yeah, this should absolutely be a comment, not an answer. – Deus
Oh @hivert, no offense meant, by the way, in case my comment reads as offensive. – Sansbury
@porgarmingduod: The problem is that you are asking about very low-level performance, and that depends hugely on what operations are performed, but you are not mentioning those. The question can either be answered based on what you provide or not be answered at all and should be deleted. Take your pick. The third alternative is the obvious answer: MEASURE. – Bloodsucker

You didn't mention your compiler. I once used GCC for a performance-critical application (a contest at my university, actually), and I remember that GCC has the __builtin_expect built-in. I went through all the conditions in my code and ended up with a 5-10% speedup, which I found amazing given that I had paid attention to pretty much everything I knew (memory layout etc.) and hadn't changed anything about the algorithm itself.

The algorithm was a pretty basic depth-first search, by the way. And I ran it on a Core 2 Duo, though I'm not sure which model.

Remunerative answered 13/2, 2014 at 13:27 Comment(2)
Note: what does __builtin_expect do? Given this hint, the compiler optimizes two things: 1/ it can rig the prediction and 2/ it can lay out the code so that the likely block immediately follows the current block (to minimize cache misses). The one caveat, obviously, is that workloads that do not behave as hinted will suffer; so if you use this built-in, you had better make sure you have identified possible outliers. – Gamin
Even better than using __builtin_expect would be to use -fprofile-generate and -fprofile-use to let the profiler figure out which branches should be optimized (see the GCC options documentation). This takes human error out of the equation and won't introduce non-portable code. The caveat is that insufficient profiling will produce non-optimal code. – Isadoraisadore
