An expensive jump with GCC 5.4.0
Asked Answered
B

4

172

I had a function which looked like this (showing only the important part):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) && (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

Written like this, the function took ~34ms on my machine. After changing the condition to bool multiplication (making the code look like this):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) * (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

the execution time decreased to ~19ms.

The compiler used was GCC 5.4.0 with -O3 and after checking the generated asm code using godbolt.org I found out that the first example generates a jump, while the second one does not. I decided to try GCC 6.2.0 which also generates a jump instruction when using the first example, but GCC 7 seems to not generate one anymore.

Finding out this way to speed up the code was rather gruesome and took quite some time. Why does the compiler behave this way? Is it intended and is it something the programmers should look out for? Are there any more things similar to this?

Barthol answered 6/12, 2016 at 9:22 Comment(18)
Why does the compiler behave this way? The compiler can do as he wants, as long as the generated code is correct. Some compilers are simply better at optimizations than others.Arni
It would also help to post a full example with code that actually compiles, and a link to the godbolt pages...Photoflood
My guess is that the short-circuit evaluation of && causes this.Photoflood
Note that this is why we also have &.Ressieressler
@Jakub no it's not: the second condition must not be evaluated if the first condition is false. Note the word if in that sentence, which directly translates to a jump. I'm actually surprised the CPU branch predictor doesn't alleviate this problem. Is you data unsorted by chance?Ressieressler
yes, it indeed is unsortedLaparotomy
@Jakub sorting it will most probably increase execution speed, see this question.Ressieressler
@JakubJůza One thing you could try is to use profile-guided optimization.This should help the branch prediction on real code. But if the values are random there is not much you can do.Photoflood
changing to (curr[i] < 479) & (l[i + shift] < 479) might further increase performanceDogmatics
@rubenvb: I suspect (looking at the function name), that sorting the arrays isn't going to be possible. The vector is implementing a function of the index, rather than just holding a bag of values.Chalcanthite
@Ressieressler "must not be evaluated" doesn't actually mean anything for an expression that has no side effects. I suspect that vector does bounds-checking and that GCC can't prove it won't be out of bounds. EDIT: Actually, I don't think you are doing anything to stop i+shift from being out of bounds.Furnishings
@Furnishings it does mean something: the comparison to the right of && needs to be evaluated, which means a fetch from memory of the variable. now, if the statement was something like if(pointer && pointer->something), the "must not be evaluated" does really mean something. Here, the same thing applies: if the operator[] needed to say, access disk or other resources, short-circuiting (i.e. not evaluating the expression to the right of &&) is really wanted and useful behavior. Bounds-checking has nothing to do with it.Ressieressler
@Ressieressler The compiler is allowed to remove superfluous memory access. So it's allowed to turn & or * in the OP's example. In theory it could also optimize && into non branching code, but that'd require proving that l[i + shift] has no side effect, in particular that it doesn't trigger a memory access violation, so this looks like a rather unlikely optimization.Molal
@CodesInChaos: "Triggering a memory access violation" is not a side effect. It's undefined behavior, and thus the compiler can assume it doesn't happen. The problem is likely that std::vector has to diagnose and throw an exception for out-of-bound indices.Radioisotope
@R.. I meant what I said. UB is a concept that applies to C code, not to the assembly the C compiler emits. The compiler needs to make sure it doesn't emit a memory access that triggers an access violation on the right hand side in the short-circuiting case, since the code needs to behave like the memory access isn't happening.Molal
@CodesInChaos: Oh, I read your comment backwards I guess.Radioisotope
What platform are you compiling on? GCC prioritizes code size, not performance, on embedded platforms at least. I have never verified this on a PC, but I would expect it to be the same. GCC is used as the main compiler on many platforms with very limited code space, so it makes sense that it would try to shrink code size as much as possible and making performance optimization a second concern. In most applications, choosing performance first where applicable would not make a huge overall performance difference, but would increase costs where you would need a micro with more memory to hold codeDutyfree
Why is a*b != 0 faster than a != 0 && b != 0?Dogmatics
N
264

The logical AND operator (&&) uses short-circuit evaluation, which means that the second test is only done if the first comparison evaluates to true. This is often exactly the semantics that you require. For example, consider the following code:

if ((p != nullptr) && (p->first > 0))

You must ensure that the pointer is non-null before you dereference it. If this wasn't a short-circuit evaluation, you'd have undefined behavior because you'd be dereferencing a null pointer.

It is also possible that short circuit evaluation yields a performance gain in cases where the evaluation of the conditions is an expensive process. For example:

if ((DoLengthyCheck1(p) && (DoLengthyCheck2(p))

If DoLengthyCheck1 fails, there is no point in calling DoLengthyCheck2.

However, in the resulting binary, a short-circuit operation often results in two branches, since this is the easiest way for the compiler to preserve these semantics. (Which is why, on the other side of the coin, short-circuit evaluation can sometimes inhibit optimization potential.) You can see this by looking at the relevant portion of object code generated for your if statement by GCC 5.4:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L5

    cmp     ax, 478           ; (l[i + shift] < 479)
    ja      .L5

    add     r8d, 1            ; nontopOverlap++

You see here the two comparisons (cmp instructions) here, each followed by a separate conditional jump/branch (ja, or jump if above).

It is a general rule of thumb that branches are slow and are therefore to be avoided in tight loops. This has been true on virtually all x86 processors, from the humble 8088 (whose slow fetch times and extremely small prefetch queue [comparable to an instruction cache], combined with utter lack of branch prediction, meant that taken branches required the cache to be dumped) to modern implementations (whose long pipelines make mispredicted branches similarly expensive). Note the little caveat that I slipped in there. Modern processors since the Pentium Pro have advanced branch prediction engines that are designed to minimize the cost of branches. If the direction of the branch can be properly predicted, the cost is minimal. Most of the time, this works well, but if you get into pathological cases where the branch predictor is not on your side, your code can get extremely slow. This is presumably where you are here, since you say that your array is unsorted.

You say that benchmarks confirmed that replacing the && with a * makes the code noticeably faster. The reason for this is evident when we compare the relevant portion of the object code:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    xor     r15d, r15d        ; (curr[i] < 479)
    cmp     r13w, 478
    setbe   r15b

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     ax, 478
    setbe   r14b

    imul    r14d, r15d        ; meld results of the two comparisons

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

It is a bit counter-intuitive that this could be faster, since there are more instructions here, but that is how optimization works sometimes. You see the same comparisons (cmp) being done here, but now, each is preceded by an xor and followed by a setbe. The XOR is just a standard trick for clearing a register. The setbe is an x86 instruction that sets a bit based on the value of a flag, and is often used to implement branchless code. Here, setbe is the inverse of ja. It sets its destination register to 1 if the comparison was below-or-equal (since the register was pre-zeroed, it will be 0 otherwise), whereas ja branched if the comparison was above. Once these two values have been obtained in the r15b and r14b registers, they are multiplied together using imul. Multiplication was traditionally a relatively slow operation, but it is darn fast on modern processors, and this will be especially fast, because it's only multiplying two byte-sized values.

You could just as easily have replaced the multiplication with the bitwise AND operator (&), which does not do short-circuit evaluation. This makes the code much clearer, and is a pattern that compilers generally recognize. But when you do this with your code and compile it with GCC 5.4, it continues to emit the first branch:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L4

    cmp     ax, 478           ; (l[i + shift] < 479)
    setbe   r14b

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

There is no technical reason it had to emit the code this way, but for some reason, its internal heuristics are telling it that this is faster. It would probably be faster if the branch predictor was on your side, but it will likely be slower if branch prediction fails more often than it succeeds.

Newer generations of the compiler (and other compilers, like Clang) know this rule, and will sometimes use it to generate the same code that you would have sought by hand-optimizing. I regularly see Clang translate && expressions to the same code that would have been emitted if I'd have used &. The following is the relevant output from GCC 6.2 with your code using the normal && operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L7

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r14b

    add     esi, r14d         ; nontopOverlap++

Note how clever this is! It is using signed conditions (jg and setle) as opposed to unsigned conditions (ja and setbe), but this isn't important. You can see that it still does the compare-and-branch for the first condition like the older version, and uses the same setCC instruction to generate branchless code for the second condition, but it has gotten a lot more efficient in how it does the increment. Instead of doing a second, redundant comparison to set the flags for a sbb operation, it uses the knowledge that r14d will be either 1 or 0 to simply unconditionally add this value to nontopOverlap. If r14d is 0, then the addition is a no-op; otherwise, it adds 1, exactly like it is supposed to do.

GCC 6.2 actually produces more efficient code when you use the short-circuiting && operator than the bitwise & operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L6

    cmp     eax, 478          ; (l[i + shift] < 479)
    setle   r14b

    cmp     r14b, 1           ; nontopOverlap++
    sbb     esi, -1

The branch and the conditional set are still there, but now it reverts back to the less-clever way of incrementing nontopOverlap. This is an important lesson in why you should be careful when trying to out-clever your compiler!

But if you can prove with benchmarks that the branching code is actually slower, then it may pay to try and out-clever your compiler. You just have to do so with careful inspection of the disassembly—and be prepared to re-evaluate your decisions when you upgrade to a later version of the compiler. For example, the code you have could be rewritten as:

nontopOverlap += ((curr[i] < 479) & (l[i + shift] < 479));

There is no if statement here at all, and the vast majority of compilers will never think about emitting branching code for this. GCC is no exception; all versions generate something akin to the following:

    movzx   r14d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r14d, 478         ; (curr[i] < 479)
    setle   r15b

    xor     r13d, r13d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r13b

    and     r13d, r15d        ; meld results of the two comparisons
    add     esi, r13d         ; nontopOverlap++

If you've been following along with the previous examples, this should look very familiar to you. Both comparisons are done in a branchless way, the intermediate results are anded together, and then this result (which will be either 0 or 1) is added to nontopOverlap. If you want branchless code, this will virtually ensure that you get it.

GCC 7 has gotten even smarter. It now generates virtually identical code (excepting some slight rearrangement of instructions) for the above trick as the original code. So, the answer to your question, "Why does the compiler behave this way?", is probably because they're not perfect! They try to use heuristics to generate the most optimal code possible, but they don't always make the best decisions. But at least they can get smarter over time!

One way of looking at this situation is that the branching code has the better best-case performance. If branch prediction is successful, skipping unnecessary operations will result in a slightly faster running time. However, branchless code has the better worst-case performance. If branch prediction fails, executing a few additional instructions as necessary to avoid a branch will definitely be faster than a mispredicted branch. Even the smartest and most clever of compilers will have a hard time making this choice.

And for your question of whether this is something programmers need to watch out for, the answer is almost certainly no, except in certain hot loops that you are trying to speed up via micro-optimizations. Then, you sit down with the disassembly and find ways to tweak it. And, as I said before, be prepared to revisit those decisions when you update to a newer version of the compiler, because it may either do something stupid with your tricky code, or it may have changed its optimization heuristics enough that you can go back to using your original code. Comment thoroughly!

Nich answered 6/12, 2016 at 10:49 Comment(12)
Out of curiosity, how much faster is the last code (with &)?Watchmaker
Well, there isn't a universal "better". It all depends on your situation, which is why you absolutely have to benchmark when you're doing this kind of low-level performance optimization. As I explained in the answer, if you're on the losing size of branch prediction, mispredicted branches are going to slow your code down a lot. The last bit of code doesn't use any branches (note the absence of j* instructions), so it will be faster in that case. [continued]Nich
However, you also have to be careful trying to universalize this rule, because if branch prediction is successful, then this code will probably be slightly slower, simply because it executes more instructions. How significant the effect will be is almost impossible to say without measuring it, and it very much depends on the interaction of this snippet of code with the code around it. If the processor can do some of this out-of-order, then the additional instructions won't matter so much. If it's on the critical path, you'll see a slow-down. But the worst-case is still OK. @BЈовићNich
"...the humble 8088 (whose extremely small instruction cache and slow fetch times, combined with utter lack of branch prediction, meant that taken branches required the cache to be dumped)" - I don't think the 8088 had any cache to dump.Mancuso
@Mancuso One other feature found in the 8086/8088 was a small 4- or 6-byte instruction cache or queue that prefetched a few instructions before they were executed. - I guess your link refers to data cache.Smollett
Given how upvoted this is, it's a shame that it's missing the fact that the short-circuit prevents some optimizations. (in particular, the compiler can't take advantage of any undefined behavior that would occur in the not-evaluated expression)Rhymester
@8bit Bob is right. I was referring to the prefetch queue. I probably shouldn't have called it a cache, but wasn't terribly worried about phrasing and didn't spend very long trying to recall the specifics, since I didn't figure anyone much cared except for historical curiosity. If you want details, Michael Abrash's Zen of Assembly Language is invaluable. The entire book is available various places online; here's the applicable portion on branching, but you should read and understand the parts on prefetching, too.Nich
@Hurkyl I feel like the entire answer speaks to that question. You're right that I didn't really call it out explicitly, but it seemed like it was long enough already. :-) Anyone who takes the time to read the entire thing should gain a sufficient understanding of that point. But if you think something is missing, or needs more clarification, please don't be bashful about editing the answer to include it. Some people don't like this, but I absolutely don't mind. I added a brief comment about this, along with a modification of my wording as suggested by 8bittree.Nich
@CodyGray Wow!! How can I get to that level of expertise? Do you have any pointers on resources, pathways, must and mustnots to follow in your footsteps?Beaverette
Hah, thanks for the complement, @green. I don't have anything specific to suggest. As with everything, you become an expert by doing, seeing, and experiencing. I've read everything that I can get my hands on when it comes to the x86 architecture, optimization, compiler internals, and other low-level stuff, and I still know only a fraction of everything that there is to know. The best way to learn is to get your hands dirty digging around. But before you can even hope to start, you'll need a solid grasp of C (or C++), pointers, assembly language, and all the other low-level fundamentals.Nich
Most of the really good resources are old. Abrash's Zen of Assembly (linked and mentioned above) is fantastic, but it's from the early 1990s and even then, discussed a processor that was already largely obsolete (the 8088). The fundamentals you can gain from this are still invaluable, mostly the "zen" part, learning how to think, but the modern-day details are going to be different, so there's still a lot of legwork in learning them. And there is increasingly less interest in a world of "webby" scripting languages. Hard for me to find a job that would actually use my expertise.Nich
Thanks for your inputs! I really appreciate that! I started looking at the Duntemann book to have something more recent to play with. But I'll have a look at Abrash's book.Beaverette
R
23

One important thing to note is that

(curr[i] < 479) && (l[i + shift] < 479)

and

(curr[i] < 479) * (l[i + shift] < 479)

are not semantically equivalent! In particular, the if you ever have the situation where:

  • 0 <= i and i < curr.size() are both true
  • curr[i] < 479 is false
  • i + shift < 0 or i + shift >= l.size() is true

then the expression (curr[i] < 479) && (l[i + shift] < 479) is guaranteed to be a well-defined boolean value. For example, it does not cause a segmentation fault.

However, under these circumstances, the expression (curr[i] < 479) * (l[i + shift] < 479) is undefined behavior; it is allowed to cause a segmentation fault.

This means that for the original code snippet, for example, the compiler can't just write a loop that performs both comparisons and does an and operation, unless the compiler can also prove that l[i + shift] will never cause a segfault in a situation it's required not to.

In short, the original piece of code offers fewer opportunities for optimization than the latter. (of course, whether or not the compiler recognizes the opportunity is an entirely different question)

You might fix the original version by instead doing

bool t1 = (curr[i] < 479);
bool t2 = (l[i + shift] < 479);
if (t1 && t2) {
    // ...
Rhymester answered 6/12, 2016 at 11:17 Comment(1)
This! Depending on the value of shift (and max) there is UB here...Gallegos
P
18

The && operator implements short-circuit evaluation. This means that the second operand is only evaluated if the first one evaluates to true. This certainly results in a jump in that case.

You can create a small example to show this:

#include <iostream>

bool f(int);
bool g(int);

void test(int x, int y)
{
  if ( f(x) && g(x)  )
  {
    std::cout << "ok";
  }
}

The assembler output can be found here.

You can see the generated code first calls f(x), then checks the output and jumps to the evaluation of g(x) when this was true. Otherwise it leaves the function.

Using "boolean" multiplication instead forces the evaluation of both operands every time and thus does not need a jump.

Depending on the data, the jump can cause a slow down because it disturbs the pipeline of the CPU and other things like speculative execution. Normally branch prediction helps, but if your data is random there is not much which can be predicted.

Photoflood answered 6/12, 2016 at 10:6 Comment(8)
Why do you state that multiplication forces the evaluation of both operands every time? 0*x=x*0=0 regardless of the value of x. As optimization, the compiler may "shortcircuit" the multiplication as well. See stackoverflow.com/questions/8145894/… , for example. Moreover, unlike with the && operator, the multiplication may be lazy-evaluated either with the first or with the second argument, allowing more freedom for optimization.Waynant
@Photoflood - "Normally branch prediction helps, but if your data is random there is not much which can be predicted." - makes the good answer.Coincide
@Waynant Ok, the compiler is of course free to do any optimization which keep the observable behavior. This may or may not transform it and leave out computations. if you compute 0 * f() and f has observable behavior the compiler has to do call it. The difference is that short-circuit evaluation is mandatory for && but allowed if it can shown that it is equivalent for *.Photoflood
@Waynant only in the cases that the 0 value can be predicted from a variable or constant. I guess these cases are very very few. Certainly the optimization cannot be done in the case of the OP, as array access is involved.Costrel
@Jens: Short-circuit evaluation is not mandatory. The code is only required to behave as if it short circuits; the compiler is allowed to use any means it likes to achieve the result.Rhymester
@Photoflood Yes, and in the sample given here the observable behavior doesn't change (most likely, assuming operators [] and < don't have some weird overload)Waynant
@DiegoSevilla Why do you think there is a limitation to prediction?Waynant
@Hurkyl Maybe I did not explicitly say that, but the as-if rule always applies. If the compiler can prove that the observable behavior does not change it can do anything. But that is not special about ´&&´, multiplication or anything else.Photoflood
D
-2

This might be because when you are using the logical operator && the compiler has to check two conditions for the if statement to succeed. However in the second case since you are implicitly converting an int value to a bool, the compiler makes some assumptions based on the types and values being passed in, along with (possibly) a single jump condition. It is also possible that the compiler completely optimizes away the jmps with bit shifts.

Deglutition answered 6/12, 2016 at 9:45 Comment(1)
The jump comes from the fact that the second condition is evaluated if and only if the first is true. The code must not evaluate it otherwise, hence the compiler can't optimize this any better and still be correct (unless it could deduce the first statement will always be true).Ressieressler

© 2022 - 2024 — McMap. All rights reserved.