How to understand this tricky speed-up
Sorry if the question is perhaps too abstract, but for me it is quite practical, and maybe some experts have had a similar experience and can explain it.

I have a big piece of code, about 10,000 lines in size.

I noticed that if, in a certain place, I put

if ( expression ) continue;

where expression is always false (double-checked against the logic of the code and with cout), but depends on unknown parameters (so the compiler can't simply get rid of this line during compilation), the speed of the program increases by 25% (the results of the calculation are the same). If I measure the speed of the loop itself, the speed-up factor is bigger than 3.

Why can this happen, and what are possible ways to get this speed-up without such tricks?

P.S. I use gcc 4.7.3 with -O3 optimisation.


More info:

  1. I have tried two different expressions; both work.

  2. If I change the line to:

    if ( expression ) { cout << " HELLO " << endl; continue; };
    

    the speed-up is gone.

  3. If I change the line to:

    expression;
    

    the speed-up is gone.

  4. The code that surrounds the line looks like this:

    for ( int i = a; ; ) {
      do {
        i += d;
        if ( d*i > d*ilast ) break;

        // small amount of calculations, and conditional calls of continue;

      } while ( expression0 );
      if ( d*i > d*ilast ) break;

      if ( expression ) continue;

      // very big amount of calculations, and conditional calls of continue;

    }
    

    The for loop looks strange because I have modified the loops in order to isolate this bottleneck. Initially expression was equal to expression0, and instead of the do-loop I had only this continue.

  5. I tried using __builtin_expect in order to understand branch prediction (the usual likely/unlikely macro wrappers for it are sketched after this list). With

    // the expression (= false) is hinted to be true for branch prediction.
    if ( __builtin_expect( !!(expression), 1) ) continue;
    

    the speed-up is 25%.

    // the expression (= false) is hinted to be false for branch prediction.
    if ( __builtin_expect( !!(expression), 0) ) continue;
    

    the speed-up is gone.

  6. If I use -O2 instead of -O3, the effect is gone. The code is slightly (~3%) slower than the fast -O3 version with the false condition.

  7. The same holds for "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize". With one more option, "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize -fipa-cp-clone", the effect is amplified: with "the line" the speed is the same, without "the line" the code is 75% slower.

  8. The reason is the conditional statement that immediately follows. The code looks like this:

    for ( int i = a; ;  ) {
    
          // small amount of calculations, and conditional calls of continue;
    
      if ( expression ) continue;
    
        // calculations1
    
      if ( expression2 ) {
        // calculations2
      }
    
      // very big amount of calculations, and conditional calls of continue;
    
    }
    

    The value of expression2 is almost always false. So I changed it like this:

    for ( int i = a; ;  ) {
    
          // small amount of calculations, and conditional calls of continue;
    
      // if ( expression ) continue; // don't need this anymore
    
        // calculations1
    
      if ( __builtin_expect( !!(expression2), 0 ) ) { // suppose expression2 == false
        // calculations2
      }
    
      // very big amount of calculations, and conditional calls of continue;
    
    }
    

    And I got the desired 25% speed-up, even a little bit more, and the behaviour no longer depends on the critical line.
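
For reference, the !!(expression)-with-a-hint pattern used in items 5 and 8 is commonly wrapped in small convenience macros. A minimal sketch, assuming GCC or Clang; the macro names are only illustrative:

    // Convenience wrappers around __builtin_expect; they fall back to the plain
    // condition on compilers that don't provide the builtin.
    #if defined(__GNUC__)
    #  define LIKELY(x)   __builtin_expect( !!(x), 1 )   // hint: usually true
    #  define UNLIKELY(x) __builtin_expect( !!(x), 0 )   // hint: usually false
    #else
    #  define LIKELY(x)   (x)
    #  define UNLIKELY(x) (x)
    #endif

    // The fix from item 8 then reads:
    //   if ( UNLIKELY(expression2) ) { /* calculations2 */ }

This is only syntactic sugar over what item 8 already does; the generated code should be the same.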

If somebody knows of materials that can explain this behaviour without guessing, I will be very glad to read them and accept their answer.

Hart answered 28/10, 2013 at 16:12 Comment(19)
I doubt there is anything we can tell you without an SSCCE.Bougie
This is probably platform and compiler specific behavior; this will be very difficult to judge without seeing the surrounding code. (Unless you are mistaken and expression is, in fact, true sometimes.) It would be best to reduce it to a SSCCE. sscce.orgRangy
Without seeing more of the code and a breakdown of the benchmarks you have done (e.g. is the runtime consistent both pre- and post-change in many runs?), it will be difficult to narrow down.Hypaesthesia
@Hart I misread, deleted my commentEmersonemery
Do you have some code after if (expression) that this always-false condition would jump over?Heterodoxy
@JBentley, If I measure speed of the loop itself the speed up factor is bigger than 3.Hart
Does it happen also with -O2? Is -O3 actually faster than -O2?Plumage
What happens if you get rid of the surrounding if ... continue and just write expression?Mousetrap
@Zac Howland, @M M. I have added more information about surrounding code.Hart
@molbdnilo, the speed up is gone.Hart
@klm123: Typo in 5. both branch hints are the same, while they are meant to be opposite.Rafaelof
Sounds like the result of branch prediction.Callicrates
@n.m., it doesn't. See item 6 in the post. Do you have an idea why -O3 does it differently?Hart
Well, I heard that -O3 and higher levels aren't working that well in gcc. It's prone to producing severely crippled object code under some mysterious conditions. Stick with -O2 for best results.Plumage
If you can make your code public, you can take your part in improving gcc and submit a bug report.Plumage
@Hart To get to the root cause of this you're going to have to inspect the assembly dump for these two variations and see what is being done differently. That's the only way to confirm what is actually happening. Otherwise you're just guessing.Alloplasm
@Hart actually I'm kind of curious about this myself. If you can post the disassembly of both versions here, I'm sure someone can explain the performance difference. Not the entire listing obviously, that would be way too long -- just the surrounding parts of those expressions in the loop.Alloplasm
@greatwolf, sorry I have already modified the code and don't have the described version anymore.Hart
Shouldn't it be in your version control somewhere? o.OAlloplasm

Found it.

The reason was the conditional statement that immediately follows. So the code looks like this:

for ( int i = a; ;  ) {

      // small amount of calculations, and conditional calls of continue;

  if ( expression ) continue;

    // calculations1

  if ( expression2 ) {
    // calculations2
  }

   // very big amount of calculations, and conditional calls of continue;

}

The value of expression2 is almost always false. So I changed it like this:

for ( int i = a; ;  ) {

      // small amount of calculations, and conditional calls of continue;

  // if ( expression ) continue; // don't need this anymore

    // calculations1

  if ( __builtin_expect( !!(expression2), 0 ) ) { // suppose expression2 == false
    // calculations2
  }

   // very big amount of calculations, and conditional calls of continue;

}

And I got the desired 25% speed-up, even a little bit more, and the behaviour no longer depends on the critical line.


I'm not sure how to explain it and can't find enough material on branch prediction.

But I guess the point is that calculations2 should be skipped, while the compiler doesn't know this and assumes expression2 == true by default. Meanwhile, it assumes that in the simple continue-check

if ( expression ) continue;

expression == false, and it nicely skips calculations2, as has to be done in any case. When there are more complicated operations under the if (for example cout), it assumes that expression is true and the trick doesn't work.
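
A related technique, just a sketch that I have not measured on this code: factor calculations2 out into a function marked with GCC's cold attribute, so the compiler treats the paths that call it as unlikely and keeps the rare code away from the hot loop body without an explicit hint. The name calculations2_cold is made up.

    // Sketch only; calculations2_cold is a hypothetical name for the factored-out work.
    __attribute__((noinline, cold))        // tells gcc this function is rarely executed
    static void calculations2_cold( /* whatever state it needs */ ) {
      // calculations2
    }

    // ... inside the loop, instead of the hinted if:
    if ( expression2 )
      calculations2_cold();                // paths calling cold functions are treated as unlikely

In principle this should give the same layout as the __builtin_expect hint, but I have not tried it here.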

If somebody knows of materials that can explain this behaviour without guessing, I will be very glad to read them and accept their answer.

Hart answered 28/10, 2013 at 20:47 Comment(0)

The introduction of that impossible-to-reach branch breaks up the flow graph. Normally the compiler knows that the flow of execution goes from the top of the loop straight to the exit test and back to the start again. Now there's an extra node in the graph where the flow can leave the loop. It now needs to compile the loop body differently, in two parts.

This almost always results in worse code. As to why it doesn't here, I can only offer one guess: you didn't compile with profiling information. Hence, the compiler has to make assumptions. In particular, it must make assumptions about how likely the branch is to be taken at runtime.

Clearly, since the assumptions it must make are different, it's quite possible that the resulting code differs in speed.
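
If you want to see how the assumption alone changes the generated code, a tiny self-contained skeleton along these lines (entirely made up, not your code) can be compiled twice, once with -DHINT_TAKEN and once without, and the -O3 assembly compared:

    // Hypothetical skeleton: build twice, e.g.
    //   g++ -O3 -S skeleton.cpp -o hint_not_taken.s
    //   g++ -O3 -S -DHINT_TAKEN skeleton.cpp -o hint_taken.s
    // 'never' is volatile so the always-false branch cannot be folded away,
    // mimicking the "depends on unknown parameters" situation.
    #include <cstdio>

    #ifdef HINT_TAKEN
    #  define HINT(x) __builtin_expect( !!(x), 1 )   // pretend the branch is usually taken
    #else
    #  define HINT(x) __builtin_expect( !!(x), 0 )   // pretend it is usually not taken
    #endif

    int main() {
        volatile int never = 0;            // stands in for the always-false expression
        double sum = 0.0;
        for ( long i = 0; i < 100000000L; ++i ) {
            if ( HINT(never) ) continue;   // the "critical line"
            sum += i * 1e-9;               // stands in for the big block of calculations
        }
        std::printf( "%f\n", sum );
        return 0;
    }

Whether something this small reproduces your 25% difference obviously depends on what the big calculations look like; the point is only that the two builds differ solely in the assumed branch probability.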

Batting answered 28/10, 2013 at 16:24 Comment(5)
Clearly. I have investigated the assumptions. See the additional info, item 5. And as expected, the line helps only if the compiler thinks it is equal to true in most cases.Hart
Ok, seems this is the right answer then: forcing that assumption causes exactly the same speedup, so it definitely hinges on the assumption and by extension the branch prediction.Batting
But how is this possible? The compiler creates more pieces of code and starts to execute the wrong one. The code should be slower...Hart
Could well be an unintended pre-fetch of a variable that you'll need below. Different stack layout, too.Batting
I tried different variables, including introducing early an additional variable == expression, which I never use. Also see item 3.Hart

I hate to say it, but the answer is going to be pretty technical, and more importantly, very specific to your code. So much so, that probably nobody outside of yourself is going to invest the time to investigate the root of your question. As others have suggested, it will quite possibly depend on branch prediction and other post-compile optimizations related to pipelining.

The only thing I can suggest, to help you narrow down whether this is a compiler-optimization issue or a post-compile (CPU) optimization, is to compile your code again with -O2 vs -O3, but this time add the following additional options: -fverbose-asm -S. Write each of the outputs to a different file, and then run something like sdiff to compare them. You should see a lot of differences.

Unfortunately, without a good understanding of assembly code, it will be tough to make heads or tails of it, and honestly, not many people on Stack Overflow have the patience (or time) to spend more than a few minutes on this issue. If you aren't fluent in assembly (presumably x86), then I would suggest finding a coworker or friend who is, to help you parse the assembly output.
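
As a starting point you don't even need the full program to practice reading such a diff; a single small function is enough. A sketch (the file and function names are made up):

    // branch_probe.cpp -- compile twice and compare, e.g.
    //   g++ -O2 -fverbose-asm -S branch_probe.cpp -o probe_O2.s
    //   g++ -O3 -fverbose-asm -S branch_probe.cpp -o probe_O3.s
    //   sdiff probe_O2.s probe_O3.s | less
    double probe( const double* data, int n, int flag ) {
        double sum = 0.0;
        for ( int i = 0; i < n; ++i ) {
            if ( flag ) continue;          // stands in for the suspicious always-false branch
            sum += data[i] * data[i];      // stands in for the heavy calculations
        }
        return sum;
    }

Once you can follow the differences on something that small, the same exercise on the real loop body becomes much less intimidating.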

Yuyuan answered 28/10, 2013 at 21:46 Comment(0)
