Why does sqrt become much faster without -O2 in g++ on my computer?

Consider the following code:

#include <cstdio>
#include <cmath>

const int COUNT = 1000000000;

int main()
{
    double sum = 0;
    for (int i = 1; i <= COUNT; ++i) {
        sum += sqrt(i);
    }
    printf("%f\n", sum);
    return 0;
}

Without -O2, it runs in only 2.9 s on my computer, while with -O2 it takes 6.4 s.

My computer runs Fedora 23, with g++ 5.3.1.

I have tried the same thing on Ubuntu 14.04 (with g++ 4.8); it doesn't have the problem (both take 6.4 s).

Bruton answered 5/5, 2016 at 6:31 Comment(18)
And how did you arrive at these timing results - there's no timing mechanism in your code?Nanete
@Nanete Presumably with the time command. You don't need to create your own timing mechanism if a perfectly good one is already readily available.Renegado
I get similar results....!?! (also with GCC 6.1.0)Anaya
My 5-year-old MacBook Air (yeah, kinda weak) running Apple's LLVM Clang gets 10.352 s without -O2, 7.687 s with -O2. I feel it's probably worth your time to compile to asm and inspect the code. My inspection of the Clang-generated asm shows the optimized version partially unrolls the loop. It would be interesting to see what gcc is doing.Lilalilac
With -O, GCC uses a sqrtsd %xmm0,%xmm0 instruction. With -O2, GCC uses a sqrtsd %xmm0,%xmm1 instruction, which on my system increases the time by 2s. If I take the -O2 assembly code, change that, and change the remaining %xmm1 references to %xmm0, the time goes down by 2s again. But I have no idea why it's faster, nor why if it's faster, GCC doesn't use the faster version.Renegado
Show full compiler flags that you usedPreponderant
@Nanete Just as hvd said, I use time.Bruton
@Preponderant Just g++ tmp -o tmp and g++ tmp -o tmp -O2, nothing more.Bruton
Probably better late than never, but godbolt.org is an awesome place to inspect the asm generated by different compilers (or the same compiler at different optimization levels). I highly recommend you put it on your favorites bar for future reference.Lilalilac
@Lilalilac Thanks. I will try it.Bruton
I'm 99% sure this question is a duplicate of Why sqrt in global scope is much slower than std::sqrt in MinGW?. It seems to point out that the C++11 sqrt int overload is faster than the built-in sqrt (implemented by the C runtime, in this case glibc). Someone decided to downvote my answer anyways.Colier
@user6292850 That downvote was mine, and I downvoted because it's wrong and my comment on this question already showed that it's wrong well before you posted your answer.Renegado
This looks like a bug in the gcc optimizer. Clang doesn't seem to have it.Thayer
@hvd Maybe I'm missing something. Can you explain how that's related to why my answer is wrong?Colier
@user6292850 If I'm able to see the difference in performance with exactly zero change to which functions get called, the problem can't be in which functions get called.Renegado
@hvd I think the reason why we're seeing a difference is because you used -O, which is not the same as no optimizations.Colier
@user6292850 The performance with -O0 and -O was the same. I used -O because the generated assembly code with -O was much closer to the generated assembly code with -O2, making it easier to pin-point the specific instructions having a problem.Renegado
Perhaps another point of interest: with the exact same compiler, exact same source code, exact same distribution and hence exact same generated program, but running on older computer with older CPU ("model name : Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz", from somewhere around 2011, IIRC), there isn't any big difference between -O0/-O/-O2 performance. Is GCC perhaps simply optimising for older CPUs?Renegado

The naive version calls the glibc sqrt function.

The optimized version uses the SSE sqrtsd instruction. But after the instruction completes, it checks that the result is not a NaN; if it is, it calls the glibc sqrt function to set up the proper error flags (see the math_error(7) manual page). See Why does compiler generate additional sqrts in the compiled assembly code for a detailed explanation.

Why does gcc think this is faster? Nobody knows. If you are sure that your numbers don't generate NaNs, use the -fno-math-errno compile option.

Inconsequential answered 5/5, 2016 at 7:7 Comment(4)
With -ffast-math, this check is also removed and generates faster code.Farthest
Except that at -O, GCC does use the sqrtsd instruction (with different operands), yet achieves better performance than at -O2.Renegado
@hvd, you are correct. It seems that branch prediction is somehow involved? -O2 generates jp <block with call to sqrt> and -O uses jnp <block without call to sqrt>.Inconsequential
I don't think so, see my comment on the question, but I have no idea what would be the cause.Renegado

Investigating the assembly may turn up some answers, but the easiest way to see the difference in code is to use -fdump-tree-optimized. The issue seems to be related to the sqrt overloads, namely the one provided by the C library, sqrt(double), versus C++11's sqrt(int). The latter seems to be faster, and GCC doesn't seem to care whether you use -std=c++11 or prefix sqrt with std:: or not.

Here's an excerpt for the dump with -O2 or -O (-O with no number enables optimizations, to disable all optimizations, omit -O):

  int i;
  double sum;
  double _9;
  __type _10;

  <bb 2>:

  <bb 3>:
  # sum_15 = PHI <sum_6(3), 0.0(2)>
  # i_16 = PHI <i_7(3), 1(2)>
  _9 = (double) i_16;
  _10 = __builtin_sqrt (_9);
  sum_6 = _10 + sum_15;
  i_7 = i_16 + 1;
  if (i_7 == 1000000001)
    goto <bb 4>;
  else
    goto <bb 3>;

Then without -O2:

  <bb 4>:
  _8 = std::sqrt<int> (i_2);
  sum_9 = sum_1 + _8; 
  i_10 = i_2 + 1; 
  goto <bb 3>; 

Notice that it uses std::sqrt<int>. For the skeptical, please see Why sqrt in global scope is much slower than std::sqrt in MinGW?

Colier answered 5/5, 2016 at 7:28 Comment(0)
