Why does sqrt become much faster without -O2 in g++ on my computer?

Consider the following code:

#include <cstdio>
#include <cmath>

const int COUNT = 1000000000;

int main()
{
    double sum = 0;
    for (int i = 1; i <= COUNT; ++i) {
        sum += sqrt(i);
    }
    printf("%f\n", sum);
    return 0;
}

Without -O2, it runs in only 2.9 s on my computer, while with -O2 it takes 6.4 s.

My computer runs Fedora 23, with g++ 5.3.1.

I have tried the same thing on Ubuntu 14.04 (with g++ 4.8); it doesn't have the problem (both take 6.4 s).

Bruton answered 5/5, 2016 at 6:31 Comment(18)
And how did you arrive at these timing results - there's no timing mechanism in your code?Nanete
@Nanete Presumably with the time command. You don't need to create your own timing mechanism if a perfectly good one is already readily available.Renegado
I get similar results....!?! (also with GCC 6.1.0)Anaya
My 5-year-old MacBook Air (yeah, kinda weak) running Apple's LLVM Clang gets 10.352 s without -O2, 7.687 s with -O2. I feel it's probably worth your time to compile to asm and inspect the code. My inspection of the Clang-generated asm shows the optimized version partially unrolls the loop. It would be interesting to see what gcc is doing.Lilalilac
With -O, GCC uses a sqrtsd %xmm0,%xmm0 instruction. With -O2, GCC uses a sqrtsd %xmm0,%xmm1 instruction, which on my system increases the time by 2s. If I take the -O2 assembly code, change that, and change the remaining %xmm1 references to %xmm0, the time goes down by 2s again. But I have no idea why it's faster, nor why if it's faster, GCC doesn't use the faster version.Renegado
Show full compiler flags that you usedPreponderant
@Nanete Just as hvd said, I use time.Bruton
@Preponderant Just g++ tmp -o tmp and g++ tmp -o tmp -O2, nothing more.Bruton
Probably better late than never, but godbolt.org is an awesome place to inspect the asm generated by different compilers (or the same compiler at different optimization levels). I highly recommend you put it on your favorites bar for future reference.Lilalilac
@Lilalilac Thanks. I will try it.Bruton
I'm 99% sure this question is a duplicate of Why sqrt in global scope is much slower than std::sqrt in MinGW?. It seems to point out that the C++11 sqrt int overload is faster than the built-in sqrt (implemented by the C runtime, in this case glibc). Someone decided to downvote my answer anyways.Colier
@user6292850 That downvote was mine, and I downvoted because it's wrong and my comment on this question already showed that it's wrong well before you posted your answer.Renegado
This looks like a bug in the gcc optimizer. Clang doesn't seem to have it.Thayer
@hvd Maybe I'm missing something. Can you explain how that's related to why my answer is wrong?Colier
@user6292850 If I'm able to see the difference in performance with exactly zero change to which functions get called, the problem can't be in which functions get called.Renegado
@hvd I think the reason why we're seeing a difference is because you used -O, which is not the same as no optimizations.Colier
@user6292850 The performance with -O0 and -O was the same. I used -O because the generated assembly code with -O was much closer to the generated assembly code with -O2, making it easier to pin-point the specific instructions having a problem.Renegado
Perhaps another point of interest: with the exact same compiler, exact same source code, exact same distribution and hence exact same generated program, but running on older computer with older CPU ("model name : Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz", from somewhere around 2011, IIRC), there isn't any big difference between -O0/-O/-O2 performance. Is GCC perhaps simply optimising for older CPUs?Renegado

The naive version calls the glibc sqrt function.

The optimized version uses the SSE sqrtsd instruction. But after the instruction completes, it checks that the result is not a NaN; if it is, it calls the glibc sqrt function to set up the proper error flags (see the math_error(7) manual page). See Why does compiler generate additional sqrts in the compiled assembly code for a detailed explanation.

Why does gcc think this is faster? Nobody knows. If you are sure that your numbers don't generate NaNs, use the -fno-math-errno compile option.

Inconsequential answered 5/5, 2016 at 7:7 Comment(4)
With -ffast-math, this check is also removed and generates faster code.Farthest
Except that at -O, GCC does use the sqrtsd instruction (with different operands), yet achieves better performance than at -O2.Renegado
@hvd, you are correct. It seems that branch prediction is somehow involved? -O2 generates jp <block with call to sqrt> and -O uses jnp <block without call to sqrt>.Inconsequential
I don't think so, see my comment on the question, but I have no idea what would be the cause.Renegado

Investigating the assembly may turn up some answers, but the easiest way to see the difference in code is to use -fdump-tree-optimized. The issue seems to be related to the sqrt overloads, namely the one provided by the C library, sqrt(double), versus C++11's sqrt(int). The latter seems to be faster, and GCC doesn't seem to care whether you use -std=c++11 or prefix sqrt with std:: or not.

Here's an excerpt for the dump with -O2 or -O (-O with no number enables optimizations, to disable all optimizations, omit -O):

  int i;
  double sum;
  double _9;
  __type _10;

  <bb 2>:

  <bb 3>:
  # sum_15 = PHI <sum_6(3), 0.0(2)>
  # i_16 = PHI <i_7(3), 1(2)>
  _9 = (double) i_16;
  _10 = __builtin_sqrt (_9);
  sum_6 = _10 + sum_15;
  i_7 = i_16 + 1;
  if (i_7 == 1000000001)
    goto <bb 4>;
  else
    goto <bb 3>;

Then without -O2:

  <bb 4>:
  _8 = std::sqrt<int> (i_2);
  sum_9 = sum_1 + _8; 
  i_10 = i_2 + 1; 
  goto <bb 3>; 

Notice that it uses std::sqrt<int>. For the skeptical, please see Why sqrt in global scope is much slower than std::sqrt in MinGW?

Colier answered 5/5, 2016 at 7:28 Comment(0)
