Investigating the assembly may turn up some answers, but the easiest way to see the difference in the code is to compile with -fdump-tree-optimized. The issue seems to be related to the sqrt overloads, namely the sqrt(double) provided by the C library and the C++11 sqrt(int). The latter seems to be faster, and GCC doesn't seem to care whether or not you use -std=c++11 or prefix sqrt with std::.
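To make the two overloads concrete, here is a minimal sketch (it needs a C++11 compiler for decltype and static_assert). The helper names root_from_int, root_from_double and check_roots are hypothetical and only illustrate which overload each call picks; both overloads return double, so any timing difference comes from the code the compiler emits for the call, not from the result type.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// C++11 requires std::sqrt to accept integer arguments and return double.
static_assert(std::is_same<decltype(std::sqrt(2)), double>::value,
              "std::sqrt(int) yields double");
static_assert(std::is_same<decltype(std::sqrt(2.0)), double>::value,
              "std::sqrt(double) yields double");

// Hypothetical helpers: same result, different overload selected.
double root_from_int(int i)    { return std::sqrt(i); }                       // integer overload
double root_from_double(int i) { return std::sqrt(static_cast<double>(i)); }  // double overload

// Sanity check: both paths agree on a perfect square.
void check_roots() {
    assert(root_from_int(16) == 4.0);
    assert(root_from_double(16) == 4.0);
}
```

Calling check_roots() from main is enough to confirm that both helpers compute the same value.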
Here's an excerpt from the dump with -O2 or -O (-O with no number enables optimizations and is equivalent to -O1; to disable all optimizations, omit -O):
    int i;
    double sum;
    double _9;
    __type _10;

    <bb 2>:

    <bb 3>:
    # sum_15 = PHI <sum_6(3), 0.0(2)>
    # i_16 = PHI <i_7(3), 1(2)>
    _9 = (double) i_16;
    _10 = __builtin_sqrt (_9);
    sum_6 = _10 + sum_15;
    i_7 = i_16 + 1;
    if (i_7 == 1000000001)
      goto <bb 4>;
    else
      goto <bb 3>;
Then without -O2:
    <bb 4>:
    _8 = std::sqrt<int> (i_2);
    sum_9 = sum_1 + _8;
    i_10 = i_2 + 1;
    goto <bb 3>;
Notice it uses std::sqrt<int>. For the skeptical, please see "Why sqrt in global scope is much slower than std::sqrt in MinGW?"
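For anyone who wants to produce similar dumps, a loop along the following lines should do. This is a hypothetical reconstruction inferred from the dump (an int counter, a double accumulator, an upper bound of 1000000000), not the asker's exact code. The names sum_sqrt and check_small_case and the file name tmp.cpp are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reconstruction of the benchmark loop; not the original source.
// Possible build lines (tmp.cpp is an assumed file name):
//   g++ tmp.cpp -o tmp -fdump-tree-optimized
//   g++ tmp.cpp -o tmp -O2 -fdump-tree-optimized
// GCC writes the dump to a file whose name ends in ".optimized".
double sum_sqrt(int last) {
    double sum = 0.0;
    for (int i = 1; i <= last; ++i)
        sum += std::sqrt(i);  // int argument: the dump shows how this call is lowered
    return sum;
}

// Quick self-check on a tiny range: 1 + sqrt(2) + sqrt(3) + 2.
void check_small_case() {
    const double expected = 1.0 + std::sqrt(2.0) + std::sqrt(3.0) + 2.0;
    assert(std::fabs(sum_sqrt(4) - expected) < 1e-12);
}
```

Calling sum_sqrt(1000000000) from main and printing the result keeps the compiler from discarding the loop, and gives the time command something to measure.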
time command. You don't need to create your own timing mechanism if a perfectly good one is already readily available. – Renegado
-O, GCC uses a sqrtsd %xmm0,%xmm0 instruction. With -O2, GCC uses a sqrtsd %xmm0,%xmm1 instruction, which on my system increases the time by 2s. If I take the -O2 assembly code, change that, and change the remaining %xmm1 references to %xmm0, the time goes down by 2s again. But I have no idea why it's faster, nor why if it's faster, GCC doesn't use the faster version. – Renegado
time. – Bruton
g++ tmp -o tmp and g++ tmp -o tmp -O2, nothing more. – Bruton
-O, which is not the same as no optimizations. – Colier
-O0 and -O was the same. I used -O because the generated assembly code with -O was much closer to the generated assembly code with -O2, making it easier to pin-point the specific instructions having a problem. – Renegado
-O0/-O/-O2 performance. Is GCC perhaps simply optimising for older CPUs? – Renegado