Yes, MSVC understands that sqrt
is a pure function with no side-effects, so it can hoist it out of the loop, as your release-mode asm shows.
Strangely, it doesn't fold sqrt(2.0) into a compile-time constant the way other compilers do. (GCC does that even at -O0, just adding a constant 1.41... in the loop.) Perhaps MSVC is considering the possibility of the rounding mode being set differently? (GCC's default is -fno-rounding-math, so it can assume the IEEE default rounding mode for compile-time constant folding.)
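For what it's worth, the rounding mode really can change sqrt's result at run time. A minimal sketch of my own (not from the question): you'd need GCC -frounding-math or MSVC -fp:strict so compile-time folding doesn't hide the difference, and strictly the standard also wants #pragma STDC FENV_ACCESS ON, which GCC doesn't implement.

#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    std::printf("%.17g\n", std::sqrt(2.0)); // default rounding: 1.4142135623730951
    std::fesetround(FE_DOWNWARD);           // round toward -infinity
    std::printf("%.17g\n", std::sqrt(2.0)); // one ULP lower: 1.4142135623730949
    return 0;
}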
MSVC -fp:fast does get it to optimize away the call to sqrt entirely. I made a function that just returns the value, to avoid cluttering the output with all the asm for calling cout functions.
#include <cmath>
double foo(){
    double var = 0.;
    for(int i = 0; i < 1000000 ; i++) {
        var += sqrt(2.0);   // loop-invariant; can be hoisted or constant-folded
    }
    return var;
}
(Godbolt) MSVC 19.38 -O2 -fp:fast
(or the VEX equivalent with -arch:AVX2
)
__real@402c48c6001f0ac2 DQ 0402c48c6001f0ac2r ; 14.1421
double foo(void) PROC ; foo, COMDAT
movsd xmm1, QWORD PTR __real@402c48c6001f0ac2
xorps xmm0, xmm0
mov eax, 100000 ; 000186a0H
$LL4@foo:
addsd xmm0, xmm1
sub rax, 1
jne SHORT $LL4@foo
ret 0
double foo(void) ENDP ; foo
Notice that the actual asm loop isn't unrolled, but it only does 1/10th of the source iterations by adding 14.1421 = 10 * sqrt(2).
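In C++ terms, the -fp:fast asm is doing something like this (my back-translation of the asm, not anyone's source):

#include <cmath>

double foo_fp_fast_equivalent() {
    const double ten_sqrt2 = 10 * std::sqrt(2.0);   // folded to 14.142135...
    double var = 0.;
    for (int i = 0; i < 100000; i++)                // 1/10th the source iterations
        var += ten_sqrt2;                           // one addsd per iteration
    return var;
}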
Compilers apparently aren't really expecting anyone to write something this pointless, even after other optimizations produce constants, so they don't turn it into just a constant or into 1000000.0 * 1.41421. (With a smaller iteration count, some compilers, at least at -O3, would do constant-propagation through enough loop iterations to find out that the result is a constant and just return that.)
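For example, with a tiny trip count, GCC and clang at -O3 do fold the whole thing (a hypothetical variant of foo, not from the question):

#include <cmath>

double foo_small() {
    double var = 0.;
    for (int i = 0; i < 4; i++)     // small enough to fully unroll / peel
        var += std::sqrt(2.0);
    return var;                     // -O3: just load the constant 5.656854... and ret
}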
Other compilers are similarly amusing in their fast-math efforts:
GCC auto-vectorizes, but only uses the result of one element instead of doing a horizontal sum! It adds the vector of [0, 1*sqrt2, 2*sqrt2, 3*sqrt2, ...] (or something like that?) at the end instead of the start, and just takes the highest element, which has "only" seen 125000 additions. (That's 1000000 / 8, where 8 is the number of double elements in a 512-bit vector; I used AVX-512 since it has the densest code, with broadcast source operands.)
# GCC13.2 -O3 -march=x86-64-v4
# I was expecting the default -mtune for x86-64-v4 to include -mprefer-vector-width=256 like for clang
# But no, GCC's generic AVX-512 favours 512-bit vectors. Which is maybe good on Sapphire Rapids as well as Zen 4, especially when GCC doesn't unroll with multiple accumulators!
foo():
vmovapd zmm0, ZMMWORD PTR .LC0[rip] # start with [0,1,2,..7] * sqrt(2)
xor eax, eax
vbroadcastsd zmm2, QWORD PTR .LC2[rip] # and add 8 * sqrt(2) each iteration
.L2:
add eax, 1
vmovapd zmm1, zmm0 # save a value from before the add?
vaddpd zmm0, zmm0, zmm2
cmp eax, 125000
jne .L2
vaddpd zmm1, zmm1, QWORD PTR .LC4[rip]{1to8} # add 1*sqrt(2) to each element.
valignq zmm0, zmm1, zmm1, 7 # extract the top element as the scalar return value (bottom of XMM0)
vzeroupper
ret
# .rodata
.LC0:
.long 0
.long 0 # double 0.0
.long 1719614413
.long 1073127582 # double 1.4142135623730951
.long 1719614413
.long 1074176158 # double 2.8284271247461903
.long -857772838
.long 1074854006 # double 4.2426406871192857 = 3*sqrt(2)
... up to 7 * sqrt(2)
.LC2:
.long 1719614413 # fun fact: same low half (part of mantissa) as 1*sqrt(2) because multiply by a power of 2 only changed the exponent.
.long 1076273310 # 8*sqrt(2)
.set .LC4,.LC0+8 # i.e. 1.4142135623730951 = 1 * sqrt(2)
So GCC generated all 1000000 intermediate values of var along the way (8 at a time), for no reason.
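In scalar C++ terms, the one lane GCC actually returns is computed something like this (my reconstruction of the asm above, not compiler output):

#include <cmath>

double gcc_lane7() {                         // element 7 of the zmm accumulator
    const double step = 8 * std::sqrt(2.0);  // the broadcast .LC2 constant
    double v = 7 * std::sqrt(2.0);           // element 7 of the .LC0 start vector
    for (int i = 0; i < 124999; i++)         // zmm1 keeps the value from *before*
        v += step;                           // the last vaddpd, so one add is peeled
    return v + std::sqrt(2.0);               // the final {1to8} add of .LC4
}                                            // 7 + 124999*8 + 1 = 1000000 sqrt(2)s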
Clang does what you'd expect/hope, auto-vectorizing with 4 xmm, ymm or zmm accumulators (depending on -march=
options) exactly the way it would for the sum of an array, including some shuffle/add steps for a horizontal sum at the end. It starts with 0
and adds 1 * sqrt(2)
to each element, actually doing 1000000 additions with good SIMD and instruction-level parallelism.
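A rough scalar model of clang's strategy, with the accumulator count and names purely illustrative (the real code uses a SIMD vector for each chain):

#include <cmath>

double clang_style() {
    const double s = std::sqrt(2.0);
    double a0 = 0., a1 = 0., a2 = 0., a3 = 0.;  // 4 independent dep chains
    for (int i = 0; i < 1000000; i += 4) {      // hides FP-add latency
        a0 += s; a1 += s; a2 += s; a3 += s;
    }
    return (a0 + a1) + (a2 + a3);               // horizontal sum at the end
}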
Newer MSVC does know how to partially inline sqrt (the way GCC does for runtime variables without -fno-math-errno), which results in amusing asm that, even at -O2, loads a constant into a register and compares it against zero! Even if it insists on computing sqrt(2.0) at run time, it's still a missed optimization to use the full canned sequence instead of optimizing away that compare against 0.0: no rounding-mode setting can make 0.0 > 2.0 compare true, so the branch to the library call is dead.
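The partial-inline pattern itself amounts to roughly this sketch; __builtin_sqrt is the GCC/clang spelling for a bare hardware sqrt instruction, and the 0.0 > x check matches the ucomisd/ja in MSVC's output below:

#include <cmath>

double sqrt_partially_inlined(double x) {
    if (0.0 > x)                  // false for NaN too, matching ucomisd/ja
        return std::sqrt(x);      // real library call, so errno (EDOM) gets set
    return __builtin_sqrt(x);     // just a sqrtsd instruction, no call
}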
(Godbolt) MSVC 19.38 -O2
without any fast-math or arch options.
__real@4000000000000000 DQ 04000000000000000r ; 2
double foo(void) PROC ; foo, COMDAT
$LN18:
sub rsp, 56 ; 00000038H
movsd xmm1, QWORD PTR __real@4000000000000000
xorps xmm0, xmm0 ; xmm0 = 0
ucomisd xmm0, xmm1 ; compare 0.0 vs 2.0
movaps XMMWORD PTR [rsp+32], xmm6 ; why XMM6? Pick a call-clobbered reg like XMM3.
; oh, because it wants to init the sum before call sqrt
; It would be cheaper just to zero XMM0 after call sqrt / movaps xmm1, xmm0
xorps xmm6, xmm6
ja SHORT $LN13@foo ; if 0.0 > 2.0 goto call the library function
sqrtpd xmm1, xmm1 ; else use the inlined version for non-NaN results.
jmp SHORT $LN14@foo
$LN13@foo:
movaps xmm0, xmm1
call sqrt
movaps xmm1, xmm0
$LN14@foo:
mov eax, 100000 ; 000186a0H
npad 13 ; align the top of the loop
$LL8@foo:
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1 ; unrolled by 10
sub rax, 1
jne SHORT $LL8@foo
movaps xmm0, xmm6
movaps xmm6, XMMWORD PTR [rsp+32] ; restore call-preserved XMM6
add rsp, 56 ; 00000038H
ret 0
double foo(void) ENDP ; foo
Unrolling is pretty much pointless here (especially with such a large repeat count); the bottleneck will be FP addition latency. With a very small repeat count, unrolling can help out-of-order exec see past the loop without using up as much ROB (ReOrder Buffer) capacity on loop overhead.
Amusingly, even GCC unrolls by 2, even though GCC normally never unrolls without profile-guided optimization or an explicit -funroll-loops. (Except to fully unroll, which it calls loop peeling.) That amount of unrolling seems fine, not hurting anything, and it steals less front-end throughput from the other logical core. But surprisingly, clang doesn't unroll at all. Clang loves to unroll when multiple dependency chains are possible, unlike here, where the default -fno-fast-math forces all the additions to happen in order.
# GCC13.2 -O3 (no fast-math)
foo():
movsd xmm1, QWORD PTR .LC1[rip] # tmp = sqrt(2) compile-time constant
mov eax, 1000000
pxor xmm0, xmm0 # var = 0
.L2:
addsd xmm0, xmm1
addsd xmm0, xmm1
sub eax, 2
jne .L2
ret
.LC1:
.long 1719614413
.long 1073127582
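If you want that instruction-level parallelism without -ffast-math, you have to write the independent chains yourself, since splitting the sum changes the rounding order and the compiler isn't allowed to do that for you. A hand-rolled sketch:

#include <cmath>

double foo_two_chains() {
    double a = 0., b = 0.;                  // two independent dependency chains
    for (int i = 0; i < 1000000; i += 2) {
        a += std::sqrt(2.0);                // chain 1
        b += std::sqrt(2.0);                // chain 2: overlaps chain 1's addsd
    }
    return a + b;                           // may round differently than one chain
}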