Yes, MSVC understands that sqrt
is a pure function with no side-effects, so it can hoist it out of the loop, as your release-mode asm shows.
Strangely, it doesn't fold sqrt(2.0) into a compile-time constant the way other compilers do. (GCC does that even at -O0, just adding a constant 1.41... in the loop.) Perhaps MSVC is considering the possibility of the rounding mode being set differently? (GCC's default is -fno-rounding-math, so it can assume the IEEE default rounding mode for compile-time constant folding.)
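For what it's worth, the rounding mode really can change sqrt's result at run time. A minimal sketch of my own (not from the question): you'd need GCC -frounding-math or MSVC -fp:strict so compile-time folding doesn't hide the difference, and strictly the standard also wants #pragma STDC FENV_ACCESS ON, which GCC doesn't implement.

#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    std::printf("%.17g\n", std::sqrt(2.0)); // default rounding: 1.4142135623730951
    std::fesetround(FE_DOWNWARD);           // round toward -infinity
    std::printf("%.17g\n", std::sqrt(2.0)); // one ULP lower: 1.4142135623730949
    return 0;
}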
MSVC -fp:fast does get it to optimize away the call to sqrt entirely. I made a function that just returns the value, to avoid cluttering the output with all the asm for calling cout functions.
#include <cmath>
double foo(){
    double var = 0.;
    for(int i = 0; i < 1000000 ; i++) {
        var += sqrt(2.0);   // loop-invariant; can be hoisted or constant-folded
    }
    return var;
}
(Godbolt) MSVC 19.38 -O2 -fp:fast
(or the VEX equivalent with -arch:AVX2
)
__real@402c48c6001f0ac2 DQ 0402c48c6001f0ac2r ; 14.1421
double foo(void) PROC ; foo, COMDAT
movsd xmm1, QWORD PTR __real@402c48c6001f0ac2
xorps xmm0, xmm0
mov eax, 100000 ; 000186a0H
$LL4@foo:
addsd xmm0, xmm1
sub rax, 1
jne SHORT $LL4@foo
ret 0
double foo(void) ENDP ; foo
Notice that the actual asm loop isn't unrolled, but it only does 1/10th of the source iterations by adding 14.1421 = 10 * sqrt(2).
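In C++ terms, the -fp:fast asm is doing something like this (my back-translation of the asm, not anyone's source):

#include <cmath>

double foo_fp_fast_equivalent() {
    const double ten_sqrt2 = 10 * std::sqrt(2.0);   // folded to 14.142135...
    double var = 0.;
    for (int i = 0; i < 100000; i++)                // 1/10th the source iterations
        var += ten_sqrt2;                           // one addsd per iteration
    return var;
}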
Compilers apparently aren't really expecting anyone to write something this pointless, even after other optimizations produce constants, so they don't turn it into just a constant or into 1000000.0 * 1.41421. (With a smaller iteration count, some compilers, at least at -O3, would do constant-propagation through enough loop iterations to find out that the result is a constant and just return that.)
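For example, with a tiny trip count, GCC and clang at -O3 do fold the whole thing (a hypothetical variant of foo, not from the question):

#include <cmath>

double foo_small() {
    double var = 0.;
    for (int i = 0; i < 4; i++)     // small enough to fully unroll / peel
        var += std::sqrt(2.0);
    return var;                     // -O3: just load the constant 5.656854... and ret
}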
Other compilers are similarly amusing in their fast-math efforts:
GCC auto-vectorizes, but only uses the result of one element instead of doing a horizontal sum! It adds the vector of [0, 1*sqrt2, 2*sqrt2, 3*sqrt2, ...] (or something like that?) at the end instead of the start, and just takes the highest element, which has "only" seen 125000 additions. (That's 1000000 / 8, where 8 is the number of double elements in a 512-bit vector; I used AVX-512 since it has the densest code, with broadcast source operands.)
# GCC13.2 -O3 -march=x86-64-v4
# I was expecting the default -mtune for x86-64-v4 to include -mprefer-vector-width=256 like for clang
# But no, GCC's generic AVX-512 favours 512-bit vectors. Which is maybe good on Sapphire Rapids as well as Zen 4, especially when GCC doesn't unroll with multiple accumulators!
foo():
vmovapd zmm0, ZMMWORD PTR .LC0[rip] # start with [0,1,2,..7] * sqrt(2)
xor eax, eax
vbroadcastsd zmm2, QWORD PTR .LC2[rip] # and add 8 * sqrt(2) each iteration
.L2:
add eax, 1
vmovapd zmm1, zmm0 # save a value from before the add?
vaddpd zmm0, zmm0, zmm2
cmp eax, 125000
jne .L2
vaddpd zmm1, zmm1, QWORD PTR .LC4[rip]{1to8} # add 1*sqrt(2) to each element.
valignq zmm0, zmm1, zmm1, 7 # extract the top element as the scalar return value (bottom of XMM0)
vzeroupper
ret
# .rodata
.LC0:
.long 0
.long 0 # double 0.0
.long 1719614413
.long 1073127582 # double 1.4142135623730951
.long 1719614413
.long 1074176158 # double 2.8284271247461903
.long -857772838
.long 1074854006 # double 4.2426406871192857 = 3*sqrt(2)
... up to 7 * sqrt(2)
.LC2:
.long 1719614413 # fun fact: same low half (part of mantissa) as 1*sqrt(2) because multiply by a power of 2 only changed the exponent.
.long 1076273310 # 8*sqrt(2)
.set .LC4,.LC0+8 # i.e. 1.4142135623730951 = 1 * sqrt(2)
So GCC generated all 1000000 intermediate values of var along the way (8 at a time), for no reason.
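In scalar C++ terms, the one lane GCC actually returns is computed something like this (my reconstruction of the asm above, not compiler output):

#include <cmath>

double gcc_lane7() {                         // element 7 of the zmm accumulator
    const double step = 8 * std::sqrt(2.0);  // the broadcast .LC2 constant
    double v = 7 * std::sqrt(2.0);           // element 7 of the .LC0 start vector
    for (int i = 0; i < 124999; i++)         // zmm1 keeps the value from *before*
        v += step;                           // the last vaddpd, so one add is peeled
    return v + std::sqrt(2.0);               // the final {1to8} add of .LC4
}                                            // 7 + 124999*8 + 1 = 1000000 sqrt(2)s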
Clang does what you'd expect/hope, auto-vectorizing with 4 xmm, ymm or zmm accumulators (depending on -march=
options) exactly the way it would for the sum of an array, including some shuffle/add steps for a horizontal sum at the end. It starts with 0
and adds 1 * sqrt(2)
to each element, actually doing 1000000 additions with good SIMD and instruction-level parallelism.
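A rough scalar model of clang's strategy, with the accumulator count and names purely illustrative (the real code uses a SIMD vector for each chain):

#include <cmath>

double clang_style() {
    const double s = std::sqrt(2.0);
    double a0 = 0., a1 = 0., a2 = 0., a3 = 0.;  // 4 independent dep chains
    for (int i = 0; i < 1000000; i += 4) {      // hides FP-add latency
        a0 += s; a1 += s; a2 += s; a3 += s;
    }
    return (a0 + a1) + (a2 + a3);               // horizontal sum at the end
}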
Newer MSVC does know how to partially inline sqrt (the way GCC does for runtime variables without -fno-math-errno), which results in amusing asm that, even at -O2, loads a constant into a register and compares it against zero! Even if it insists on computing sqrt(2.0) at run time, it's still a missed optimization to use the full canned sequence instead of optimizing away that compare against 0.0: no rounding-mode setting can make 0.0 > 2.0 compare true, so the branch to the library call is dead.
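The partial-inline pattern itself amounts to roughly this sketch; __builtin_sqrt is the GCC/clang spelling for a bare hardware sqrt instruction, and the 0.0 > x check matches the ucomisd/ja in MSVC's output below:

#include <cmath>

double sqrt_partially_inlined(double x) {
    if (0.0 > x)                  // false for NaN too, matching ucomisd/ja
        return std::sqrt(x);      // real library call, so errno (EDOM) gets set
    return __builtin_sqrt(x);     // just a sqrtsd instruction, no call
}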
(Godbolt) MSVC 19.38 -O2
without any fast-math or arch options.
__real@4000000000000000 DQ 04000000000000000r ; 2
double foo(void) PROC ; foo, COMDAT
$LN18:
sub rsp, 56 ; 00000038H
movsd xmm1, QWORD PTR __real@4000000000000000
xorps xmm0, xmm0 ; xmm0 = 0
ucomisd xmm0, xmm1 ; compare 0.0 vs 2.0
movaps XMMWORD PTR [rsp+32], xmm6 ; why XMM6? Pick a call-clobbered reg like XMM3.
; oh, because it wants to init the sum before call sqrt
; It would be cheaper just to zero XMM0 after call sqrt / movaps xmm1, xmm0
xorps xmm6, xmm6
ja SHORT $LN13@foo ; if 0.0 > 2.0 goto call the library function
sqrtpd xmm1, xmm1 ; else use the inlined version for non-NaN results.
jmp SHORT $LN14@foo
$LN13@foo:
movaps xmm0, xmm1
call sqrt
movaps xmm1, xmm0
$LN14@foo:
mov eax, 100000 ; 000186a0H
npad 13 ; align the top of the loop
$LL8@foo:
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1
addsd xmm6, xmm1 ; unrolled by 10
sub rax, 1
jne SHORT $LL8@foo
movaps xmm0, xmm6
movaps xmm6, XMMWORD PTR [rsp+32] ; restore call-preserved XMM6
add rsp, 56 ; 00000038H
ret 0
double foo(void) ENDP ; foo
Unrolling is pretty much pointless here (especially with such a large repeat count); the bottleneck will be FP addition latency. With a very small repeat count, unrolling can help out-of-order exec see past the loop without using up as much ROB (ReOrder Buffer) capacity on loop overhead.
Amusingly, even GCC unrolls by 2, even though GCC normally never unrolls without profile-guided optimization or an explicit -funroll-loops. (Except to fully unroll, which it calls loop peeling.) That amount of unrolling seems fine, not hurting anything, and it steals less front-end throughput from the other logical core. But surprisingly, clang doesn't unroll at all. Clang loves to unroll when multiple dependency chains are possible, unlike here, where the default -fno-fast-math forces all the additions to happen in order.
# GCC13.2 -O3 (no fast-math)
foo():
movsd xmm1, QWORD PTR .LC1[rip] # tmp = sqrt(2) compile-time constant
mov eax, 1000000
pxor xmm0, xmm0 # var = 0
.L2:
addsd xmm0, xmm1
addsd xmm0, xmm1
sub eax, 2
jne .L2
ret
.LC1:
.long 1719614413
.long 1073127582
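If you want that instruction-level parallelism without -ffast-math, you have to write the independent chains yourself, since splitting the sum changes the rounding order and the compiler isn't allowed to do that for you. A hand-rolled sketch:

#include <cmath>

double foo_two_chains() {
    double a = 0., b = 0.;                  // two independent dependency chains
    for (int i = 0; i < 1000000; i += 2) {
        a += std::sqrt(2.0);                // chain 1
        b += std::sqrt(2.0);                // chain 2: overlaps chain 1's addsd
    }
    return a + b;                           // may round differently than one chain
}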