Understanding why an ASM fsqrt implementation is faster than the standard sqrt function

Asked 22/9, 2016 at 6:8 Answered 25/9, 2016 at 3:31

I have playing around with basic math function implementations in C++ for academic purposes. Today, I benchmarked the following code for Square Root:

inline float sqrt_new(float n)
{
    __asm {
        fld n
            fsqrt
    }
}

I was surprised to see that it is consistently faster than the standard sqrt function (it takes around 85% of the execution time of the standard function).

I don't quite get why and would love to better understand it. Below I show the full code I am using to profile (in Visual Studio 2015, compiling in Release mode and with all optimizations turned on):

#include <iostream>
#include <random>
#include <chrono>
#define M 1000000

float ranfloats[M];

using namespace std;

inline float sqrt_new(float n)
{
    __asm {
        fld n
            fsqrt
    }
}

int main()
{
    default_random_engine randomGenerator(time(0));
    uniform_real_distribution<float> diceroll(0.0f , 1.0f);

chrono::high_resolution_clock::time_point start1, start2;
chrono::high_resolution_clock::time_point end1, end2;
float sqrt1 = 0;
float sqrt2 = 0;


for (int i = 0; i<M; i++) ranfloats[i] = diceroll(randomGenerator);


start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);
end1 = std::chrono::high_resolution_clock::now();

start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);
end2 = std::chrono::high_resolution_clock::now();

auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
auto time2  = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();


cout << "Time elapsed for SQRT1: " << time1 << " seconds" << endl;
cout << "Time elapsed for SQRT2: " << time2 << " seconds" << endl;

cout << "Average of Time for SQRT2 / Time for SQRT1: " << time2 / time1  << endl;
cout << "Equal to standard sqrt? " << (sqrt1 == sqrt2) << endl;



system("pause");
return 0;
}

EDIT: I am editing the question to include disassembly codes of both loops that calculate square roots as they came at Visual Studio 2015.

First, the disassembly for for (int i = 0; i<M; i++) sqrt1 += sqrt(ranfloats[i]);:

00091194 0F 5A C0             cvtps2pd    xmm0,xmm0  
00091197 E8 F2 18 00 00       call        __libm_sse2_sqrt_precise (092A8Eh)  
0009119C F2 0F 5A C0          cvtsd2ss    xmm0,xmm0  
000911A0 83 C6 04             add         esi,4  
000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]  
000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0  
000911AF 81 FE 90 5C 46 00    cmp         esi,offset __dyn_tls_dtor_callback (0465C90h)  
000911B5 7C D9                jl          main+190h (091190h)

Next, the disassembly for for (int i = 0; i<M; i++) sqrt2 += sqrt_new(ranfloats[i]);:

00091290 F3 0F 10 00          movss       xmm0,dword ptr [eax]  
00091294 F3 0F 11 44 24 6C    movss       dword ptr [esp+6Ch],xmm0  
0009129A D9 44 24 6C          fld         dword ptr [esp+6Ch]  
0009129E D9 FA                fsqrt  
000912A0 D9 5C 24 6C          fstp        dword ptr [esp+6Ch]  
000912A4 F3 0F 10 44 24 6C    movss       xmm0,dword ptr [esp+6Ch]  
000912AA 83 C0 04             add         eax,4  
000912AD F3 0F 58 44 24 54    addss       xmm0,dword ptr [esp+54h]  
000912B3 F3 0F 11 44 24 54    movss       dword ptr [esp+54h],xmm0  
000912B9 ??                   ?? ?? 
000912BA ??                   ?? ?? 
000912BB ??                   ?? ?? 
000912BC ??                   ?? ?? 
000912BD ??                   ?? ?? 
000912BE ??                   ?? ?? 
000912BF ??                   ?? ?? 
000912C0 ??                   ?? ?? 
000912C1 ??                   ?? ?? 
000912C2 ??                   ?? ?? 
000912C3 ??                   ?? ?? 
000912C4 ??                   ?? ?? 
000912C5 ??                   ?? ?? 
000912C6 ??                   ?? ?? 
000912C7 ??                   ?? ?? 
000912C8 ??                   ?? ?? 
000912C9 ??                   ?? ?? 
000912CA ??                   ?? ?? 
000912CB ??                   ?? ?? 
000912CC ??                   ?? ?? 
000912CD ??                   ?? ?? 
000912CE ??                   ?? ?? 
000912CF ??                   ?? ?? 
000912D0 ??                   ?? ?? 
000912D1 ??                   ?? ?? 
000912D2 ??                   ?? ?? 
000912D3 ??                   ?? ?? 
000912D4 ??                   ?? ?? 
000912D5 ??                   ?? ?? 
000912D6 ??                   ?? ?? 
000912D7 ??                   ?? ?? 
000912D8 ??                   ?? ?? 
000912D9 ??                   ?? ?? 
000912DA ??                   ?? ?? 
000912DB ??                   ?? ?? 
000912DC ??                   ?? ?? 
000912DD ??                   ?? ?? 
000912DE ??                   ?? ??

Hoehne answered 22/9, 2016 at 6:8 Comment(21)

One of the reasons may be that the standard sqrt comes from a linked library, thus it cannot be inlined and is prone to calling overhead, while your implementation is perfectly inlineable. – Cubical 22/9, 2016 at 6:15

I get the same times for both on my computer. Using the /fp:fast option makes the library version almost twice as fast. – Graffito 22/9, 2016 at 6:57

@Cubical - Doesn't vc have an intrinsic for sqrt? Since OP says he has "all optimizations turned on", I would expect it to be used. – Clapboard 22/9, 2016 at 7:24

@DavidWohlferd probably you are right. I'm not very experienced with MS compilers, so it was just a hypothesis. – Cubical 22/9, 2016 at 7:26

I guess if I really wanted to know the answer I'd output the the assembler (/FAs) and examine the code. – Clapboard 22/9, 2016 at 7:56

Your microbenchmark only tests sqrt throughput, not latency (sqrt throughput is lower than FP add latency on most CPUs; on Skylake you're almost testing that instead). Out-of-order execution can hide all the latency from any store-forwarding round-trips MSVC creates when you use inline asm.. I suspect that the standard-library function must be running enough extra instructions that you get a resource conflict (so its FSQRT doesn't start on the earliest possible cycle because p0 is busy) – Attempt 22/9, 2016 at 9:56

You didn't say what hardware you're testing on, BTW. If it's Intel Skylake, FSQRT has a throughput of one per 4-7 cycles, while SQRTPS (on a vector of four floats) has a throughput of one per 3 cycles. If you're going to write your own sqrt function that doesn't waste time setting errno and stuff, do it with <immintrin.h> intrinsics, because inline asm badly gimps the optimizer (even worse than in gcc, but see gcc.gnu.org/wiki/DontUseInlineAsm anyway). Use _mm_sqrt_ss(). – Attempt 22/9, 2016 at 9:59

@Cubical I tried without inlining the function, and the same results hold. – Hoehne 23/9, 2016 at 4:31

@PeterCordes Thanks for the tips! So, my machine runs a CPU Intel Core i7 4700-HQ and Windows 8.1 64bits. – Hoehne 23/9, 2016 at 4:33

@Hoehne Did you disassemble compiler's output? Compiler could inline it automatically. Omitting inline keyword does not mean that compiler will not inline it. Disassembling is the only reliable way to tell if it is inlined or not. – Cubical 23/9, 2016 at 4:41

@RossRidge /fp:fast would affect the accuracy of all floating-point operations, not only of sqrt. – Hoehne 23/9, 2016 at 4:54

@Cubical No I did not. But actually, since there are still few clues on what is going on, I think the best thing to do is actually to edit the question with the full disassembly – Hoehne 23/9, 2016 at 4:56

Yep, disassembly for each of the two loops would be good. The one with your function inlined, and the one that calls or inlines the sqrt library function. – Attempt 23/9, 2016 at 4:58

/fp:fast is generally not that dangerous, unless you've specifically chosen the order of operations for some reason, or you expect to encounter NaNs and still get useful results. e.g. if you know your array alternates between large and small numbers, and you don't want auto-vectorization to end up adding just the small numbers together until they're big enough to be > 1 ulp of the sum of the large numbers. Or if you need exactly reproducible output across different builds with different compiler versions. – Attempt 23/9, 2016 at 5:2

The /fp:fast option doesn't so much affect accuracy, so much as conformance. It can actually make results more precise. Since your sqrt implementation isn't compliant (eg. you don't set errno when taking the sqrt of a negative number), it's actually a fair comparison in this case. – Graffito 23/9, 2016 at 5:3

@PeterCordes Here it goes. I included the disassembly codes for both loops that calculate square roots. – Hoehne 24/9, 2016 at 3:20

You left out the end of the loop for the second loop. Also, apparently you're calling sqrt(double) not sqrtf(float). That's weird, I thought C++ took care of this. Hmm, I don't see #include <cmath> or math.h, like en.cppreference.com/w/cpp/numeric/math/sqrt says you need for std::sqrt type overloads. Note the cvtps2pd conversion to double and back, and the store/reload inside the inner loop in the version that calls the library function. That's some pretty dumb compiler output (I thought the Windows ABI had some call-preserved XMM registers which would be perfect for this). – Attempt 24/9, 2016 at 3:35

Anyway, both of those loops are horrible, and might not even be bottlenecking on the sqrt instruction. Your Haswell CPU has a FSQRT throughput of one per 8-17 cycles (vs. one per 7 cycles for SQRTSS/SQRTPS (SSE vector of 4 floats)). And neither of them auto-vectorized. Even without associative FP math, a good compiler should be able to use SQRTPS and bottleneck on FP add latency, instead of sqrt throughput. Of course, it probably can't do that without /fp:fast because it doesn't know that the code will run with FP exceptions masked, I guess. – Attempt 24/9, 2016 at 3:42

What are the actual execution times, in cycles per iteration? – Attempt 24/9, 2016 at 3:45

oh BTW, you also left out the first instruction of the first loop. The JL target is 00091190, which I assume is movss xmm0, [esi]. – Attempt 24/9, 2016 at 3:53

related: https://mcmap.net/q/1923201/-call-asm-sqrtsd-under-a-c – Attempt 24/9, 2016 at 4:5

Both your loops come out pretty horrible, with many bottlenecks other than the sqrt function call or the FSQRT instruction. And at least 2x slower than optimal scalar SQRTSS (single-precision) code could do. And that's maybe 8x slower than what a decent SSE2 vectorized loop could achieve. Even without reordering any math operations, you could beat SQRTSS throughput.

Many of the reasons from https://gcc.gnu.org/wiki/DontUseInlineAsm apply to your example. The compiler won't be able to propagate constants through your function, and it won't know that the result is alway non-negative (if it isn't NaN). It also won't be able to optimize it into an fabs() if you later square the number.

Also highly important, you defeat auto-vectorization with SSE2 SQRTPS (_mm_sqrt_ps()). A "no-error-checking" scalar sqrt() function using intrinsics also suffers from that problem. IDK if there's any way to get optimal results without /fp:fast, but I doubt it. (Other than writing a whole loop in assembly, or vectorizing the whole loop yourself with intrinsics).

It's pretty impressive that your Haswell CPU manages to run the function-call loop as fast as it does, although the inline-asm loop may not even be saturating FSQRT throughput either.

For some reason, your library function call is calling double sqrt(double), not the C++ overload float sqrt(float). This leads to a conversion to double and back to float. Probably you need to #include <cmath> to get the overloads, or you could call sqrtf(). gcc and clang on Linux call sqrtf() with your current code (without converting to double and back), but maybe their <random> header happens to include <cmath>, and MSVC's doesn't. Or maybe there's something else going on.

The library function-call loop keeps the sum in memory (instead of a register). Apparently the calling convention used by the 32-bit version of __libm_sse2_sqrt_precise doesn't preserve any XMM registers. The Windows x64 ABI does preserve XMM6-XMM15, but wikipedia says this is new and the 32-bit ABI didn't do that. I assume if there were any call-preserved XMM registers, MSVC's optimizer would take advantage of them.

Anyway, besides the throughput bottleneck of calling sqrt on each independent scalar float, the loop-carried dependency on sqrt1 is a latency bottleneck that includes a store-forwarding round trip:

000911A3 F3 0F 58 44 24 4C    addss       xmm0,dword ptr [esp+4Ch]  
000911A9 F3 0F 11 44 24 4C    movss       dword ptr [esp+4Ch],xmm0

Out of order execution lets rest of the code for each iteration overlap, so you just bottleneck on throughput, but no matter how efficient the library sqrt function is, this latency bottleneck limits the loop to one iteration per 6 + 3 = 9 cycles. (Haswell ADDSS latency = 3, store-forwarding latency for XMM load/store = 6 cycles. 1 cycle more than store-forwarding for integer registers. See Agner Fog's instruction tables.)

SQRTSD has a throughput of one per 8-14 cycles, so the loop-carried dependency is not the limiting bottleneck on Haswell.

The inline-asm version with has a store/reload round trip for the sqrt result, but it's not part of the loop-carried dependency chain. MSVC inline-asm syntax makes it hard to avoid store-forwarding round trips to get data into / out of inline asm. But worse, you produce the result on the x87 stack, and the compiler wants to do SSE math in XMM registers.

And then MSVC shoots itself in the foot for no reason, keeping the sum in memory instead of in an XMM register. It looks inside inline-asm statements to see which registers they affect, so IDK why it doesn't see that your inline-asm statement doesn't clobber any XMM regs.

So MSVC does a much worse job than necessary here:

00091290     movss       xmm0,dword ptr [eax]       # load from the array
00091294     movss       dword ptr [esp+6Ch],xmm0   # store to the stack
0009129A     fld         dword ptr [esp+6Ch]        # x87 load from stack
0009129E     fsqrt  
000912A0     fstp        dword ptr [esp+6Ch]        # x87 store to the stack
000912A4     movss       xmm0,dword ptr [esp+6Ch]   # SSE load from the stack (of sqrt(array[i]))
000912AA     add         eax,4  
000912AD     addss       xmm0,dword ptr [esp+54h]   # SSE load+add of the sum
000912B3     movss       dword ptr [esp+54h],xmm0   # SSE store of the sum

So it has the same loop-carried dependency chain (ADDSS + store-forwarding) as the function-call loop. Haswell FSQRT has one per 8-17 cycle throughput, so probably it's still the bottleneck. (All the stores/reloads involving the array value are independent for each iteration, and out-of-order execution can overlap many iterations to hide that latency chain. However, they will clog up the load/store execution units and sometimes delay the critical-path loads/stores by an extra cycle. This is called a resource conflict.)

Without /fp:fast, the sqrtf() library function has to set errno if the result is NaN. This is why it can't inline to just a SQRTSS.

If you did want to implement a no-checks scalar sqrt function yourself, you'd do it with Intel intrinsics syntax:

// DON'T USE THIS, it defeats auto-vectorization
static inline
float sqrt_scalar(float x) {
    __m128 xvec = _mm_set_ss(x);
    xvec =  _mm_cvtss_f32(_mm_sqrt_ss(xvec));
}

This compiles to a near-optimal scalar loop with gcc and clang (without -ffast-math). See it on the Godbolt compiler explorer:

# gcc6.2 -O3  for the sqrt_new loop using _mm_sqrt_ss.  good scalar code, but don't optimize further.
.L10:
    movss   xmm0, DWORD PTR [r12]
    add     r12, 4
    sqrtss  xmm0, xmm0
    addss   xmm1, xmm0
    cmp     r12, rbx
    jne     .L10

This loop should bottleneck only on SQRTSS throughput (one per 7 clocks on Haswell, notably faster than SQRTSD or FSQRT), and with no resource conflicts. However, it's still garbage compared to what you could do even without re-ordering the FP adds (since FP add/mul aren't truly associative): a smart compiler (or programmer using intrinsics) would use SQRTPS to get 4 results with the same throughput as 1 result from SQRTSS. Unpack the vector of SQRT results to 4 scalars, and then you can keep exactly the same order of operations with identical rounding of intermediate results. I'm disappointed that clang and gcc didn't do this.

However, gcc and clang do manage to actually avoid calling a library function. clang3.9 (with just -O3) uses SQRTSS without even checking for NaN. I assume that's legal, and not a compiler bug. Maybe it sees that the code doesn't use errno?

gcc6.2 on the other hand speculatively inlines sqrtf(), with a SQRTSS and a check on the input to see if it needs to call the library function.

# gcc6.2's sqrt() loop, without -ffast-math.
# speculative inlining of SQRTSS with a check + fallback
# spills/reloads a lot of stuff to memory even when it skips the call :(

# xmm1 = 0.0  (gcc -fverbose-asm says it's holding sqrt2, which is zero-initialized, so I guess gcc decides to reuse that zero)
.L9:
    movss   xmm0, DWORD PTR [rbx]
    sqrtss  xmm5, xmm0
    ucomiss xmm1, xmm0                  # compare input against 0.0
    movss   DWORD PTR [rsp+8], xmm5
    jbe     .L8                         # if(0.0 <= SQRTSS input || unordered(0.0, input)) { skip the function call; }
    movss   DWORD PTR [rsp+12], xmm1    # silly gcc, this store isn't needed.  ucomiss doesn't modify xmm1
    call    sqrtf                       # called for negative inputs, but not for NaN.
    movss   xmm1, DWORD PTR [rsp+12]
.L8:
    movss   xmm4, DWORD PTR [rsp+4]     # silly gcc always stores/reloads both, instead of putting the stores/reloads inside the block that the jbe skips
    addss   xmm4, DWORD PTR [rsp+8]
    add     rbx, 4
    movss   DWORD PTR [rsp+4], xmm4
    cmp     rbp, rbx
    jne     .L9

gcc unfortunately shoots itself in the foot here, the same way MSVC does with inline-asm: There's a store-forwarding round trip as a loop-carried dependency. All the spill/reloads could be inside the block skipped by the JBE. Maybe gcc things negative inputs will be common.

Even worse, if you do use /fp:fast or -ffast-math, even a clever compiler like clang doesn't manage to rewrite your _mm_sqrt_ss into a SQRTPS. Clang is normally pretty good at not just mapping intrinsics to instructions 1:1, and will come up with more optimal shuffles and blends if you miss an opportunity to combine things.

So with fast FP math enabled, using _mm_sqrt_ss is a big loss. clang compiles the sqrt() library function call version into RSQRTPS + a newton-raphson iteration.

Also note that your microbenchmark code isn't sensitive to the latency of your sqrt_new() implementation, only the throughput. Latency often matters in real FP code, not just throughput. But in other cases, like doing the same thing independently to many array elements, latency doesn't matter, because out-of-order execution can hide it well enough by having instructions in flight from many loop iterations.

As I mentioned earlier, latency from theextra store/reload round trip your data takes on its way in/out of MSVC-style inline-asm is a serious problem here. When MSVC inlines the function, the fld n doesn't come directly from the array.

BTW, Skylake has SQRTPS/SS throughput of one per 3 cycles, but still 12 cycle latency. SQRTPD/SD throughput = one per 4-6 cycles, latency = 15-16 cycles. So FP square root is more pipelined on Skylake than on Haswell. This magnifies the difference between benchmarking FP sqrt latency vs. throughput.

Attempt answered 25/9, 2016 at 3:31 Comment(0)

compiling in Release mode and with all optimizations turned on

They are not all turned on, you missed one. In the IDE it is Project > Properties > C/C++ > Code Generation > Floating Point Model. You left it at its default setting, /fp:precise. That has a very visible side-effect on the generated machine code:

  00091197 E8 F2 18 00 00       call        __libm_sse2_sqrt_precise (092A8Eh)

Perhaps it is intuitive enough that calling a helper function in the CRT is always slower than a inline instruction like FSQRT.

There is a lot to say about the exact semantics of /fp, the MSDN article about it is not very good. It is also hard to reverse-engineer, Microsoft purchased the code from Intel and could not obtain a source license that allowed them to re-publish the assembly code. Its original goal was certainly to deal with the horrid floating point consistency problems caused by Intel's 8087 FPU design. That is not so relevant today anymore, all mainstream C and C++ compilers now emit SSE2 code. MSVC++ does so since VS2012. These Intel library functions now mainly ensure that floating point operations still produce results that are consistent with older versions of the compiler.

__libm_sse2_sqrt_precise() does rather a lot. At the considerable risk of trying to document an undocumented function, I think I see it:

check the value of the MXCSR register, the control register for SSE. If it doesn't have its default value (0x1F80) then the code assumes that the programmer has used _control_fp() and jumps to the legacy sqrt() implementation that calculates with FPU semantics.
check the value of the FPU control register to see if floating point exceptions are enabled. Normally off, if any are enabled then it jumps to sqrt() as above.
checks if the argument is less than zero, but not minus 0, calls the user-supplied _matherr() function.
finally computes the result with the SQRTSD instruction.

None of this actually having anything to do with precision :) Seeing this execute at 85% perf is rather a good result aided however by FSQRT being substantially slower than SQRTSD. The latter got a lot more silicon love in modern processors.

If you care about fast floating point operations then change the setting to /fp:fast. Which produces:

  00D91310  sqrtsd      xmm0,xmm0

An inline instruction instead of a library call. In other words, skips the first 3 bullets in the previous list. Also beats FSQRT handily.

Regurgitation answered 24/9, 2016 at 10:26 Comment(5)

Great insight, thanks! For testing purposes, and if I can't go with the /fp:fast for the overall application, what would be the best way to specify a custom sqrt_new() function that boils down to just sqrtsd xmm0,xmm0? I think that would be the ideal comparison for my case. I tried float sqrt_new(float n) { _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(n))); }, but it's actually much slower than standard sqrt(n) whith /fp:fast – Hoehne 24/9, 2016 at 19:37

The setting applies to each source code file. So just move the "needs to be fast" code. – Regurgitation 24/9, 2016 at 20:50

Holy crap, that library function sounds worse than I could have imagined. I think glibc's libm sqrt just has to check for NaN and set errno, not check the MXCSR. Any thoughts on why the OP's code is converting to double at all, though? Is that just because he didn't #include <cmath>, so the float overload for sqrt(float) isn't available? But something must still declare sqrt(double) or it wouldn't compile, right? – Attempt 24/9, 2016 at 21:45

@Louis15: It's normal for /fp:fast to make the whole thing faster than scalar sqrt. It allows the whole thing (including the add) to vectorize (by treating FP math as associative). Your sqrt_new still does scalar sqrt instead of 4 at once (SQRTPS). Or maybe SQRTPD if your code is still calling sqrt(double) instead of sqrt(float). – Attempt 24/9, 2016 at 21:49

It is a library function with a 1-800 support phone number, hard to compare. I only focused on the OP's machine code, didn't try the source at all. – Regurgitation 24/9, 2016 at 21:49

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags