I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!
I'm testing it with a loop something like:
inline float TestSqrtFunction( float in );
void TestFunc()
{
#define ARRAYSIZE 4096
#define NUMITERS 16386
    float flIn[ ARRAYSIZE ];  // filled with random numbers ( 0 .. 2^22 )
    float flOut[ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

    cyclecounter.Start();
    for ( int i = 0 ; i < NUMITERS ; ++i )
        for ( int j = 0 ; j < ARRAYSIZE ; ++j )
        {
            flOut[j] = TestSqrtFunction( flIn[j] );
            // unrolling this loop makes no difference -- I tested it.
        }
    cyclecounter.Stop();
    printf( "%d loops over %d floats took %.3f milliseconds",
            NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}
I've tried this with a few different bodies for TestSqrtFunction, and I've gotten some timings that really have me scratching my head. The worst by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float using the x87 FPU, this was pathetically bad:
inline float TestSqrtFunction( float in )
{ return sqrt(in); }
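(As an aside: sqrt() here is the double-precision math.h call, so the float gets promoted to double and back. The single-precision variant would be the following, though I haven't measured how much of the 24ns is that promotion versus the x87 fsqrt itself.)
inline float TestSqrtFunction( float in )
{ return sqrtf(in); }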
The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:
inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
    _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
    // compiles to movss, sqrtss, movss
}
This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
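For reference, the approximation I mean is the well-known bit hack: an integer magic-constant guess refined by one Newton-Raphson iteration. Roughly like this — the exact constant and iteration count vary between versions, and the function name here is mine:
inline float CarmackInvSqrt( float x )   // approximate 1/sqrt(x)
{
    float xhalf = 0.5f * x;
    int i = *(int *)&x;                  // reinterpret the float's bits as an int
    i = 0x5f3759df - ( i >> 1 );         // magic initial guess
    x = *(float *)&i;
    return x * ( 1.5f - xhalf * x * x ); // one Newton-Raphson step
}
sqrt(x) is then recovered as x * CarmackInvSqrt(x).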
The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:
inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
    __m128 in = _mm_load_ss( pIn );
    _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
    // compiles to movss, movaps, rsqrtss, mulss, movss
}
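A tangent I didn't benchmark: if 2^-14 weren't accurate enough, a single Newton-Raphson step on the rsqrtss estimate should roughly double the good bits for a few extra ops. A sketch, with my own naming:
inline void SSESqrt_Recip_NR( float * restrict pOut, float * restrict pIn )
{
    __m128 in  = _mm_load_ss( pIn );
    __m128 est = _mm_rsqrt_ss( in );              // ~12-bit estimate of 1/sqrt(x)
    // refine: e' = e * ( 1.5 - 0.5 * x * e * e )
    est = _mm_mul_ss( est, _mm_sub_ss( _mm_set_ss( 1.5f ),
              _mm_mul_ss( _mm_mul_ss( _mm_set_ss( 0.5f ), in ),
                          _mm_mul_ss( est, est ) ) ) );
    _mm_store_ss( pOut, _mm_mul_ss( in, est ) );  // sqrt(x) = x * 1/sqrt(x)
}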
My question is basically: what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?
I'm sure that this is really the cost of the op itself, because I've verified:
- all data fits in cache, and accesses are sequential
- the functions are inlined
- unrolling the loop makes no difference
- compiler flags are set to full optimization (and the assembly is good, I checked)
(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance, for use in code that can't be vectorized.)
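For completeness, the packed flavor he means would look something like this (again, not what I'm measuring; assumes 16-byte-aligned pointers):
inline void SSESqrt_Recip_Packed( float * restrict pOut, float * restrict pIn )
{
    __m128 in = _mm_load_ps( pIn );   // four floats at a time
    _mm_store_ps( pOut, _mm_mul_ps( in, _mm_rsqrt_ps( in ) ) );
    // should compile to movaps, rsqrtps, mulps, movaps
}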
Comments:

Why not define SSESqrt so that it takes a float and returns a float instead of taking 2 pointers? – Mislike

You could write it as inline float SSESqrt( float fIn ) { float fOut; _mm_store_ss( &fOut, _mm_sqrt_ss( _mm_load_ss( &fIn ) ) ); return fOut; }. But this is a bad idea because it can easily induce a load-hit-store stall if the CPU writes the floats to the stack and then reads them back immediately -- juggling from the vector register to a float register for the return value in particular is bad news. Besides, the underlying machine opcodes that the SSE intrinsics represent take address operands anyway. – Brassard

I have a normalize(Vector3f& v) function that does v *= invSqrt(norm2(v));. Does that mean I'm facing 2 LHS situations: first when invSqrt reads the result of norm2, second when *= reads the result of invSqrt? (And do I really need to worry about it on x86?) – Mislike

A round trip between xmm0 and an integer register (like eax) is very bad, while a round trip between xmm0 and stack and back isn't, because of Intel's store-forwarding. You can time it yourself to see for sure. Generally the easiest way to see potential LHS is to look at the emitted assembly and see where data is juggled between register sets; your compiler might do the smart thing, or it might not. As to normalizing vectors, I wrote up my results here: bit.ly/9W5zoU – Brassard

Instead of _mm_store_ss/_mm_load_ss, use return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(fIn)));! There's no need for a round trip to memory. This is particularly useful in code where the compiler is using scalar SSE for math already. – Spiritualty
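Combining that last suggestion with the rsqrt trick gives a register-only version; a sketch (naming mine):
inline float SSESqrt_Recip_Reg( float fIn )
{
    __m128 in = _mm_set_ss( fIn );
    return _mm_cvtss_f32( _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
    // no memory round trip: everything stays in xmm registers
}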