Is there a fast C or C++ standard library function for double precision inverse square root?

Asked 16/10, 2012 at 21:25 Answered 7/6, 2020 at 1:7

I find myself typing

double foo=1.0/sqrt(...);

a lot, and I've heard that modern processors have built-in inverse square root opcodes.

Is there a C or C++ standard library inverse square root function that

1. uses double precision floating point?
2. is as accurate as 1.0/sqrt(...)?
3. is just as fast or faster than the result of 1.0/sqrt(...)?

Agma answered 16/10, 2012 at 21:25 Comment(8)

@Pherric Oxide: That was inverse square, not inverse square root. – Agma 16/10, 2012 at 21:28

#define INSQRT(x) (1.0/sqrt(x)) – Kroo 16/10, 2012 at 21:29

Must it work in "C or C++" or "C and C++"? – Brewer 16/10, 2012 at 21:37

Is there a way you can rearrange the maths to do intermediate work in squares, then take a minimal number of square roots at the end? – Munmro 16/10, 2012 at 21:43

The built-in inverse square root instruction that you've heard of is an approximation, not as exact as sqrt. See tommesani.com/SSEReciprocal.html – Chee 16/10, 2012 at 21:47

Not as fast as in Quake III Arena perhaps. – Undressed 16/10, 2012 at 21:48

@Mark Ransom: That's basically the answer I was looking for. – Agma 17/10, 2012 at 1:4

bugs.llvm.org/show_bug.cgi?id=20900 – Mannerism 6/9, 2018 at 5:16

No. No, there isn't. Not in C++. Nope.

Meliorate answered 16/10, 2012 at 21:26 Comment(0)

You can use this function for a faster, approximate inverse square root computation.
There's an article on Wikipedia on how it works: https://en.wikipedia.org/wiki/Fast_inverse_square_root
Here is a C version of the algorithm (single precision only).

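#include <stdint.h>

/* Single-precision approximation of 1/sqrt(number). */
float invSqrt( float number ){
    union {
        float f;
        uint32_t i;
    } conv;

    float x2;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    conv.f  = number;
    conv.i  = 0x5f3759df - ( conv.i >> 1 );   /* initial guess from the bit pattern */
    conv.f  = conv.f * ( threehalfs - ( x2 * conv.f * conv.f ) );   /* one Newton-Raphson step */
    return conv.f;
}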

Neogene answered 22/12, 2018 at 4:58 Comment(3)

Reading about this is what gave me the idea for the question. An article about the fast inverse square root said that some hardware had inverse square root instructions because of how much it comes up in graphics code. I couldn't use this algorithm at the time because I needed full double precision, but have an upvote for anyone reading this answer who hasn't heard of it :). – Agma 22/12, 2018 at 5:12

The "naive" version actually compiles down to fewer instructions: godbolt.org/z/aGWbP5qvs – Manlove 27/10, 2023 at 8:14

Important to mention: in C++, type punning using a union is UB. Use C++20 std::bit_cast or memcpy the float data into the uint32_t. – Une 20/9 at 14:41
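
Following up on the union comment, here is a minimal C++ sketch of the same algorithm that copies the bits with memcpy instead of punning through a union. The name fast_inv_sqrt is made up for illustration. It is still only a single-precision approximation, accurate to a fraction of a percent after the one Newton-Raphson step, so it does not meet the double-precision requirement in the question.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative name; single-precision approximation of 1/sqrt(number).
// Assumes number is positive and finite.
float fast_inv_sqrt(float number) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

    std::uint32_t bits;
    std::memcpy(&bits, &number, sizeof bits);   // well-defined way to read the bit pattern
    bits = 0x5f3759dfu - (bits >> 1);           // magic-constant initial guess

    float y;
    std::memcpy(&y, &bits, sizeof y);           // back to float
    return y * (1.5f - 0.5f * number * y * y);  // one Newton-Raphson step
}
```

Compilers generally turn these fixed-size memcpy calls into plain register moves, so this should cost the same as the union version.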

I don't know of a standardized C API for this, but that does not mean you cannot use the fast inverse sqrt instructions, as long as you are willing to write platform-dependent intrinsics.

Let's take 64-bit x86 with AVX as an example, where you can use _mm256_rsqrt_ps() to approximate the reciprocal of a square root. More specifically, it computes eight single-precision reciprocal square roots in a single go, using SIMD.

#include <immintrin.h>

...

float inputs[8] __attribute__ ((aligned (32))) = { ... };  // the attribute must come before the initializer
__m256 input = _mm256_load_ps(inputs);
__m256 invroot = _mm256_rsqrt_ps(input);

Similarly, you can use the intrinsic vrsqrteq_f32 on ARM with NEON. In this case, the SIMD is 4-wide, so it will compute four inverse square roots in a single go.

#include <arm_neon.h>

...

float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);

Even if you need just one root value per batch, it is still faster than a full square root. Just broadcast the input to all lanes of the SIMD register, or set it in a single lane. That way, you do not have to go through memory with a load operation. On x86, the broadcast is done via _mm256_set1_ps(x).
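
For the single-value case, a minimal sketch might look like the following. It swaps in the scalar SSE intrinsic _mm_rsqrt_ss rather than the AVX one above, because SSE is available on every x86-64 target without extra compiler flags. The helper names rsqrt_estimate and rsqrt_refined and the HAVE_SSE_RSQRT macro are made up for illustration, and the non-x86 branch is just a plain fallback, not an approximation.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1   // hypothetical feature macro for this sketch
#endif

// Approximate 1/sqrt(x) for one float (roughly 12 bits of precision on x86).
inline float rsqrt_estimate(float x) {
#ifdef HAVE_SSE_RSQRT
    // Put x in the low lane, run RSQRTSS on it, read the low lane back.
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return 1.0f / std::sqrt(x);   // portable fallback: exact sqrt and divide
#endif
}

// One Newton-Raphson step roughly doubles the number of correct bits.
inline float rsqrt_refined(float x) {
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

Whether the refined version still beats 1.0f/sqrtf(x) depends on the CPU, so it is worth benchmarking before relying on it.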

Lefebvre answered 7/6, 2020 at 1:7 Comment(1)

When you say "it is still faster than a full square root", do you mean "it is still faster than an inverse and square root"? – Iraidairan 25/5 at 2:33

This violates constraints 1 and 2 (and it's also not standard), but it might still help someone browsing through...

I used ASMJIT to just-in-time compile the exact assembly operation you're looking for: RSQRTSS (single precision, ok, but it should be similar with double).

My code is this (cf. also my answer in a different post):

   typedef float(*JITFunc)();

   JITFunc func;
   asmjit::JitRuntime jit_runtime;
   asmjit::CodeHolder code;
   code.init(jit_runtime.getCodeInfo());

   asmjit::X86Compiler cc(&code);
   cc.addFunc(asmjit::FuncSignature0<float>());

   float value = 2.71; // Some example value.
   asmjit::X86Xmm x = cc.newXmm();
   uint32_t *i = reinterpret_cast<uint32_t*>(&value);
   cc.mov(asmjit::x86::eax, i[0]);
   cc.movd(x, asmjit::x86::eax);

   cc.rsqrtss(x, x);   // THE asm function.

   cc.ret(x);

   cc.endFunc();
   cc.finalize();

   jit_runtime.add(&func, &code);

   // Now, func() can be used as the result to rsqrt(value).

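If you do the JIT compilation only once and call the result later with different values, this should be faster than 1.0/sqrt(...), though slightly less accurate, which is inherent to the built-in operations you're talking about. Note that the snippet above bakes value into the generated code, so the function would have to take an argument to be reusable that way.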
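
On the double-precision point: SSE and AVX have no double-precision counterpart of RSQRTSS (AVX-512 adds VRSQRT14SD), so one common workaround is to start from a low-precision estimate and polish it with Newton-Raphson steps carried out in double. A minimal sketch, assuming a hypothetical helper refine_inv_sqrt that receives the estimate as a parameter:

```cpp
// Hypothetical helper: refines an approximate 1/sqrt(x) in double precision.
// x     : positive, finite input
// y0    : low-precision estimate of 1/sqrt(x), e.g. from RSQRTSS widened to double
// steps : number of Newton-Raphson iterations to apply
inline double refine_inv_sqrt(double x, double y0, int steps) {
    double y = y0;
    const double half_x = 0.5 * x;
    for (int i = 0; i < steps; ++i) {
        y = y * (1.5 - half_x * y * y);   // y_{n+1} = y_n * (3/2 - x/2 * y_n^2)
    }
    return y;
}
```

Each step roughly doubles the number of correct bits, so about three steps from a 12-bit estimate get close to double precision. The result is not guaranteed to match 1.0/sqrt(x) bit for bit, and the extra multiplications may eat up the speed advantage.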

Crackleware answered 30/10, 2019 at 10:10 Comment(0)

-3

If you're not afraid of using your own functions, try the following:

#include <cmath>

template <typename T>
T invsqrt(T x)
{
    // T(1) keeps the division in T; a 1.0 literal would promote float to double.
    return T(1) / std::sqrt(x);
}

It should be just as fast as the original 1.0 / std::sqrt(x) with any modern optimizing compiler. Also, it can be used with doubles or floats.

Brewer answered 16/10, 2012 at 21:42 Comment(4)

violates rule#3 in the question! – Kroo 16/10, 2012 at 21:53

Sorry, as I understand, it should be "just as fast". – Brewer 16/10, 2012 at 21:54

Read #2442858 to see why or why not template functions should be slower than non-templated code. – Brewer 16/10, 2012 at 21:57

Also, if you enable -ffast-math in gcc, it will use an approximation to inverse square root. This would ensure it is just as fast/faster than regular square root. – Heber 16/9, 2016 at 22:52
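
To make the -ffast-math comment concrete, here is a small sketch with two illustrative helpers, inv_sqrt_f and inv_sqrt_d. As far as I know, GCC on x86 only uses the RSQRTSS/RSQRTPS approximation for single precision under -ffast-math; the double version is typically still compiled to a square root followed by a division. Treat the compiler behaviour described in the comments as an assumption to check against your own compiler's output.

```cpp
#include <cmath>

// Built with e.g. `g++ -O2 -ffast-math` on x86, GCC may replace this with
// RSQRTSS plus a Newton-Raphson step. Without the flag it stays an exact
// square root followed by a division.
inline float inv_sqrt_f(float x) {
    return 1.0f / std::sqrt(x);
}

// There is no SSE/AVX reciprocal square root estimate for double, so this is
// typically compiled to SQRTSD followed by DIVSD even with -ffast-math.
inline double inv_sqrt_d(double x) {
    return 1.0 / std::sqrt(x);
}
```

Keep in mind that -ffast-math also relaxes other floating-point guarantees (NaN and infinity handling, reassociation), so it affects more than this one expression.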

-5

Why not try this? #define INSQRT(x) (1.0/sqrt(x))

It's just as fast, requires less typing (it feels like a function), uses double precision, and is as accurate as 1.0/sqrt(...).

Kroo answered 16/10, 2012 at 21:31 Comment(4)

I didn't downvote, but there's no use for a macro here when a function will do. (You even said it yourself: make it feel like a function? Just actually make a function.) – Courtmartial 16/10, 2012 at 21:43

@Courtmartial The reason I didn't convert it to a function is, because the question clearly mentions: "is just as fast or faster than the result of 1.0/sqrt(...)". Making it a function will add additional overhead making the "statement" 1.0/sqrt(...) SLOWER. – Kroo 16/10, 2012 at 21:48

Not on any compiler from the last decade. – Courtmartial 16/10, 2012 at 21:54

@PrototypeStark: Please provide benchmarks to back up your claim that using a real function will be slower. Macros may safely be avoided in the absence of evidence that they are required to meet some criterion. That said, I always carry my #define isNaN(x) ((x)!=(x)) around with me; sometimes it just feels good to be so bad. – Meliorate 17/10, 2012 at 5:56

-5

If you find yourself writing the same thing over and over, you should think to yourself "function!":

#include <cmath>

double invsqrt(const double x)
{
    return 1.0 / std::sqrt(x);
}

Now the code is more self-documenting: people don't have to deduce 1.0 / std::sqrt(x) is the inverse square root, they read it. Additionally, you now get to plug in whatever implementation you want and each call-site automatically uses the updated definition.

To answer your question: no, there is no C(++) function for it. But now that you've made one, you can substitute your own definition if you find the performance lacking.
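
A minimal sketch of that idea, assuming hypothetical float and double overloads: call sites only ever write invsqrt(x), each call stays in its own precision, and these two bodies are the only place that changes if a faster implementation is swapped in later.

```cpp
#include <cmath>

// Overloads keep each call in its own precision. A macro such as
// (1.0/sqrt(x)) always divides in double because 1.0 is a double literal.
inline float invsqrt(float x) {
    return 1.0f / std::sqrt(x);   // swap in an approximation here if profiling justifies it
}

inline double invsqrt(double x) {
    return 1.0 / std::sqrt(x);    // exact square root and division
}
```

Calling invsqrt(4.0f) returns a float and invsqrt(4.0) returns a double, without any change at the call site.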

Courtmartial answered 16/10, 2012 at 21:41 Comment(9)

violates rule#3 in the question – Kroo 16/10, 2012 at 21:52

why rely on the compiler when you can use the preprocessor? I still think I didn't deserve the -ve vote :-( – Kroo 16/10, 2012 at 21:55

@PrototypeStark: Because it's not as simple as either-or. One is type-checked, debuggable, scopable, overloadable, evaluates its argument as an expression once, etc. (all the features of a function), the other is not. And it's a single downvote, it's not the end of the world; I understand it's frustrating not getting a reason from the person themselves, but that's how it is. – Courtmartial 16/10, 2012 at 22:3

He just downvoted and left; yeah, it's frustrating. I find it funny too, though. – Kroo 16/10, 2012 at 22:6

I would argue that it is easier to read 1.0/sqrt(x) as the inverse square root than invsqrt(x), the former using the less ambiguous mathematical notation as opposed to an abbreviation. – Agma 17/10, 2012 at 1:10

@Dan: That's fine, it's up to you what you find easy to read of course. But over time I think you'll find it's much better in principle to hide details, including what constitutes an inverse square root. – Courtmartial 17/10, 2012 at 1:17

"inverse square root" just means the reciprocal of the square root, though. It's not a detail, it's literally what the function name means. – Agma 17/10, 2012 at 1:28

@Dan: You're conflating implementation with specification. You're right, inverse square root is the reciprocal of the square root, but how do you get from that to 1.0 / sqrt(x)? It's not hard, of course, but that's not the point: it's still a division from specification to implementation. Hide the implementation, keep the specification; it makes reasoning about and maintaining your program easier. Consider how easy it would be to optimize every single inverse square root calculation in your entire program, just by changing the implementation and keeping the specification. – Courtmartial 17/10, 2012 at 1:37

@GManNickG: While I'd usually be the first to agree with that logic, there are limits. You wouldn't write a function multiplyByTwo -- you'd write *2. Personally I'd say the inverse square root example is right on the borderline. – Meliorate 17/10, 2012 at 5:55
