Is it possible to write Quake's fast InvSqrt() function in Rust?

This is just to satisfy my own curiosity.

Is there an implementation of this:

float InvSqrt (float x)
{
   float xhalf = 0.5f*x;
   int i = *(int*)&x;
   i = 0x5f3759df - (i>>1);
   x = *(float*)&i;
   x = x*(1.5f - xhalf*x*x);
   return x;
}

in Rust? If it exists, post the code.

I tried it and failed. I don't know how to encode the float number using integer format. Here is my attempt:

fn main() {
    println!("Hello, world!");
    println!("sqrt1: {}, ",sqrt2(100f64));
}

fn sqrt1(x: f64) -> f64 {
    x.sqrt()
}

fn sqrt2(x: f64) -> f64 {
    let mut x = x;
    let xhalf = 0.5*x;
    let mut i = x as i64;
    println!("sqrt1: {}, ", i);

    i = 0x5f375a86 as i64 - (i>>1);

    x = i as f64;
    x = x*(1.5f64 - xhalf*x*x);
    1.0/x
}

Reference:
1. Origin of Quake3's Fast InvSqrt() - Page 1
2. Understanding Quake’s Fast Inverse Square Root
3. FAST INVERSE SQUARE ROOT.pdf
4. source code: q_math.c#L552-L572

Ischium answered 28/11, 2019 at 4:45 Comment(7)
the C# Version: Is it possible to write Quake's fast InvSqrt() function in C#?Ischium
As I understand it, this code is UB in C due to violating the strict aliasing rule. The standard-blessed way to perform this kind of type punning is with a union.Squander
@trentcl: I don't think union works either. memcpy definitely works, though it's verbose.Metaphysic
@MatthieuM. Type punning with unions is perfectly valid C, but not valid C++.Lazare
@Bergi There is a logical bug, hope it won't bother you: wrong codeIschium
I suppose this question is fine from a pure-curiosity perspective, but please understand that times have changed. On x86, the rsqrtss and rsqrtps instructions, introduced with the Pentium III in 1999, are faster and more accurate than this code. ARM NEON has vrsqrte which is similar. And whatever calculations Quake III used this for would probably be done on the GPU these days anyway.Phillie
The term you're looking for is "type pun" float to integer.Calida

I don't know how to encode the float number using integer format.

There is a function for that: f32::to_bits, which returns a u32. There is also a function for the other direction: f32::from_bits, which takes a u32 as its argument. These functions are preferred over mem::transmute, as the latter is unsafe and tricky to use.

With that, here is the implementation of InvSqrt:

fn inv_sqrt(x: f32) -> f32 {
    let i = x.to_bits();
    let i = 0x5f3759df - (i >> 1);
    let y = f32::from_bits(i);

    y * (1.5 - 0.5 * x * y * y)
}

(Playground)
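As a quick sanity check (not part of the original answer), one can compare the approximation against 1.0 / x.sqrt(); the single Newton-Raphson step keeps the relative error below roughly 0.2%:

```rust
fn inv_sqrt(x: f32) -> f32 {
    let i = x.to_bits();
    let i = 0x5f3759df - (i >> 1);
    let y = f32::from_bits(i);

    y * (1.5 - 0.5 * x * y * y)
}

fn main() {
    for x in [0.25f32, 1.0, 2.0, 100.0] {
        let approx = inv_sqrt(x);
        let exact = 1.0 / x.sqrt();
        let rel_err = ((approx - exact) / exact).abs();
        // With the 0x5f3759df constant plus one Newton step,
        // the maximum relative error is about 0.175%.
        assert!(rel_err < 2e-3, "x = {x}: rel_err = {rel_err}");
    }
}
```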


This function compiles to the following assembly on x86-64:

.LCPI0_0:
        .long   3204448256        ; f32 -0.5
.LCPI0_1:
        .long   1069547520        ; f32  1.5
example::inv_sqrt:
        movd    eax, xmm0
        shr     eax                   ; i >> 1
        mov     ecx, 1597463007       ; 0x5f3759df
        sub     ecx, eax              ; 0x5f3759df - ...
        movd    xmm1, ecx
        mulss   xmm0, dword ptr [rip + .LCPI0_0]    ; x *= -0.5
        mulss   xmm0, xmm1                          ; x *= y
        mulss   xmm0, xmm1                          ; x *= y
        addss   xmm0, dword ptr [rip + .LCPI0_1]    ; x += 1.5
        mulss   xmm0, xmm1                          ; x *= y
        ret

I have not found any reference assembly (if you have, please tell me!), but it seems fairly good to me. I am just not sure why the float was moved into eax just to do the shift and integer subtraction. Maybe SSE registers do not support those operations?

clang 9.0 with -O3 compiles the C code to basically the same assembly. So that's a good sign.


It is worth pointing out that if you actually want to use this in practice: please don't. As benrg pointed out in the comments, modern x86 CPUs have a specialized instruction for this function which is faster and more accurate than this hack. Unfortunately, 1.0 / x.sqrt() does not seem to get optimized to that instruction. So if you really need the speed, the _mm_rsqrt_ps intrinsic is probably the way to go. This, however, does again require unsafe code. I won't go into much detail in this answer, as only a minority of programmers will actually need it.
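For completeness, here is a minimal sketch of the hardware route using the scalar rsqrtss instruction (inv_sqrt_hw is just a name chosen here; SSE is part of the x86-64 baseline, so the unsafe block only asserts that the intrinsics are used as intended):

```rust
#[cfg(target_arch = "x86_64")]
fn inv_sqrt_hw(x: f32) -> f32 {
    use std::arch::x86_64::{_mm_cvtss_f32, _mm_rsqrt_ss, _mm_set_ss};

    // rsqrtss guarantees a relative error of at most 1.5 * 2^-12,
    // already better than the bit hack above.
    unsafe { _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))) }
}
```

If more precision is needed, a Newton-Raphson step can be appended to refine the hardware estimate, just like in the bit-hack version.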

Swizzle answered 28/11, 2019 at 7:40 Comment(5)
According to the Intel Intrinsics Guide there is no integer shift operation that only shifts the lowest 32 bits of the 128-bit register, analogous to addss or mulss. But if the other 96 bits of xmm0 can be ignored then one could use the psrld instruction. Same goes for integer subtraction.Doll
I'll admit to knowing next to nothing about rust, but isn't "unsafe" basically a core property of fast_inv_sqrt? With its total disrespect for datatypes and such.Groggery
@Groggery It's a different type of "unsafe" we talk about though. A fast approximation which gets a bad value too far from the sweet spot, versus something playing fast and loose with undefined behavior.Swearword
@Gloweye: Mathematically, the last part of that fast_inv_sqrt is just one Newton-Raphson iteration step to find a better approximation of inv_sqrt. There's nothing unsafe about that part. The trickery is in the first part, which finds a good approximation. That works because it's doing an integer divide by 2 on the exponent part of the float, and indeed sqrt(pow(0.5,x))=pow(0.5,x/2)Himalayas
@fsasm: That's correct; movd to EAX and back is a missed optimization by current compilers. (And yes, calling conventions pass/return scalar float in the low element of an XMM and allow high bits to be garbage. But note that if it was zero-extended, it can easily stay that way: right shifting doesn't introduce non-zero elements and neither does subtraction from _mm_set_epi32(0,0,0,0x5f3759df), i.e. a movd load. You would need a movdqa xmm1,xmm0 to copy the reg before psrld. Bypass latency from FP instruction forwarding to integer and vice versa is hidden by mulss latency.Calida

This one is implemented with the lesser-known union in Rust:

union FI {
    f: f32,
    i: i32,
}

fn inv_sqrt(x: f32) -> f32 {
    let mut u = FI { f: x };
    unsafe {
        u.i = 0x5f3759df - (u.i >> 1);
        u.f * (1.5 - 0.5 * x * u.f * u.f)
    }
}

Did some micro benchmarks using the criterion crate on an x86-64 Linux box. Surprisingly, Rust's own sqrt().recip() is the fastest. But of course, any micro benchmark result should be taken with a grain of salt.

inv sqrt with transmute time:   [1.6605 ns 1.6638 ns 1.6679 ns]
inv sqrt with union     time:   [1.6543 ns 1.6583 ns 1.6633 ns]
inv sqrt with to and from bits
                        time:   [1.7659 ns 1.7677 ns 1.7697 ns]
inv sqrt with powf      time:   [7.1037 ns 7.1125 ns 7.1223 ns]
inv sqrt with sqrt then recip
                        time:   [1.5466 ns 1.5488 ns 1.5513 ns]
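For reference, the winning "sqrt then recip" variant from the table above is nothing more than the standard library (inv_sqrt_std is just a name chosen here):

```rust
fn inv_sqrt_std(x: f32) -> f32 {
    // With optimizations on x86-64 this typically compiles to
    // sqrtss followed by divss -- no bit tricks, no unsafe.
    x.sqrt().recip()
}

fn main() {
    println!("{}", inv_sqrt_std(4.0));
}
```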
Dissipate answered 28/11, 2019 at 5:23 Comment(13)
I am not in the least surprised sqrt().recip() is fastest. Both sqrt and the reciprocal are single instructions these days, and go pretty fast. Quake was written in the days when it wasn't safe to assume there was hardware floating point at all, and transcendental functions like sqrt would definitely have been software. +1 for the benchmarks.Quechua
What surprises me is that transmute is apparently different from to_ and from_bits -- I'd expect those to be instruction-equivalent even before optimization.Squander
@MartinBonnersupportsMonica Just to note, that's been true for x86 for a long time, but on ARM we do need to check versions. The Cortex-M4 only has full hardware support for single-precision floats, where the Cortex-M7 does single- and double-precision. So a modern tablet should be fine, but little embedded things may not be. These kind of tricks are nice to remember for those of us who still need to count RAM in bytes. :)Allegra
@Allegra An excellent point! (The benchmark will almost certainly have been done on a platform with hardware FP, but there are probably more installed processors out there that don't do FP than that do.) It's been a while since I had to care about space - and then I was counting in (16-bit) words.Quechua
@Graham: this particular implementation is restricted to single-precision floats too, so even on an M4 it has limited value.Himalayas
@Himalayas unless the range of double precision floats is required, this is an approximation algorithm so that I figured single precision probably makes more sense.Dissipate
@MartinBonner All x86 FPUs since the original 8087 have supported a hardware fsqrt, as well as transcendental functions like fsin, etc. However they were microcoded and much slower than InvSqrt would have been. Today you should use rsqrtss or rsqrtps if you need a fast approximate inverse square root.Phillie
@MartinBonner (Also, not that it matters, but sqrt isn't a transcendental function.)Phillie
@MartinBonner: Any hardware FPU that supports division will normally also support sqrt. IEEE "basic" operations (+ - * / sqrt) are required to produce a correctly-rounded result; that's why SSE provides all of those operations but not exp, sin, or whatever. In fact, divide and sqrt typically run on the same execution unit, designed a similar way. See HW div/sqrt unit details. Anyway, they're still not fast compared to multiply, especially in latency.Calida
@benrg: fsqrt / sqrtps are not microcoded; they're IEEE Basic operations directly supported by HW as a single uop, like rsqrtps. See my previous comment.Calida
@Dissipate was your microbenchmark measuring latency or throughput? Modern div/sqrt units are somewhat pipelined but not fully. (e.g. Skylake ~12 cycle latency, one per 3 cycle throughput for independent float inputs, or somewhat worse for double). But multiply is more heavily pipelined, e.g. 4 cycle latency and 0.5 cycle reciprocal throughput (so 8 in flight simultaneously). Your times of 1.6ns are only ~6 clock cycles at 4GHz which is unreasonably short; you're probably measuring throughput of sqrt+div averaged over a repeat loop. Latency for HW sqrt + div is maybe better too.Calida
Anyway, Skylake has significantly better pipelining for div/sqrt than previous uarches. See Floating point division vs floating point multiplication for some extracts from Agner Fog's table. If you aren't doing much other work in a loop so sqrt + div is a bottleneck, you might want to use HW fast reciprocal sqrt (instead of the quake hack) + a Newton iteration. Especially with FMA that's good for throughput, if not latency. Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precisionCalida
@PeterCordes it is throughput indeed. I literally wrapped the function call in a closure and fed it to an iterator as dictated by criterion API, which in turn took care of warmup, measurement and statistical analysis. This is the Mandalorian^H^H^H^H criterion way.Dissipate

You may use std::mem::transmute to make needed conversion:

fn inv_sqrt(x: f32) -> f32 {
    let xhalf = 0.5f32 * x;
    let mut i: i32 = unsafe { std::mem::transmute(x) };
    i = 0x5f3759df - (i >> 1);
    let mut res: f32 = unsafe { std::mem::transmute(i) };
    res = res * (1.5f32 - xhalf * res * res);
    res
}

You can look at a live example here.

Pren answered 28/11, 2019 at 5:5 Comment(2)
There's nothing wrong with unsafe, but there's a way to do this without explicit unsafe block, so I'd suggest to rewrite this answer using f32::to_bits and f32::from_bits. It also carries the intent clearly unlike transmute, which most people probably look at as "magic".Grigri
@Sahsahae I just posted an answer using the two functions you mentioned :) And I agree, unsafe should be avoided here, as it's not necessary.Swizzle

© 2022 - 2024 — McMap. All rights reserved.