I don't know how to encode the float number using integer format.
There is a function for that: `f32::to_bits`, which returns a `u32`. There is also a function for the other direction: `f32::from_bits`, which takes a `u32` as its argument. These functions are preferred over `mem::transmute`, as the latter is `unsafe` and tricky to use.
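As a quick illustration of the round trip, here is a minimal sketch that reinterprets a float as its bit pattern and back. The input `1.0` is just an example value.

```rust
fn main() {
    let x: f32 = 1.0;
    // Reinterpret the IEEE 754 bit pattern as an integer (no numeric conversion happens).
    let bits: u32 = x.to_bits();
    println!("{:#010x}", bits); // prints 0x3f800000
    // Converting the bits back yields the original float.
    let y = f32::from_bits(bits);
    assert_eq!(x, y);
}
```

Note that this is different from `x as u32`, which converts the numeric value (giving `1`) rather than exposing the bits.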
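With that, here is the implementation of `InvSqrt`:

    fn inv_sqrt(x: f32) -> f32 {
        // Reinterpret the float's bits as an integer.
        let i = x.to_bits();
        // Magic-constant initial guess. wrapping_sub avoids an overflow
        // panic in debug builds for some negative inputs.
        let i = 0x5f3759df_u32.wrapping_sub(i >> 1);
        let y = f32::from_bits(i);
        // One Newton-Raphson refinement step.
        y * (1.5 - 0.5 * x * y * y)
    }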
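To get a feel for how good the approximation is, the following sketch compares it against `1.0 / x.sqrt()` for a few sample inputs. The function is repeated so the example is self-contained (written with `wrapping_sub`, my assumption for avoiding overflow panics in debug builds), and `relative_error` is a hypothetical helper introduced only for this illustration.

```rust
fn inv_sqrt(x: f32) -> f32 {
    let i = 0x5f3759df_u32.wrapping_sub(x.to_bits() >> 1);
    let y = f32::from_bits(i);
    y * (1.5 - 0.5 * x * y * y)
}

/// Hypothetical helper: relative error of the approximation versus the exact value.
fn relative_error(x: f32) -> f32 {
    let exact = 1.0 / x.sqrt();
    ((inv_sqrt(x) - exact) / exact).abs()
}

fn main() {
    for &x in &[0.25_f32, 1.0, 2.0, 4.0, 100.0] {
        println!(
            "x = {}: approx = {}, exact = {}, rel. error = {:e}",
            x,
            inv_sqrt(x),
            1.0 / x.sqrt(),
            relative_error(x)
        );
    }
}
```

With a single Newton step the result should be within a fraction of a percent of the exact value for positive, normal inputs like these; zero, negative, and NaN inputs are not meaningful here.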
This function compiles to the following assembly on x86-64:

    .LCPI0_0:
            .long   3204448256        ; f32 -0.5
    .LCPI0_1:
            .long   1069547520        ; f32 1.5
    example::inv_sqrt:
            movd    eax, xmm0
            shr     eax               ; i >> 1
            mov     ecx, 1597463007   ; 0x5f3759df
            sub     ecx, eax          ; 0x5f3759df - ...
            movd    xmm1, ecx
            mulss   xmm0, dword ptr [rip + .LCPI0_0]   ; x *= -0.5
            mulss   xmm0, xmm1        ; x *= y
            mulss   xmm0, xmm1        ; x *= y
            addss   xmm0, dword ptr [rip + .LCPI0_1]   ; x += 1.5
            mulss   xmm0, xmm1        ; x *= y
            ret
I have not found any reference assembly (if you have, please tell me!), but it seems fairly good to me. I am just not sure why the float was moved into `eax` just to do the shift and integer subtraction. Maybe SSE registers do not support those operations?
clang 9.0 with `-O3` compiles the C code to basically the same assembly. So that's a good sign.
It is worth pointing out that if you actually want to use this in practice: please don't. As benrg pointed out in the comments, modern x86 CPUs have a specialized instruction for this function which is faster and more accurate than this hack. Unfortunately, `1.0 / x.sqrt()` does not seem to optimize to that instruction. So if you really need the speed, using the `_mm_rsqrt_ps` intrinsics is probably the way to go. This, however, does again require `unsafe` code. I won't go into much detail in this answer, as only a minority of programmers will actually need it.
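For the curious, here is a rough sketch of what the intrinsic route could look like. It uses the scalar `_mm_rsqrt_ss` variant rather than the packed `_mm_rsqrt_ps`, since only one value is needed. The wrapper name `rsqrt_hw` and the non-x86-64 fallback are my own assumptions for illustration, not part of any library.

```rust
// Hypothetical wrapper around the hardware reciprocal-square-root estimate.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // newer compilers may consider these intrinsic calls safe
fn rsqrt_hw(x: f32) -> f32 {
    use std::arch::x86_64::{_mm_cvtss_f32, _mm_rsqrt_ss, _mm_set_ss};
    // SSE is part of the x86-64 baseline, so no runtime feature check is needed here.
    unsafe {
        let v = _mm_set_ss(x); // place x in the lowest lane of a vector
        let r = _mm_rsqrt_ss(v); // approximate 1/sqrt of the lowest lane
        _mm_cvtss_f32(r) // extract the lowest lane as f32
    }
}

// Assumed fallback so the sketch also compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn rsqrt_hw(x: f32) -> f32 {
    1.0 / x.sqrt()
}

fn main() {
    let x = 4.0_f32;
    println!("rsqrt_hw({}) = {} (exact: {})", x, rsqrt_hw(x), 1.0 / x.sqrt());
}
```

The hardware result is an estimate, not an exactly rounded value, so if you need full `f32` precision you would still follow it with a Newton step or simply use `1.0 / x.sqrt()`.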
… `union`. – Squander

… `union` works either. `memcpy` definitely works, though it's verbose. – Metaphysic

… `rsqrtss` and `rsqrtps` instructions, introduced with the Pentium III in 1999, are faster and more accurate than this code. ARM NEON has `vrsqrte`, which is similar. And whatever calculations Quake III used this for would probably be done on the GPU these days anyway. – Phillie