How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?

Asked 12/2, 2019 at 4:7 Answered 12/2, 2019 at 4:29

Just curiosity about the standard sqrt() from math.h on GCC works. I coded my own sqrt() using Newton-Raphson to do it!

Proem answered 12/2, 2019 at 4:7 Comment(8)

Why don't you look at the implementation? GCC is open source. Note that on many architectures these days, the sqrt might be replaced with an explicit CPU instruction that performs the calculation. – Parasympathetic 12/2, 2019 at 4:12

I only see the description of the function on math.h – Proem 12/2, 2019 at 4:14

yeah, I know fsqrt. But how the CPU does it? I can't debug hardware – Proem 12/2, 2019 at 4:14

If the square root is calculated at compile time, I believe it's done using the mpfr library. At run time, it's going to depend on your processor and libc implementation. Might be an instruction, might be a function call... – Gault 12/2, 2019 at 4:17

I think regardless, these libraries are maintained by some of the smartist people that always look for the most optimized solutions. So if the question really being if their implementation is the fastest, answer would be yes... – Bloke 12/2, 2019 at 4:22

The implementation is not in math.h -- that's the interface. Look at the source code for GCC. Regarding your question about sqrt on hardware, that's achieved through microinstructions specific to the hardware. Contact your hardware manufacturer to obtain datasheets if you need to know the algorithm used. – Parasympathetic 12/2, 2019 at 4:23

@paddy, I don't understand that. I'm almost sure the interface has software below, it. NOW, the way your hardware actually achieves those instructions can vary... – Bloke 12/2, 2019 at 4:26

@OmidCompSCI Yes, it does have a software implementation, but because it's a standard library function the compiler can choose to replace it with a similar hardware instruction. It may even optimize loops to perform multiple calculations in parallel through vectorized instructions (e.g. SIMD) – Parasympathetic 12/2, 2019 at 4:30

yeah, I know fsqrt. But how the CPU does it? I can't debug hardware

Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.

Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.

Maybe also see

The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers was replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm). Also https://specbranch.com/posts/faster-div8/ has some guesswork about how integer div might left-justify the bits and use the same hardware as RCPSS to get an initial guess, and so on.
https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations

But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)

See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication

For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.

Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).

Miskolc answered 12/2, 2019 at 4:29 Comment(0)

sqrt is defined by C, so most likely you have to look in glibc.

You did not specify which architecture you are asking for, so I think it's safe to assume x86-64. If that's the case, they are defined in:

tl;dr they are simply implemented by calling the x86-64 square root instructions sqrts{sd}:

Furthermore, and just for the sake of discussion, if you enable fast-math (something you probably should not do if you care about result precision), you will see that most compilers will actually inline the call and directly emit the sqrts{sd} instructions:

https://godbolt.org/z/Wb4unC

Parse answered 12/2, 2019 at 4:26 Comment(2)

Compilers can fully inline sqrt as long as you use -fno-math-errno. Otherwise GCC inlines it plus a compare&branch to call the library version for NaN inputs, because of the stupid requirement left over from an early C standard that math functions set errno on FP errors like negative inputs to sqrt. (Basically on FP invalid exceptions). – Miskolc 12/2, 2019 at 4:37

With the full -ffast-math, compilers sometimes used to use rsqrtps + a Newton iteration to approximate sqrt(x) as x * approx_recip_sqrt(x), although that requires a fixup for non-zero x. And it's not worth doing on modern x86, especially for scalar. Maybe for vector if it's the only thing you're doing in a loop, otherwise still not. – Miskolc 12/2, 2019 at 4:37

Recommended topics

Hot tags