Fast hardware integer division

The hardware instruction for integer division has historically been very slow. For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and a reciprocal throughput of 24-90) for 64-bit inputs.

There are newer processors, however, which perform much better: Goldmont has 14-43 cycle latency and Ryzen has 14-47 cycle latency [1], the M1 apparently has "throughput of 2 clock cycles per divide" [2], and even the Raspberry Pi Pico has an "8-cycle signed/unsigned divide/modulo circuit, per core" (though that seems to be for 32-bit inputs) [3].

My question is: what has changed? Was a new algorithm invented? What algorithms do the new processors employ for division, anyway?

[1] https://www.agner.org/optimize/#manuals
[2] https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
[3] https://raspberrypi.github.io/pico-sdk-doxygen/group__hardware__divider.html#details
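To make the latency vs. reciprocal-throughput distinction above concrete, here is a rough sketch of my own (not a rigorous benchmark): a chain of dependent divisions is limited by the divider's latency, while independent divisions can overlap up to its throughput. The divisor is kept in a volatile so the compiler can't replace the division with a multiply.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 10000000              /* divisions per measurement */

int main(void) {
    volatile uint64_t d = 3;    /* volatile: stop the compiler turning /d into a multiply */
    uint64_t x = UINT64_MAX;
    uint64_t a = UINT64_MAX, b = UINT64_MAX - 1, c = UINT64_MAX - 2, e = UINT64_MAX - 3;

    /* Dependent chain: each divide needs the previous result -> bound by latency. */
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        x = x / d + 0x8000000000000000ull;
    clock_t t1 = clock();

    /* Four independent chains: divides can overlap -> bound by reciprocal throughput. */
    clock_t t2 = clock();
    for (int i = 0; i < N / 4; i++) {
        a = a / d + 0x8000000000000000ull;
        b = b / d + 0x8000000000000000ull;
        c = c / d + 0x8000000000000000ull;
        e = e / d + 0x8000000000000000ull;
    }
    clock_t t3 = clock();

    printf("dependent:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("independent: %.3f s\n", (double)(t3 - t2) / CLOCKS_PER_SEC);
    return (int)(x ^ a ^ b ^ c ^ e);    /* keep results live so nothing is optimized out */
}
```

On an older core where the divider is barely pipelined the two loops should run at similar speed; on newer cores the independent loop should pull clearly ahead.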

Orgy answered 27/11, 2021 at 7:40 Comment(1)
I think what happened is that M1 happened. Just by using libdivide you could get many times better performance than the old Intel divq. Yet that stopped being true on M1. I reported some very strange bugs in libdivide's 128-bit stuff; after the fix it again became faster than M1 (LOL). Then Intel released Xeons on Ice Lake (8 gen) that are 4 times faster than anything libdivide could come up with (and that is not even merged into libdivide yet). There is also an algorithm that GMP, as part of gcc, uses that is even faster. Just by integrating that algorithm at the software level in Minix OS and in the ucode of Bigcore...Furniture

On Intel before Ice Lake, 64-bit operand-size is an outlier, much slower than 32-bit operand size for integer division. div r32 is 10 uops, with 26 cycle worst-case latency but 6 cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has detailed exploration.)
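To illustrate (my own sketch, not code from that question): doing the arithmetic in uint32_t when the values fit lets the compiler emit div r32 instead of div r64, which is essentially the whole difference in that trial-division Q&A (where long is 32-bit on Windows but 64-bit on Linux).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Trial-division primality test.  With uint32_t the compiler emits div r32;
   changing the types to uint64_t makes it emit div r64, which is microcoded
   and several times slower on pre-Ice Lake Intel even for identical values. */
static bool is_prime_u32(uint32_t n) {
    if (n < 2) return false;
    for (uint32_t i = 2; i <= n / i; i++)   /* i <= sqrt(n), without overflow */
        if (n % i == 0)
            return false;
    return true;
}

int main(void) {
    printf("%d\n", is_prime_u32(1000000007u));   /* prints 1: it is prime */
    return 0;
}
```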

There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
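A small sketch of that decomposition, using the C library purely to show where the work goes (this is not how the hardware computes it):

```c
#include <math.h>
#include <stdio.h>

/* a / b = (ma * 2^ea) / (mb * 2^eb) = (ma / mb) * 2^(ea - eb):
   the exponent part is a cheap integer subtraction; the real work is dividing
   the two mantissas, which is a ~53-bit integer-style divide. */
int main(void) {
    double a = 355.0, b = 113.0;
    int ea, eb;
    double ma = frexp(a, &ea);            /* a = ma * 2^ea, ma in [0.5, 1) */
    double mb = frexp(b, &eb);            /* b = mb * 2^eb */
    double q  = ldexp(ma / mb, ea - eb);  /* divide mantissas, subtract exponents */
    printf("%.17g\n%.17g\n", q, a / b);   /* prints the same value both ways here */
    return 0;
}
```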

The incremental changes are things like widening the radix to handle more bits with each step, and pipelining the refinement steps after the initial (table lookup?) value to improve throughput but not latency.
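As a rough software model of what "more bits per step" buys (a real divider uses SRT with redundant quotient digits and a selection table, so this is only my illustration of how the iteration count scales with the radix):

```c
#include <stdint.h>
#include <stdio.h>

/* Restoring division in radix 4: retire 2 quotient bits per step, so a
   32-bit divide takes 16 iterations instead of 32.  A hardware radix-16
   divider retires 4 bits per step and picks the quotient digit from a small
   table on the leading remainder/divisor bits instead of full compares. */
static uint32_t div_radix4(uint32_t n, uint32_t d, uint32_t *rem) {
    uint64_t r = 0;                      /* partial remainder, kept wide */
    uint32_t q = 0;
    for (int i = 30; i >= 0; i -= 2) {
        r = (r << 2) | ((n >> i) & 3);   /* bring down the next 2 dividend bits */
        uint32_t digit = 0;
        for (uint32_t m = 3; m >= 1; m--)        /* largest m with m*d <= r */
            if (r >= (uint64_t)m * d) { r -= (uint64_t)m * d; digit = m; break; }
        q = (q << 2) | digit;
    }
    *rem = (uint32_t)r;
    return q;
}

int main(void) {
    uint32_t rem, q = div_radix4(1000000007u, 97u, &rem);   /* d must be nonzero */
    printf("%u rem %u\n", q, rem);       /* 10309278 rem 41 */
    return 0;
}
```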


Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, or at least does it infrequently enough not to benefit very much from higher-throughput dividers with the same latency.

But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep / pext). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!

Also, with more and more software being written by people who don't know anything about performance, and more code avoiding compile-time constants in favour of being flexible (divisors as function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
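For context on why constant divisors matter: when the divisor is known at compile time, the compiler replaces the divide with a multiply by a precomputed fixed-point reciprocal plus a shift (libdivide does the same at run time for a divisor that is reused many times), so only genuinely runtime divisors ever reach the divide unit. A minimal sketch of my own for unsigned division by 10, using the usual magic constant 0xCCCCCCCD = ceil(2^35 / 10):

```c
#include <stdint.h>
#include <stdio.h>

/* x / 10 via multiply-high and shift: (x * 0xCCCCCCCD) >> 35 == x / 10 for
   every uint32_t x.  This is the strength reduction a constant divisor allows
   and a divisor arriving from a config option at run time does not. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

int main(void) {
    for (uint32_t x = 0; x < 1000000; x++)
        if (div10(x) != x / 10) { printf("mismatch at %u\n", x); return 1; }
    printf("ok\n");
    return 0;
}
```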

Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.

So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.

Christcrossrow answered 27/11, 2021 at 10:9 Comment(8)
I didn't know that integer divisions were that costly on Intel. 32-bit ARM doesn't have any div instruction and the software routine takes 23 cycles for 32-bit (plus the function call overhead). I thought the claim "ARM doesn't need a div instruction" was a bad excuse, but it was more than true.Taveda
@Jake'Alquimista'LEE: Some light-weight ARM CPUs don't have a div instruction, but Cortex-A cores have sdiv and udiv. (And a multiply-subtract instruction to get a remainder from it) e.g. godbolt.org/z/hbG81zj8Y. (Having a div that's only a few uops allows OoO exec around it. That's one reason it's important that Intel didn't microcode FP division the way they did for integer, although even integer div's front-end cost on Skylake is not too bad at 10 uops compared to the latency and throughput of the execution unit.)Christcrossrow
Thank you for the answer, very interesting and informative (as always)! But I'm not quite convinced it explains it. You definitely know better than me, but is it possible for "incremental changes" to bring a 3x speed-up? Is M1 10x faster than a Cascade Lake Xeon only by incremental changes? And the Pico does not even have an FP unit but still divides in 8 cycles. And I would've assumed that incremental changes would be noticeable across various microarchitectures, but Cannon Lake was suddenly much faster. PS: And by that "extended-precision microcode" you mean Intel's 80-bit math (as in long double)?Orgy
Also, on uops.info I noticed something curious: some of the newer architectures don't have variable latency for DIV. Any idea why that might be? Wouldn't that alone imply a more radical change to the divider?Orgy
@Jake'Alquimista'LEE Could you please point me to that "23 cycles" software-emulated division?Orgy
@EcirHana: By "extended precision" I mean like how you'd do 64-bit or 128-bit division in 32-bit mode, using two or more div instructions. Like BigInt techniques, that's why it's so many more uops. That's why div r64 is so different from div r32 on Skylake / Cascade Lake, but not on AMD or on Ice Lake. That was a non-incremental change that vastly sped up 64-bit division. But div r32 has 6 cycle throughput on Skylake, so M1 is "only" 3x better throughput. (IDK the latency; I'd guess it's nowhere near 3x better since they probably just heavily pipelined a divider, or replicated it.)Christcrossrow
@EcirHana: newer dividers not having data-dependent performance: That might be due to the radix being large enough that it always gets full precision in few enough steps. Then it's no longer worth the scheduling difficulty of early-out conditions? If so, an incremental change just got past a breakpoint / heuristic at which point it was no longer worth doing some other complexity. Also likely related to wanting to pipeline the divider more for throughput; pipelining is also easier with fixed latency.Christcrossrow
@EcirHana That's the default division routine that the compilers use.Taveda
