Fast hardware integer division

The hardware instruction for integer division has historically been very slow. For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and a reciprocal throughput of 24-90) for 64-bit inputs.

There are newer processors, however, which perform much better: Goldmont has 14-43 cycle latency and Ryzen has 14-47 cycle latency [1], the M1 apparently has "throughput of 2 clock cycles per divide" [2], and even the Raspberry Pi Pico has an "8-cycle signed/unsigned divide/modulo circuit, per core" (though that seems to be for 32-bit inputs) [3].

My question is: what has changed? Was a new algorithm invented? What algorithms do the new processors employ for division, anyway?

[1] https://www.agner.org/optimize/#manuals
[2] https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html
[3] https://raspberrypi.github.io/pico-sdk-doxygen/group__hardware__divider.html#details
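To make the latency vs. reciprocal-throughput distinction above concrete, here is a rough sketch of my own (not a rigorous benchmark): a chain of dependent divisions is limited by the divider's latency, while independent divisions can overlap up to its throughput. The divisor is kept in a volatile so the compiler can't replace the division with a multiply.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 10000000              /* divisions per measurement */

int main(void) {
    volatile uint64_t d = 3;    /* volatile: stop the compiler turning /d into a multiply */
    uint64_t x = UINT64_MAX;
    uint64_t a = UINT64_MAX, b = UINT64_MAX - 1, c = UINT64_MAX - 2, e = UINT64_MAX - 3;

    /* Dependent chain: each divide needs the previous result -> bound by latency. */
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        x = x / d + 0x8000000000000000ull;
    clock_t t1 = clock();

    /* Four independent chains: divides can overlap -> bound by reciprocal throughput. */
    clock_t t2 = clock();
    for (int i = 0; i < N / 4; i++) {
        a = a / d + 0x8000000000000000ull;
        b = b / d + 0x8000000000000000ull;
        c = c / d + 0x8000000000000000ull;
        e = e / d + 0x8000000000000000ull;
    }
    clock_t t3 = clock();

    printf("dependent:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("independent: %.3f s\n", (double)(t3 - t2) / CLOCKS_PER_SEC);
    return (int)(x ^ a ^ b ^ c ^ e);    /* keep results live so nothing is optimized out */
}
```

On an older core where the divider is barely pipelined the two loops should run at similar speed; on newer cores the independent loop should pull clearly ahead.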

Orgy answered 27/11, 2021 at 7:40 Comment(1)
I think what happened is that M1 happened. Just by using libdivide you could get many times better performance than the old Intel divq. Yet that stopped being true on M1. I reported some very strange bugs in libdivide's 128-bit stuff; after the fix it again became faster than M1 (LOL). Then Intel released Xeons on Ice Lake (8 gen) that are 4 times faster than anything libdivide could come up with (and that is not even merged into libdivide yet). There is also an algorithm that GMP, as part of gcc, uses that is even faster. Just by integrating that algorithm at the software level in Minix OS and in the ucode of Bigcore...Furniture

On Intel before Ice Lake, 64-bit operand-size is an outlier, much slower than 32-bit operand size for integer division. div r32 is 10 uops, with 26 cycle worst-case latency but 6 cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has detailed exploration.)
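To illustrate (my own sketch, not code from that question): doing the arithmetic in uint32_t when the values fit lets the compiler emit div r32 instead of div r64, which is essentially the whole difference in that trial-division Q&A (where long is 32-bit on Windows but 64-bit on Linux).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Trial-division primality test.  With uint32_t the compiler emits div r32;
   changing the types to uint64_t makes it emit div r64, which is microcoded
   and several times slower on pre-Ice Lake Intel even for identical values. */
static bool is_prime_u32(uint32_t n) {
    if (n < 2) return false;
    for (uint32_t i = 2; i <= n / i; i++)   /* i <= sqrt(n), without overflow */
        if (n % i == 0)
            return false;
    return true;
}

int main(void) {
    printf("%d\n", is_prime_u32(1000000007u));   /* prints 1: it is prime */
    return 0;
}
```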

There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
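A small sketch of that decomposition, using the C library purely to show where the work goes (this is not how the hardware computes it):

```c
#include <math.h>
#include <stdio.h>

/* a / b = (ma * 2^ea) / (mb * 2^eb) = (ma / mb) * 2^(ea - eb):
   the exponent part is a cheap integer subtraction; the real work is dividing
   the two mantissas, which is a ~53-bit integer-style divide. */
int main(void) {
    double a = 355.0, b = 113.0;
    int ea, eb;
    double ma = frexp(a, &ea);            /* a = ma * 2^ea, ma in [0.5, 1) */
    double mb = frexp(b, &eb);            /* b = mb * 2^eb */
    double q  = ldexp(ma / mb, ea - eb);  /* divide mantissas, subtract exponents */
    printf("%.17g\n%.17g\n", q, a / b);   /* prints the same value both ways here */
    return 0;
}
```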

The incremental changes are things like widening the radix to handle more bits with each step, and pipelining the refinement steps after the initial (table lookup?) value to improve throughput but not latency.
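As a rough software model of what "more bits per step" buys (a real divider uses SRT with redundant quotient digits and a selection table, so this is only my illustration of how the iteration count scales with the radix):

```c
#include <stdint.h>
#include <stdio.h>

/* Restoring division in radix 4: retire 2 quotient bits per step, so a
   32-bit divide takes 16 iterations instead of 32.  A hardware radix-16
   divider retires 4 bits per step and picks the quotient digit from a small
   table on the leading remainder/divisor bits instead of full compares. */
static uint32_t div_radix4(uint32_t n, uint32_t d, uint32_t *rem) {
    uint64_t r = 0;                      /* partial remainder, kept wide */
    uint32_t q = 0;
    for (int i = 30; i >= 0; i -= 2) {
        r = (r << 2) | ((n >> i) & 3);   /* bring down the next 2 dividend bits */
        uint32_t digit = 0;
        for (uint32_t m = 3; m >= 1; m--)        /* largest m with m*d <= r */
            if (r >= (uint64_t)m * d) { r -= (uint64_t)m * d; digit = m; break; }
        q = (q << 2) | digit;
    }
    *rem = (uint32_t)r;
    return q;
}

int main(void) {
    uint32_t rem, q = div_radix4(1000000007u, 97u, &rem);   /* d must be nonzero */
    printf("%u rem %u\n", q, rem);       /* 10309278 rem 41 */
    return 0;
}
```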


Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, or at least does it infrequently enough not to benefit very much from higher-throughput dividers with the same latency.

But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep / pext). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!

Also, with more and more software being written by people who don't know anything about performance, and more code avoiding compile-time constants in favour of being flexible (divisors as function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
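For context on why constant divisors matter: when the divisor is known at compile time, the compiler replaces the divide with a multiply by a precomputed fixed-point reciprocal plus a shift (libdivide does the same at run time for a divisor that is reused many times), so only genuinely runtime divisors ever reach the divide unit. A minimal sketch of my own for unsigned division by 10, using the usual magic constant 0xCCCCCCCD = ceil(2^35 / 10):

```c
#include <stdint.h>
#include <stdio.h>

/* x / 10 via multiply-high and shift: (x * 0xCCCCCCCD) >> 35 == x / 10 for
   every uint32_t x.  This is the strength reduction a constant divisor allows
   and a divisor arriving from a config option at run time does not. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

int main(void) {
    for (uint32_t x = 0; x < 1000000; x++)
        if (div10(x) != x / 10) { printf("mismatch at %u\n", x); return 1; }
    printf("ok\n");
    return 0;
}
```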

Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.

So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.

Christcrossrow answered 27/11, 2021 at 10:9 Comment(8)
I didn't know that integer divisions were that costly on Intel. 32-bit ARM doesn't have any div instruction and the software routine takes 23 cycles for 32-bit (plus the function call overhead). I thought the claim "ARM doesn't need a div instruction" was a bad excuse, but it was more than true.Taveda
@Jake'Alquimista'LEE: Some light-weight ARM CPUs don't have a div instruction, but Cortex-A cores have sdiv and udiv. (And a multiply-subtract instruction to get a remainder from it) e.g. godbolt.org/z/hbG81zj8Y. (Having a div that's only a few uops allows OoO exec around it. That's one reason it's important that Intel didn't microcode FP division the way they did for integer, although even integer div's front-end cost on Skylake is not too bad at 10 uops compared to the latency and throughput of the execution unit.)Christcrossrow
Thank you for the answer, very interesting and informative (as always)! But I'm not quite convinced it explains it. You definitely know better than me, but is it possible for "incremental changes" to bring a 3x speed-up? Is M1 10x faster than a Cascade Lake Xeon only by incremental changes? And the Pico does not even have an FP unit but still divides in 8 cycles. And I would've assumed that incremental changes would be noticeable across various microarchitectures, but Cannon Lake was suddenly much faster. PS: And by that "extended-precision microcode" you mean Intel's 80-bit math (as in long double)?Orgy
Also, on uops.info I noticed something curious: some of the newer architectures don't have variable latency for DIV. Any idea why that might be? Wouldn't that alone imply a more radical change to the divider?Orgy
@Jake'Alquimista'LEE Could you please point me to that "23 cycles" software-emulated division?Orgy
@EcirHana: By "extended precision" I mean like how you'd do 64-bit or 128-bit division in 32-bit mode, using two or more div instructions. Like BigInt techniques, that's why it's so many more uops. That's why div r64 is so different from div r32 on Skylake / Cascade Lake, but not on AMD or on Ice Lake. That was a non-incremental change that vastly sped up 64-bit division. But div r32 has 6 cycle throughput on Skylake, so M1 is "only" 3x better throughput. (IDK the latency; I'd guess it's nowhere near 3x better since they probably just heavily pipelined a divider, or replicated it.)Christcrossrow
@EcirHana: newer dividers not having data-dependent performance: That might be due to the radix being large enough that it always gets full precision in few enough steps. Then it's no longer worth the scheduling difficulty of early-out conditions? If so, an incremental change just got past a breakpoint / heuristic at which point it was no longer worth doing some other complexity. Also likely related to wanting to pipeline the divider more for throughput; pipelining is also easier with fixed latency.Christcrossrow
@EcirHana That's the default division routine that the compilers use.Taveda
