On Intel before Ice Lake, 64-bit operand-size is an outlier, much slower than 32-bit operand size for integer division. div r32
is 10 uops, with 26 cycle worst-case latency but 6 cycle throughput. (https://uops.info/ and https://agner.org/optimize/, and Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux has detailed exploration.)
There wasn't a fundamental change in how divide units are built, just widening the HW divider to not need extended-precision microcode. (Intel has had fast-ish dividers for FP for much longer, and that's basically the same problem just with only 53 bits instead of 64. The hard part of FP division is integer division of the mantissas; subtracting the exponents is easy and done in parallel.)
The incremental changes are things like widening the radix to handle more bits with each step. And for example pipelining the refinement steps after the initial (table lookup?) value, to improve throughput but not latency.
Related:
Divide units historically were often not pipelined at all, as that's hard because it requires replicating a lot of gates instead of iterating on the same multipliers, I think. And most software usually avoids (or avoided) integer division because it was historically very expensive, at least does it infrequently enough to not benefit very much from higher-throughput dividers with the same latency.
But with wider CPU pipelines with higher IPC shrinking the cycle gap between divisions, it's more worth doing. Also with huge transistor budgets, spending a bunch on something that will sit idle for a lot of the time in most programs still makes sense if it's very helpful for a few programs. (Like wider SIMD, and specialized execution units like x86 BMI2 pdep
/ pext
). Dark silicon is necessary or chips would melt; power density is a huge concern, see Modern Microprocessors: A 90-Minute Guide!
Also, more and more software being written by people who don't know anything about performance, and more code avoiding compile-time constants in favour of being flexible (function args that ultimately come from some config option), I'd guess modern software doesn't avoid division as much as older programs did.
Floating-point division is often harder to avoid than integer, so it's definitely worth having fast FP dividers. And integer can borrow the mantissa divider from the low SIMD element, if there isn't a dedicated integer-divide unit.
So that FP motivation was likely the actual driving force behind Intel's improvements to divide throughput and latency even though they left 64-bit integer division with garbage performance until Ice Lake.