Since the beginning of CPUs it has been general knowledge that the integer division instruction is expensive. I went to see how bad it is today, on CPUs which have the luxury of billions of transistors. I found that the hardware idiv
instruction still performs significantly worse for constant divisors than the code the JIT compiler is able to emit, which doesn't contain the idiv
instruction.
To bring this out in a dedicated microbenchmark I've written the following:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(MeasureDiv.ARRAY_SIZE)
@Warmup(iterations = 8, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class MeasureDiv
{
public static final int ARRAY_SIZE = 128;
public static final long DIVIDEND_BASE = 239520948509234807L;
static final int DIVISOR = 10;
final long[] input = new long[ARRAY_SIZE];
@Setup(Level.Iteration) public void setup() {
for (int i = 0; i < input.length; i++) {
input[i] = DIVISOR;
}
}
@Benchmark public long divVar() {
long sum = 0;
for (int i = 0; i < ARRAY_SIZE; i++) {
final long in = input[i];
final long dividend = DIVIDEND_BASE + i;
final long divisor = in;
final long quotient = dividend / divisor;
sum += quotient;
}
return sum;
}
@Benchmark public long divConst() {
long sum = 0;
for (int i = 0; i < ARRAY_SIZE; i++) {
final long in = input[i];
final long dividend = DIVIDEND_BASE + in;
final int divisor = DIVISOR;
final long quotient = dividend / divisor;
sum += quotient;
}
return sum;
}
}
In a nutshell, I have two methods identical in every respect except that one (divVar
) performs division by a number read off an array while the other divides by a compile-time constant. These are the results:
Benchmark Mode Cnt Score Error Units
MeasureDiv.divConst avgt 5 1.228 ± 0.032 ns/op
MeasureDiv.divVar avgt 5 8.913 ± 0.192 ns/op
The performance ratio is quite extraordinary. My expectation would be that a modern Intel processor has enough real estate, and its engineers have enough interest, to implement a complex but performant division algorithm in hardware. Yet the JIT compiler beats Intel by sending it a stream of some other instructions which perform the same job, just seven times faster. If anything, dedicated microcode should be able to utilize the CPU even better than what JIT can do through the public API of assembly instructions.
How come idiv
is still so much slower, what is the fundamental limitation?
One explanation that comes to mind is the hypothetical existence of a division algorithm which involves the dividend for the first time very late into the process. The JIT compiler would then have a head start because it would evaluate the first part which involves only the divisor at compile time and emit just the second part of the algorithm as runtime code. Is that hypothesis true?
add
,sub
,sar
, andshl
. – Pliabledivisor
is along
, and in the other case it's anint
(just mentioning it - it may or may not be relevant...) – Aquacadesar
internally? I understand why a CPU made of 10,000 transistors wouldn't have many options, but why are today's CPUs still quite bad at it? Is it fundamental? – Pliable0x6666666667
(I don't remember the exact number of 6's) and assumed that some modulo-arithmetic property of multiplication must be in action. What I am unsure of is how fundamentally slow this inital step is. For example, the hardware can parallelize any codepaths without data dependencies. – Pliable