Large performance gap between CPU's div instruction and HotSpot's JIT code
Since the beginning of CPUs it has been common knowledge that the integer division instruction is expensive. I wanted to see how bad it is today, on CPUs that have the luxury of billions of transistors. I found that the hardware idiv instruction still performs significantly worse, for constant divisors, than the code the JIT compiler is able to emit, which contains no idiv instruction at all.

To bring this out in a dedicated microbenchmark, I've written the following:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(MeasureDiv.ARRAY_SIZE)
@Warmup(iterations = 8, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class MeasureDiv
{
  public static final int ARRAY_SIZE = 128;
  public static final long DIVIDEND_BASE = 239520948509234807L;
  static final int DIVISOR = 10;
  final long[] input = new long[ARRAY_SIZE];

  // Fill the array with the divisor so both benchmarks read identical data.
  @Setup(Level.Iteration) public void setup() {
    for (int i = 0; i < input.length; i++) {
      input[i] = DIVISOR;
    }
  }

  // The divisor is loaded from memory, so the JIT must emit a real division.
  @Benchmark public long divVar() {
    long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
      final long in = input[i];
      final long dividend = DIVIDEND_BASE + i;
      final long divisor = in;
      final long quotient = dividend / divisor;
      sum += quotient;
    }
    return sum;
  }

  // The divisor is a compile-time constant, so the JIT can strength-reduce it.
  @Benchmark public long divConst() {
    long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
      final long in = input[i];
      final long dividend = DIVIDEND_BASE + in;
      final int divisor = DIVISOR;
      final long quotient = dividend / divisor;
      sum += quotient;
    }
    return sum;
  }
}
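
For completeness, one way to run it from a main method (a minimal sketch using the stock JMH runner API; the wrapper class name RunMeasureDiv is mine, everything else is standard JMH):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunMeasureDiv {
  public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
        .include(MeasureDiv.class.getSimpleName())
        .build();
    new Runner(opt).run();
  }
}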

In a nutshell, I have two methods that are identical in every respect, except that one (divVar) performs division by a number read from an array while the other (divConst) divides by a compile-time constant. These are the results:

Benchmark            Mode  Cnt  Score   Error  Units
MeasureDiv.divConst  avgt    5  1.228 ± 0.032  ns/op
MeasureDiv.divVar    avgt    5  8.913 ± 0.192  ns/op

The performance ratio is quite extraordinary. My expectation would be that a modern Intel processor has enough real estate, and its engineers have enough interest, to implement a complex but performant division algorithm in hardware. Yet the JIT compiler beats Intel by sending it a stream of other instructions which perform the same job, just seven times faster. If anything, dedicated microcode should be able to utilize the CPU even better than what the JIT can do through the public API of assembly instructions.

How come idiv is still so much slower? What is the fundamental limitation?

One explanation that comes to mind is the hypothetical existence of a division algorithm which involves the dividend for the first time very late in the process. The JIT compiler would then have a head start, because it could evaluate the first part, which involves only the divisor, at compile time, and emit just the second part of the algorithm as runtime code. Is that hypothesis true?

Pliable asked 12/7, 2015 at 21:02. Comments (12):
Have you taken a look at the code that the JIT compiler is emitting? Can you show it to us? (Rootless)
I have; it's quite a long stretch of cheap instructions like add, sub, sar, and shl. (Pliable)
This is a fairly standard optimization: most (all?) modern compilers will try to optimize constant divisions into a series of shifts, subs, adds, etc. This answer about C compilers has quite a bit more detail: https://mcmap.net/q/87335/-does-a-c-c-compiler-optimize-constant-divisions-by-power-of-two-value-into-shifts (Cultured)
Just for reference, the backlink: codereview.stackexchange.com/q/96516/41175. It might be interesting to have the JIT assembler code (and the JVM/target system; as discussed in the linked Q/A, the emitted assembler code depends on whether it's a 32/64-bit system). BTW: in one case the divisor is a long, and in the other case it's an int (just mentioning it; it may or may not be relevant...) (Aquacade)
@Cultured Yes, it's fairly standard; I'm asking why it's possible. Why doesn't the CPU turn a divide-by-8 into a cheap sar internally? I understand why a CPU made of 10,000 transistors wouldn't have many options, but why are today's CPUs still quite bad at it? Is it fundamental? (Pliable)
BTW @Marco13, this is tangential to the earlier question, which in all cases uses division by a constant. There it was surprising that your loop unrolling broke the 1 ns barrier that we see here as the better result. It managed to complete 19 divisions+modulos within 7 ns. According to the findings here, Constantin's result of 44 ns was expected (just above 2 ns per step). (Pliable)
@MarkoTopolnik Yes, it's fundamental: if the constant is known in advance, it's possible to come up with a bunch of magic coefficients and operations (and to leverage the much faster mul, in particular) to do the division faster. Actually finding them and then performing the fast division takes longer than a general-purpose divide, since typically this involves finding a reciprocal. It's just that the compiler has (comparatively) all the time in the world. (Cultured)
@Cultured Yes, that's the kind of answer I'm looking for. For division by ten I noticed the magic multiplication factor of 0x6666666667 (I don't remember the exact number of 6's) and assumed that some modulo-arithmetic property of multiplication must be in action. What I am unsure of is how fundamentally slow this initial step is. For example, the hardware can parallelize any code paths without data dependencies. (Pliable)
If the reciprocal cannot be avoided, then it's obvious: in the initial step we use division itself. Obviously that cannot be sped up for the general case. (Pliable)
@MarkoTopolnik In order to find these, you have to do a fair bit of computation: several divisions, a search by approximation, etc. View the source of this thing to get an idea: hackersdelight.org/magic.htm. I'm no silicon expert, but that looks like a lot more work than tossing another block on the ALU. (Cultured)
Thanks, @pvg, great link! And yes, 0x66666667 is exactly the magic number produced for the divisor of 10 :) (Pliable; a quick verification sketch follows these comments)
@MarkoTopolnik And to finish the other half of this (despite the fact that it's taken me an impressive number of comments to dribble it out in a semi-coherent manner): general division is slow because, unlike the other basic arithmetic operations, it is iterative. Moral of the story: try not to divide by non-constants :) (Cultured)
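
To make the magic-number discussion above concrete, here is a quick self-contained check (my own sketch, not HotSpot's emitted code) that the multiply-and-shift replacement agrees with real division by 10. The 0x66666667 constant and the total shift of 34 are the published values from Hacker's Delight for a divisor of 10:

public class MagicDivCheck {
  public static void main(String[] args) {
    final long magic = 0x66666667L; // ceil(2^34 / 10)
    for (int n = -2_000_000; n <= 2_000_000; n++) {
      int q = (int) ((magic * n) >> 34); // floor(magic * n / 2^34)
      q += n >>> 31;                     // round toward zero for negative n
      if (q != n / 10) {
        throw new AssertionError("mismatch at n = " + n);
      }
    }
    System.out.println("multiply-and-shift matches n / 10 over the tested range");
  }
}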
As explained by user pvg through the comments, the hypothesized algorithm does indeed exist, and it is the best one currently known. The algorithm involves division by the same divisor in its preparatory step, so it is fundamentally irreducible as a whole. It is covered in Chapter 10 of the classic book Hacker's Delight.
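
To illustrate why that head start matters, here is a simplified sketch of the idea (my own condensation; the real magic() routine in the book also searches for the smallest shift s that makes the result exact for every dividend). The preparatory step divides by d exactly once, and after that each runtime division by d costs only a multiply and a shift:

// The one unavoidable division by d happens here, once, at compile time.
static long magicFor(int d, int s) {
  return ((1L << (32 + s)) + d - 1) / d; // ceil(2^(32+s) / d)
}

// At run time the quotient needs only a 64-bit multiply and a shift.
// Valid for non-negative n when (d, s) satisfy the exactness condition
// the real routine searches for; e.g. d = 10, s = 2 yields
// magic = 0x66666667, the constant observed in the JIT output.
static int divideBy(int n, long magic, int s) {
  return (int) ((magic * n) >> (32 + s));
}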

Pliable answered 13/7, 2015 at 8:14.
