This JMH benchmark is inconsistent across machines - why?

I'm trying to write a method like this:

static boolean fitsInDouble(long x) {
  // return true if x can be represented
  // as a numerically-equivalent double
}
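
For instance (an illustration of the intended behavior, not part of the original question), 2^53 round-trips through double exactly, while 2^53 + 1 does not:

long a = 1L << 53;
System.out.println(a == (long) (double) a); // true: 2^53 is exactly representable
long b = a + 1;
System.out.println(b == (long) (double) b); // false: (double) b rounds back to 2^53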

And I'm trying to find the most efficient implementation. I settled on one, but then a coworker ran the benchmarks and got different relative results. The fastest implementation for me isn't the fastest for him.

Is there something wrong with these benchmarks?

package rnd;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.math.BigDecimal;
import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Measurement(iterations = 5)
@Warmup(iterations = 5)
public class Benchmarks {

  public static void main(String[] args) throws Exception {
    Options options = new OptionsBuilder()
        .include(Benchmarks.class.getName())
        .build();
    new Runner(options).run();
  }

  @Benchmark
  public void bigDecimal(Blackhole bh) {
    for (long x : NUMBERS) bh.consume(bigDecimal(x));
  }

  @Benchmark
  public void cast(Blackhole bh) {
    for (long x : NUMBERS) bh.consume(cast(x));
  }

  @Benchmark
  public void zeros(Blackhole bh) {
    for (long x : NUMBERS) bh.consume(zeros(x));
  }

  public static boolean bigDecimal(long x) {
    BigDecimal a = new BigDecimal(x);
    BigDecimal b = new BigDecimal((double) x);
    return a.compareTo(b) == 0;
  }

  public static boolean cast(long x) {
    // (double) Long.MAX_VALUE rounds up to 2^63, and casting that back to long
    // saturates to Long.MAX_VALUE again, so it needs an explicit exclusion.
    return x == (long) (double) x
        && x != Long.MAX_VALUE;
  }

  public static boolean zeros(long x) {
    // x fits in a double iff the span from its highest set bit down to its lowest
    // set bit is at most 53 bits wide (the width of a double's significand).
    long a = Math.abs(x);
    int z = Long.numberOfLeadingZeros(a);
    return z > 10 || Long.numberOfTrailingZeros(a) > 10 - z;
  }

  private static final long[] NUMBERS = {
      0,
      1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
      -1, -2, -3, -4, -5, -6, -7, -8, -9, -10,
      123, 456, 789,
      -123, -456, -789,
      101112, 131415, 161718,
      -101112, -131415, -161718,
      11L,
      222L,
      3333L,
      44444L,
      555555L,
      6666666L,
      77777777L,
      888888888L,
      9999999999L,
      1111L,
      22222L,
      333333L,
      4444444L,
      55555555L,
      666666666L,
      7777777777L,
      88888888888L,
      999999999999L,
      11111111,
      222222222,
      3333333333L,
      44444444444L,
      555555555555L,
      6666666666666L,
      77777777777777L,
      888888888888888L,
      9999999999999999L,
      Long.MAX_VALUE,
      Long.MAX_VALUE - 1,
      Long.MIN_VALUE,
      Long.MIN_VALUE + 1,
      (1L << 53),
      (1L << 53) + 1,
      (1L << 53) + 2,
      (1L << 60),
      (1L << 60) + 1,
      (1L << 60) + 8,
      (1L << 60) + 32,
      (1L << 60) + 64,
      (1L << 60) + 128,
      (1L << 60) + 256,
      (-1L << 53),
      (-1L << 53) - 1,
      (-1L << 53) - 2,
      (-1L << 60),
      (-1L << 60) - 1,
      (-1L << 60) - 8,
      (-1L << 60) - 32,
      (-1L << 60) - 64,
      (-1L << 60) - 128,
      (-1L << 60) - 256
  };
}

There are small differences in our environments.

Me: Windows 10, JDK 1.8.0_45, "zeros" is the fastest

Him: Windows 7, JDK 1.8.0_20, "cast" is the fastest

Our results are self-consistent from run to run, whether running in an IDE or from the command line. We're using JMH 1.10.5.

What is happening here? The benchmark seems untrustworthy and I don't know how to fix it.

Marketa asked 5/9, 2015 at 8:30 Comment(7)
I'm not competent to judge the quality of the benchmark, but there is a really simple answer to your question: the results are different because the environment is different: not the same JVM, not the same OS, probably not the same hardware. That makes three good reasons for the results to be different.Hoarding
@JBNizet if it turns out to be a JDK minor version or Windows version or processor-specific thing, that would be frustrating, but at least I'd have an answer. I'd like to know what exactly the cause is. The end goal is to put this method in a library. If the results are dependent on something in the environment, I'd like to understand which is more likely to be fast. Right now it looks like a coin flip to me. I'm hoping that I made a stupid error in the benchmark code.Marketa
I have both 1.8.0_45 and 1.8.0_20 installed. Both JVMs show a better result on the cast test, though the difference is not very big (around 380 ns vs 390 ns). For further investigation please provide the full per-iteration log for both machines.Tuff
@TagirValeev Thanks, I'll try to put up full logs when we're all at work this coming Tuesday. Hopefully I'll convince some other coworkers to join in. On my machine "zeros" is reliably 20% faster than "cast", always. If I'm the mutant I'll probably put "cast" into the library, since it's easier to understand.Marketa
Also note that real performance might differ, since users may call this method with numbers distributed in a different manner. Most of the test values here are corner cases, which is probably not representative of real code (I don't know the purpose of your method).Tuff
You shouldn't make decisions based on small differences. All you need is for the machine, the OS, the JVM, the temperature of the CPU, the location the program is loaded in memory, or what other code your program is running to be different to flip the outcome. When there is a small difference, you should consider other factors such as how simple or clean the code is.Raimundo
BTW If you have hyper-threading enabled, the operations being performed by the other thread on the same CPU can change the result.Raimundo

I can reproduce different results even on the same machine with the same environment: sometimes cast is slightly faster, sometimes zeros is.

# JMH 1.10.5 (released 9 days ago)
# VM invoker: C:\Program Files\Java\jdk1.8.0_40\jre\bin\java.exe
# VM options: -Didea.launcher.port=7540 -Didea.launcher.bin.path=C:\Program Files (x86)\IDEA 14.1.3\bin -Dfile.encoding=UTF-8
# Warmup: 3 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: bench.LongDouble.cast

# Run progress: 0,00% complete, ETA 00:01:20
# Fork: 1 of 5
# Warmup Iteration   1: 513,793 ns/op
# Warmup Iteration   2: 416,508 ns/op
# Warmup Iteration   3: 402,110 ns/op
Iteration   1: 402,535 ns/op
Iteration   2: 403,999 ns/op
Iteration   3: 404,871 ns/op
Iteration   4: 404,845 ns/op
Iteration   5: 401,705 ns/op

# Run progress: 10,00% complete, ETA 00:01:16
# Fork: 2 of 5
# Warmup Iteration   1: 421,552 ns/op
# Warmup Iteration   2: 418,925 ns/op
# Warmup Iteration   3: 421,813 ns/op
Iteration   1: 420,978 ns/op
Iteration   2: 422,940 ns/op
Iteration   3: 422,009 ns/op
Iteration   4: 423,011 ns/op
Iteration   5: 422,406 ns/op

# Run progress: 20,00% complete, ETA 00:01:07
# Fork: 3 of 5
# Warmup Iteration   1: 414,057 ns/op
# Warmup Iteration   2: 410,364 ns/op
# Warmup Iteration   3: 402,330 ns/op
Iteration   1: 402,776 ns/op
Iteration   2: 404,764 ns/op
Iteration   3: 400,346 ns/op
Iteration   4: 403,227 ns/op
Iteration   5: 403,350 ns/op

# Run progress: 30,00% complete, ETA 00:00:58
# Fork: 4 of 5
# Warmup Iteration   1: 422,161 ns/op
# Warmup Iteration   2: 419,118 ns/op
# Warmup Iteration   3: 402,990 ns/op
Iteration   1: 401,592 ns/op
Iteration   2: 402,999 ns/op
Iteration   3: 403,035 ns/op
Iteration   4: 402,625 ns/op
Iteration   5: 403,396 ns/op

# Run progress: 40,00% complete, ETA 00:00:50
# Fork: 5 of 5
# Warmup Iteration   1: 422,621 ns/op
# Warmup Iteration   2: 419,596 ns/op
# Warmup Iteration   3: 403,047 ns/op
Iteration   1: 403,438 ns/op
Iteration   2: 405,066 ns/op
Iteration   3: 403,271 ns/op
Iteration   4: 403,021 ns/op
Iteration   5: 402,162 ns/op


Result "cast":
  406,975 ±(99.9%) 5,906 ns/op [Average]
  (min, avg, max) = (400,346, 406,975, 423,011), stdev = 7,884
  CI (99.9%): [401,069, 412,881] (assumes normal distribution)


# JMH 1.9.3 (released 114 days ago, please consider updating!)
# VM invoker: C:\Program Files\Java\jdk1.8.0_40\jre\bin\java.exe
# VM options: -Didea.launcher.port=7540 -Didea.launcher.bin.path=C:\Program Files (x86)\IDEA 14.1.3\bin -Dfile.encoding=UTF-8
# Warmup: 3 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: bench.LongDouble.zeros

# Run progress: 50,00% complete, ETA 00:00:41
# Fork: 1 of 5
# Warmup Iteration   1: 439,529 ns/op
# Warmup Iteration   2: 437,752 ns/op
# Warmup Iteration   3: 390,530 ns/op
Iteration   1: 389,394 ns/op
Iteration   2: 391,453 ns/op
Iteration   3: 390,446 ns/op
Iteration   4: 390,822 ns/op
Iteration   5: 389,850 ns/op

# Run progress: 60,00% complete, ETA 00:00:33
# Fork: 2 of 5
# Warmup Iteration   1: 438,252 ns/op
# Warmup Iteration   2: 437,446 ns/op
# Warmup Iteration   3: 448,328 ns/op
Iteration   1: 389,979 ns/op
Iteration   2: 392,741 ns/op
Iteration   3: 390,575 ns/op
Iteration   4: 390,492 ns/op
Iteration   5: 390,000 ns/op

# Run progress: 70,00% complete, ETA 00:00:25
# Fork: 3 of 5
# Warmup Iteration   1: 447,939 ns/op
# Warmup Iteration   2: 444,489 ns/op
# Warmup Iteration   3: 414,433 ns/op
Iteration   1: 417,409 ns/op
Iteration   2: 413,518 ns/op
Iteration   3: 413,388 ns/op
Iteration   4: 414,040 ns/op
Iteration   5: 415,935 ns/op

# Run progress: 80,00% complete, ETA 00:00:16
# Fork: 4 of 5
# Warmup Iteration   1: 439,012 ns/op
# Warmup Iteration   2: 437,345 ns/op
# Warmup Iteration   3: 388,208 ns/op
Iteration   1: 395,647 ns/op
Iteration   2: 389,221 ns/op
Iteration   3: 387,539 ns/op
Iteration   4: 388,524 ns/op
Iteration   5: 387,623 ns/op

# Run progress: 90,00% complete, ETA 00:00:08
# Fork: 5 of 5
# Warmup Iteration   1: 446,116 ns/op
# Warmup Iteration   2: 446,622 ns/op
# Warmup Iteration   3: 409,116 ns/op
Iteration   1: 409,761 ns/op
Iteration   2: 410,146 ns/op
Iteration   3: 410,060 ns/op
Iteration   4: 409,370 ns/op
Iteration   5: 411,114 ns/op


Result "zeros":
  399,162 ±(99.9%) 8,487 ns/op [Average]
  (min, avg, max) = (387,539, 399,162, 417,409), stdev = 11,330
  CI (99.9%): [390,675, 407,649] (assumes normal distribution)


# Run complete. Total time: 00:01:23

Benchmark         Mode  Cnt    Score   Error  Units
LongDouble.cast   avgt   25  406,975 ± 5,906  ns/op
LongDouble.zeros  avgt   25  399,162 ± 8,487  ns/op

After some analysis I've found that the problem is not in the benchmark itself but rather in JMH. The perfasm profiler pointed to the Blackhole.consume method:

public final void consume(boolean bool) {
    boolean bool1 = this.bool1; // volatile read
    boolean bool2 = this.bool2;
    if (bool == bool1 & bool == bool2) {
        // SHOULD NEVER HAPPEN
        nullBait.bool1 = bool; // implicit null pointer exception
    }
}

The interesting part is how bool1 and bool2 are initialized:

Random r = new Random(System.nanoTime());
...
bool1 = r.nextBoolean(); bool2 = !bool1;

Yes, they are random on every run! As you know, the JIT compiler relies on the run-time execution profile, so the generated code varies slightly depending on the initial values of bool1 and bool2; in particular, in half of the cases it considers the branch taken, and in the other half not taken. That's where the difference comes from.

I've submitted a report against JMH with a suggested fix, in case the authors confirm the defect.
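
Until such a fix lands, one mitigation on the benchmark side (a sketch, not part of the original report) is to run several forks, so that the randomly chosen profiles get averaged over multiple JVM invocations:

Options options = new OptionsBuilder()
    .include(Benchmarks.class.getName())
    .forks(10) // average over several JVM invocations instead of trusting a single random profile
    .build();
new Runner(options).run();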

Randellrandene answered 6/9, 2015 at 3:25 Comment(5)
This is one of the reasons why you cannot generally rely on a single fork: it misses an entire slab of run-to-run variance, caused not only by the benchmark and the benchmark harness but also by the runtime environment, OS, and hardware. The default JMH settings do 10 forks; if you "tune" that down to 1 fork, like the OP did, then you have to face the consequences.Oe
Oh neat! When I use multiple forks (whoops), I do see that run-to-run variance. "cast" shuffles between 290 ns and 270 ns, and "zeros" shuffles between 250 ns and 240 ns. So I think there's still this issue where "which is faster?" depends on... something environment-specific that I don't understand. But your discovery here is probably more interesting/useful than the answer to my original question.Marketa
@MichaelHixson, this particular bug is fixed in JMH. Have you tried to reproduce the results with the updated version?Tuff
@TagirValeev It fixes the run-to-run variance. It doesn't change the machine-to-machine variance, which is the subject of my original question. Sadly, I don't think I can justify asking my coworkers to spend more time on this, which means I won't be able to provide logs or other output for comparison. So, I don't think a true answer to my question will ever come in. At least something good came out of it (this answer and the bug fix). What do you think - should I mark it accepted?Marketa
@MichaelHixson, that's up to you. However if the problem is not actually resolved I would not accept an answer.Tuff

As JB Nizet notes, you cannot assume that a program will perform the same across multiple JVMs and/or operating systems, even more so if you have different machines.

By the way, you don't need numberOfLeadingZeros(a):

public static boolean zeroes2(final long x) {
    final long a = Math.abs(x);
    // shift out the trailing zeros; what remains must fit into a double's 53-bit significand
    return (a >>> Long.numberOfTrailingZeros(a)) < (1L << 53);
}
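
Whichever variant you pick, it is cheap to sanity-check the bit-twiddling versions against the straightforward BigDecimal one before worrying about speed. A sketch (it assumes the bigDecimal method from the question is visible here, and the probe values are just a hand-picked set of edge cases):

long[] probes = {0, 1, -1, (1L << 53), (1L << 53) + 1, (-1L << 53) - 1,
                 9999999999999999L, Long.MAX_VALUE, Long.MIN_VALUE};
for (long x : probes) {
    if (zeroes2(x) != bigDecimal(x)) {
        throw new AssertionError("mismatch at " + x);
    }
}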

Finally, if you really need the utmost performance, either benchmark on a representative sample of machines and pick the implementation that performs best overall (unless there are machines where it performs substantially worse, though with code this small that is quite unlikely), or ship all the variants and add a calibration step that benchmarks them and picks the one that is fastest on the machine it's running on.
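
A minimal sketch of what such a calibration shim could look like (illustrative only: the class name, loop count and probe values are made up, and it assumes the static cast and zeros methods from the question's Benchmarks class are accessible):

import java.util.function.LongPredicate;

final class FitsInDouble {
    // crude sink so the JIT cannot discard the calibration work entirely
    private static volatile boolean sink;

    private static final LongPredicate IMPL = calibrate();

    static boolean fitsInDouble(long x) {
        return IMPL.test(x);
    }

    private static LongPredicate calibrate() {
        LongPredicate[] candidates = { Benchmarks::cast, Benchmarks::zeros };
        LongPredicate best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (LongPredicate candidate : candidates) {
            long start = System.nanoTime();
            boolean acc = false;
            for (long i = 0; i < 1_000_000L; i++) {
                // cheap linear congruential step to probe a spread of inputs
                acc ^= candidate.test(i * 6364136223846793005L + 1442695040888963407L);
            }
            sink = acc;
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = candidate;
            }
        }
        return best;
    }
}

Of course, a one-shot timing like this suffers from exactly the warm-up and profiling pitfalls discussed above, so treat it as a rough heuristic rather than a JMH-grade measurement.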

Edit: Also as Javier states, be sure to benchmark with multiple real-world-like workloads.

Donne answered 5/9, 2015 at 9:20 Comment(0)

CPU models evolve over time. If the relative performance of the operations involved changes, or a newer CPU predicts the branches better, you can get a consistent difference between machines.

If you want to rule out that one alternative has a spurious advantage on this specific data set (e.g. the branch predictor learning the sequence, which it might not be able to do in real life), you could shuffle the data set and make it longer. It may well be futile (and debatable, if you need to run the exact same data), but it's easy to try:

final int times = 10;
List<Long> listShuffle = new ArrayList<>(NUMBERS.length * times);
for (long l : NUMBERS) {
    for (int i = 0; i < times; i++) {
        listShuffle.add(l);
    }
}

// Shake and serve chilled
Collections.shuffle(listShuffle);
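
One way to wire this in (a sketch, not from the original answer: the field and method names here are made up) is to build the shuffled primitive array once per trial in a @Setup method and loop over that array in the benchmark methods instead of NUMBERS:

// additional imports assumed: org.openjdk.jmh.annotations.Setup, org.openjdk.jmh.annotations.Level,
// java.util.ArrayList, java.util.Collections, java.util.List
private long[] shuffled;

@Setup(Level.Trial)
public void shuffleNumbers() {
    final int times = 10;
    List<Long> listShuffle = new ArrayList<>(NUMBERS.length * times);
    for (long l : NUMBERS) {
        for (int i = 0; i < times; i++) {
            listShuffle.add(l);
        }
    }
    Collections.shuffle(listShuffle);
    shuffled = listShuffle.stream().mapToLong(Long::longValue).toArray();
}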

P.S. Also, if you can feed in samples of real data rather than synthetic data, that might deliver more meaningful information. In that case, the CPU guessing well might actually be a good thing.

Questionless answered 5/9, 2015 at 10:51 Comment(0)
