jmh indicates that M1 is faster than M2 but M1 delegates to M2

I wrote a JMH benchmark involving two methods, M1 and M2. M1 invokes M2, yet for some reason JMH claims that M1 is faster than M2.

Here is the benchmark source-code:

import java.util.concurrent.TimeUnit;
import static org.bitbucket.cowwoc.requirements.Requirements.assertThat;
import static org.bitbucket.cowwoc.requirements.Requirements.requireThat;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

    @Benchmark
    public void assertMethod() {
        assertThat("value", "name").isNotNull().isNotEmpty();
    }

    @Benchmark
    public void requireMethod() {
        requireThat("value", "name").isNotNull().isNotEmpty();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }
}

In the above example, M1 is assertThat() and M2 is requireThat(); that is, assertThat() invokes requireThat() under the hood.

Here is the benchmark output:

# JMH 1.13 (released 8 days ago)
# VM version: JDK 1.8.0_102, VM 25.102-b14
# VM invoker: C:\Program Files\Java\jdk1.8.0_102\jre\bin\java.exe
# VM options: -ea
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mycompany.jmh.MyBenchmark.assertMethod

# Run progress: 0.00% complete, ETA 00:01:20
# Fork: 1 of 1
# Warmup Iteration   1: 8.268 ns/op
# Warmup Iteration   2: 6.082 ns/op
# Warmup Iteration   3: 4.846 ns/op
# Warmup Iteration   4: 4.854 ns/op
# Warmup Iteration   5: 4.834 ns/op
# Warmup Iteration   6: 4.831 ns/op
# Warmup Iteration   7: 4.815 ns/op
# Warmup Iteration   8: 4.839 ns/op
# Warmup Iteration   9: 4.825 ns/op
# Warmup Iteration  10: 4.812 ns/op
# Warmup Iteration  11: 4.806 ns/op
# Warmup Iteration  12: 4.805 ns/op
# Warmup Iteration  13: 4.802 ns/op
# Warmup Iteration  14: 4.813 ns/op
# Warmup Iteration  15: 4.805 ns/op
# Warmup Iteration  16: 4.818 ns/op
# Warmup Iteration  17: 4.815 ns/op
# Warmup Iteration  18: 4.817 ns/op
# Warmup Iteration  19: 4.812 ns/op
# Warmup Iteration  20: 4.810 ns/op
Iteration   1: 4.805 ns/op
Iteration   2: 4.816 ns/op
Iteration   3: 4.813 ns/op
Iteration   4: 4.938 ns/op
Iteration   5: 5.061 ns/op
Iteration   6: 5.129 ns/op
Iteration   7: 4.828 ns/op
Iteration   8: 4.837 ns/op
Iteration   9: 4.819 ns/op
Iteration  10: 4.815 ns/op
Iteration  11: 4.872 ns/op
Iteration  12: 4.806 ns/op
Iteration  13: 4.811 ns/op
Iteration  14: 4.827 ns/op
Iteration  15: 4.837 ns/op
Iteration  16: 4.842 ns/op
Iteration  17: 4.812 ns/op
Iteration  18: 4.809 ns/op
Iteration  19: 4.806 ns/op
Iteration  20: 4.815 ns/op


Result "assertMethod":
  4.855 ±(99.9%) 0.077 ns/op [Average]
  (min, avg, max) = (4.805, 4.855, 5.129), stdev = 0.088
  CI (99.9%): [4.778, 4.932] (assumes normal distribution)


# JMH 1.13 (released 8 days ago)
# VM version: JDK 1.8.0_102, VM 25.102-b14
# VM invoker: C:\Program Files\Java\jdk1.8.0_102\jre\bin\java.exe
# VM options: -ea
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mycompany.jmh.MyBenchmark.requireMethod

# Run progress: 50.00% complete, ETA 00:00:40
# Fork: 1 of 1
# Warmup Iteration   1: 7.193 ns/op
# Warmup Iteration   2: 4.835 ns/op
# Warmup Iteration   3: 5.039 ns/op
# Warmup Iteration   4: 5.053 ns/op
# Warmup Iteration   5: 5.077 ns/op
# Warmup Iteration   6: 5.102 ns/op
# Warmup Iteration   7: 5.088 ns/op
# Warmup Iteration   8: 5.109 ns/op
# Warmup Iteration   9: 5.096 ns/op
# Warmup Iteration  10: 5.096 ns/op
# Warmup Iteration  11: 5.091 ns/op
# Warmup Iteration  12: 5.089 ns/op
# Warmup Iteration  13: 5.099 ns/op
# Warmup Iteration  14: 5.097 ns/op
# Warmup Iteration  15: 5.090 ns/op
# Warmup Iteration  16: 5.096 ns/op
# Warmup Iteration  17: 5.088 ns/op
# Warmup Iteration  18: 5.086 ns/op
# Warmup Iteration  19: 5.087 ns/op
# Warmup Iteration  20: 5.097 ns/op
Iteration   1: 5.097 ns/op
Iteration   2: 5.088 ns/op
Iteration   3: 5.092 ns/op
Iteration   4: 5.097 ns/op
Iteration   5: 5.082 ns/op
Iteration   6: 5.089 ns/op
Iteration   7: 5.086 ns/op
Iteration   8: 5.084 ns/op
Iteration   9: 5.090 ns/op
Iteration  10: 5.086 ns/op
Iteration  11: 5.084 ns/op
Iteration  12: 5.088 ns/op
Iteration  13: 5.091 ns/op
Iteration  14: 5.092 ns/op
Iteration  15: 5.085 ns/op
Iteration  16: 5.096 ns/op
Iteration  17: 5.078 ns/op
Iteration  18: 5.125 ns/op
Iteration  19: 5.089 ns/op
Iteration  20: 5.091 ns/op


Result "requireMethod":
  5.091 ±(99.9%) 0.008 ns/op [Average]
  (min, avg, max) = (5.078, 5.091, 5.125), stdev = 0.010
  CI (99.9%): [5.082, 5.099] (assumes normal distribution)


# Run complete. Total time: 00:01:21

Benchmark                       Mode  Cnt  Score   Error  Units
MyBenchmark.assertMethod        avgt   20  4.855 ± 0.077  ns/op
MyBenchmark.requireMethod       avgt   20  5.091 ± 0.008  ns/op

To reproduce this locally:

  1. Create a Maven project containing the above benchmark.

  2. Add the following dependency:

     <dependency>
         <groupId>org.bitbucket.cowwoc</groupId>
         <artifactId>requirements</artifactId>
         <version>2.0.0</version>
     </dependency>
    
  3. Alternatively, download the library from https://github.com/cowwoc/requirements.java/

I have the following questions:

  1. Can you reproduce this result on your end?
  2. What, if anything, is wrong with the benchmark?

UPDATE: I am finally getting consistent, meaningful results.

Benchmark                  Mode  Cnt   Score   Error  Units
MyBenchmark.assertMethod   avgt   60  22.552 ± 0.020  ns/op
MyBenchmark.requireMethod  avgt   60  22.411 ± 0.114  ns/op

By consistent, I mean that I get almost the same values across runs.

By meaningful, I mean that assertMethod() is slower than requireMethod().

I made the following changes:

  • Locked the CPU clock (min/max CPU set to 99% in Windows Power Options)
  • Added JVM options -XX:-TieredCompilation -XX:-ProfileInterpreter
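
For reference, those JVM options can also be passed through JMH itself so that every forked VM inherits them; a minimal sketch (the class name is illustrative):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class LockedCompilationRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                // Forked benchmark VMs inherit these flags; no manual command line needed.
                .jvmArgsAppend("-XX:-TieredCompilation", "-XX:-ProfileInterpreter")
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}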

Is anyone able to achieve these results without the doubling of run times?

UPDATE2: Disabling inlining yields the same results without a noticeable performance slowdown. I posted a more detailed answer below.

Idou answered 31/7, 2016 at 18:6 Comment(7)
This might not be the issue, but you are supposed to get inputs from @State fields, and sink the outputs into the @Benchmark return value or into an explicit Blackhole (a sketch of this pattern follows these comments). See: gist.github.com/shipilev/712b5e4e4800c3ed982dcbae97e4d6df – Wivinah
I can't reproduce the difference on a similar configuration anyhow, which puts some suspicion on the validity of the environment setup. Try running the JMH core benchmark tests? Runnable JAR: central.maven.org/maven2/org/openjdk/jmh/jmh-core-benchmarks/… – Wivinah
@AlekseyShipilev Good catch about the non-final inputs and sinking the output. I now get more realistic numbers (see UPDATE2), but assertMethod() is still faster than requireMethod(). I also ran the JMH core benchmark tests as you mentioned, and they seem to come back normal. I can't post the result on Stack Overflow due to the 30k character limit, but if you email me (see my profile for my email address) I will send you the file. – Idou
@AlekseyShipilev What's interesting is that if I only run one benchmark method at a time (I comment out the other), I still get the same results. That means one benchmark cannot be interfering with the test results of the other. – Idou
@Gili: I blame your OS (Windows) for not being good with benchmarks. Below is what I get on Linux. – Fer
So, it might be telling to see the differences in the generated code. On Windows, you'd need hsdis.dll, xperf from the Windows Performance Toolkit, and -prof xperfasm. – Wivinah
@AlekseyShipilev I posted the benchmark, output, jmh-test output and xperfasm output to bitbucket.org/cowwoc/requirements/downloads. As far as I can tell, everything is correct. Please double-check and let me know if you find any mistakes. – Idou
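
A minimal sketch of the pattern from the first comment, using the requireThat chain from the question (the class name is illustrative; the explicit-Blackhole variant is shown):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

import static org.bitbucket.cowwoc.requirements.Requirements.requireThat;

@State(Scope.Benchmark)
public class BlackholeBenchmark {

    // Inputs come from non-final @State fields, not from compile-time constants.
    private String name = "name";
    private String value = "value";

    @Benchmark
    public void requireMethod(Blackhole blackhole) {
        // Sinking the result prevents dead-code elimination of the whole call chain.
        blackhole.consume(requireThat(value, name).isNotNull().isNotEmpty());
    }
}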

In this particular case, assertMethod is indeed compiled better than requireMethod due to register allocation issues.

The benchmark looks correct, and I can consistently reproduce your results.
To analyze the problem, I made the following simplified benchmark:

package bench;

import com.google.common.collect.ImmutableMap;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class Requirements {
    private static boolean enabled = true;

    private String name = "name";
    private String value = "value";

    @Benchmark
    public Object assertMethod() {
        if (enabled)
            return requireThat(value, name);
        return null;
    }

    @Benchmark
    public Object requireMethod() {
        return requireThat(value, name);
    }

    public static Object requireThat(String parameter, String name) {
        if (name.trim().isEmpty())
            throw new IllegalArgumentException();
        return new StringRequirementsImpl(parameter, name, new Configuration());
    }

    static class Configuration {
        private Object context = ImmutableMap.of();
    }

    static class StringRequirementsImpl {
        private String parameter;
        private String name;
        private Configuration config;
        private ObjectRequirementsImpl asObject;

        StringRequirementsImpl(String parameter, String name, Configuration config) {
            this.parameter = parameter;
            this.name = name;
            this.config = config;
            this.asObject = new ObjectRequirementsImpl(parameter, name, config);
        }
    }

    static class ObjectRequirementsImpl {
        private Object parameter;
        private String name;
        private Configuration config;

        ObjectRequirementsImpl(Object parameter, String name, Configuration config) {
            this.parameter = parameter;
            this.name = name;
            this.config = config;
        }
    }
}

First of all, I verified with -XX:+PrintInlining that the whole benchmark is inlined into one big method. Obviously this compilation unit has lots of nodes, and there are not enough CPU registers to hold all the intermediate variables; that is, the compiler needs to spill some of them.

  • In assertMethod 4 registers are spilled to the stack before the call to trim().
  • In requireMethod 7 registers are spilled later, after the call to new Configuration().

-XX:+PrintAssembly output:

  assertMethod             |  requireMethod
  -------------------------|------------------------
  mov    %r11d,0x5c(%rsp)  |  mov    %rcx,0x20(%rsp)
  mov    %r10d,0x58(%rsp)  |  mov    %r11,0x48(%rsp)
  mov    %rbp,0x50(%rsp)   |  mov    %r10,0x30(%rsp)
  mov    %rbx,0x48(%rsp)   |  mov    %rbp,0x50(%rsp)
                           |  mov    %r9d,0x58(%rsp)
                           |  mov    %edi,0x5c(%rsp)
                           |  mov    %r8,0x60(%rsp) 

Apart from the if (enabled) check, this is almost the only difference between the two compiled methods. So, the performance difference is explained by more variables being spilled to memory.

Why is the smaller method compiled less optimally, then? Well, the register allocation problem is known to be NP-complete. Since it cannot be solved ideally in reasonable time, compilers usually rely on certain heuristics. In a big method, a tiny thing like an extra if may significantly change the result of the register allocation algorithm.

However, you don't need to worry about that. The effect we've seen does not mean that requireMethod is always compiled worse; in other use cases the compilation graph will be completely different due to inlining. Anyway, a 1-nanosecond difference is nothing for real application performance.

Pembroke answered 10/8, 2016 at 1:42 Comment(1)
Finally, an answer that makes sense! :) I especially like the fact that you were able to reproduce the problem using a simplified benchmark. Good job! Please try to move as much of the gist as possible into this answer to prevent broken links in the future. – Idou

You are running your test within a single VM process by specifying forks(1). At runtime, the virtual machine looks at your code and tries to figure out how it is actually executed. It then creates so-called profiles to optimize your application according to this observed behavior.

What most likely happens here is called profile pollution, where running the first benchmark affects the outcome of the second benchmark. Oversimplified: if your VM was trained to do (a) very well by running its benchmark, it takes some additional time to get used to doing (b) afterwards. Therefore, (b) appears to take more time.

To avoid this, run your benchmark with multiple forks, where the different benchmarks are run on fresh VM processes, so that no such profile pollution can occur. You can read more about forking in the samples provided by JMH.
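
A minimal sketch of that change, using the class-level @Fork annotation (the fork count of 5 is illustrative; OptionsBuilder.forks(int) achieves the same thing). The constant inputs are kept as in the question for brevity; the next paragraph addresses them:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

import static org.bitbucket.cowwoc.requirements.Requirements.assertThat;
import static org.bitbucket.cowwoc.requirements.Requirements.requireThat;

@Fork(5) // every @Benchmark below runs in 5 fresh VM processes; profiles cannot leak across them
public class MyBenchmark {

    @Benchmark
    public void assertMethod() {
        assertThat("value", "name").isNotNull().isNotEmpty();
    }

    @Benchmark
    public void requireMethod() {
        requireThat("value", "name").isNotNull().isNotEmpty();
    }
}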

You should also check the sample on state: do not refer to your inputs as constants, but let JMH handle the values' escape so that an actual computation is applied.
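
A minimal sketch of that pattern, with non-final @State fields as inputs and the return value as the sink (the asker's updated benchmark below follows the same pattern):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import static org.bitbucket.cowwoc.requirements.Requirements.requireThat;

@State(Scope.Benchmark)
public class MyBenchmark {

    // Non-final fields: javac cannot fold them into compile-time constants.
    private String name = "name";
    private String value = "value";

    @Benchmark
    public Object requireMethod() {
        // Returning the result lets JMH sink it into a Blackhole automatically.
        return requireThat(value, name).isNotNull().isNotEmpty();
    }
}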

I would guess that, if applied properly, both benchmarks would yield similar runtimes.

Update: Here is what I get for the fixed benchmark:

Benchmark                  Mode  Cnt   Score   Error  Units
MyBenchmark.assertMethod   avgt   40  17,592 ± 1,493  ns/op
MyBenchmark.requireMethod  avgt   40  17,999 ± 0,920  ns/op

For the sake of completeness, I also ran the benchmark with perfasm; both methods are basically compiled into the same thing.

Fer answered 1/8, 2016 at 15:0 Comment(6)
This answer sounded promising, but neither forking nor state isolation has worked for me. Please try reproducing the problem locally. If you are able to resolve it, please post working code so I can confirm your findings on my end. – Idou
I added the updated benchmark to the question. Take a look. I would also appreciate you posting your own implementation in this answer. – Idou
I can already spot the error. Do not make the strings final; this creates a compile-time constant which is inlined by javac. Rather, annotate the benchmark itself and add the (non-final) fields to it. No need for the extra class. – Fer
I tried your suggestion (annotating the method and using non-final fields) but it did not help. See the updated answer. – Idou
Blaming Windows is an easy cop-out. I don't think we're there yet. For starters, why are your error margins so large? Mine are always less than 0.1 ns. With such large error margins, you can't conclusively say that your results are any different from mine (assertMethod() could still be faster than requireMethod()). Can you re-run and let me know if you get smaller error margins? – Idou
I just reproduced the same relative numbers (assertMethod() faster than requireMethod()) on a second Windows machine and an OSX machine. I posted the benchmark, output, jmh-test output and xperfasm output to bitbucket.org/cowwoc/requirements/downloads. Let me know if you see anything wrong. I am specifically wondering whether you get the same numbers if you run the exact benchmark I uploaded there (a self-contained Maven project). – Idou

Answering my own question:

It seems that inlining was skewing the results. All I needed to do to get consistent, meaningful results was the following:

  • Lock the CPU clock (min/max CPU set to 99% in Windows Power Options)
  • Disable inlining by annotating both methods with @CompilerControl(CompilerControl.Mode.DONT_INLINE).

I now get the following results:

Benchmark                  Mode  Cnt   Score   Error  Units
MyBenchmark.assertMethod   avgt  200  11.462 ± 0.048  ns/op
MyBenchmark.requireMethod  avgt  200  11.138 ± 0.062  ns/op

I tried analyzing the output of -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining but couldn't find anything wrong. Both methods seem to get inlined the same way. <shrug>
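
For anyone repeating this analysis, the diagnostic flags can be appended to the forked VMs straight from the runner; a sketch (the class name is illustrative):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class InliningLogRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                // Forked benchmark VMs print their inlining decisions to stdout.
                .jvmArgsAppend("-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintInlining")
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}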


The benchmark source-code is:

import java.util.concurrent.TimeUnit;
import static org.bitbucket.cowwoc.requirements.Requirements.assertThat;
import static org.bitbucket.cowwoc.requirements.Requirements.requireThat;
import org.bitbucket.cowwoc.requirements.StringRequirements;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)
public class MyBenchmark {

    private String name = "name";
    private String value = "value";

    @Benchmark
    public void emptyMethod() {
    }

    // Inlining leads to unexpected results: https://mcmap.net/q/1065850/-jmh-indicates-that-m1-is-faster-than-m2-but-m1-delegates-to-m2
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public StringRequirements assertMethod() {
        return assertThat(value, name).isNotNull().isNotEmpty();
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public StringRequirements requireMethod() {
        return requireThat(value, name).isNotNull().isNotEmpty();
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .jvmArgsAppend("-ea")
                .forks(3)
                .timeUnit(TimeUnit.NANOSECONDS)
                .mode(Mode.AverageTime)
                .build();

        new Runner(opt).run();
    }
}

UPDATE: apangin seems to have figured out why assertMethod() is faster than requireMethod().

Idou answered 9/8, 2016 at 21:38 Comment(0)

This is very frequent in microbenchmarks. When I download your code I get the same result, but with different figures; apparently my computer is slower than yours. However, if I modify your source to use 5 forks, 100 warmup iterations, and 20 measurement iterations, then requireMethod becomes a little bit faster than assertMethod, as expected.
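
A sketch of those settings expressed through the runner, using JMH's OptionsBuilder (the counts match the ones described above; the class name is illustrative):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class LongRunBenchmark {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .forks(5)                   // fresh VM processes average out per-JVM noise
                .warmupIterations(100)      // long warmup so the JIT fully stabilizes
                .measurementIterations(20)
                .build();
        new Runner(opt).run();
    }
}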

JMH is great, but it is very easy to write tests that look good yet whose results you cannot trust because there are too few iterations.

Kimberly answered 8/8, 2016 at 13:47 Comment(1)
You had me excited, but I couldn't reproduce your result (using the same parameters you mentioned). I then tried going as high as 10 forks, 100 warmup iterations, and 100 measurement iterations, but assertMethod() still ran faster than requireMethod(). – Idou
