Why is JMH saying that returning 1 is faster than returning 0

Can someone explain why JMH says that returning 1 is faster than returning 0?

Here is the benchmark code.

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(value = 3, jvmArgsAppend = {"-server", "-disablesystemassertions"})
public class ZeroVsOneBenchmark {

    @Benchmark
    @Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
    public int zero() {
        return 0;
    }

    @Benchmark
    @Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
    public int one() {
        return 1;
    }
}

Here is the result:

# Run complete. Total time: 00:03:05

Benchmark                       Mode   Samples        Score  Score error    Units
c.m.ZeroVsOneBenchmark.one     thrpt        60  1680674.502    24113.014   ops/ms
c.m.ZeroVsOneBenchmark.zero    thrpt        60   735975.568    14779.380   ops/ms

The same behaviour holds for one, two and zero:

# Run complete. Total time: 01:01:56

Benchmark                       Mode   Samples        Score  Score error    Units
c.m.ZeroVsOneBenchmark.one     thrpt        90  1762956.470     7554.807   ops/ms
c.m.ZeroVsOneBenchmark.two     thrpt        90  1764642.299     9277.673   ops/ms
c.m.ZeroVsOneBenchmark.zero    thrpt        90   773010.467     5031.920   ops/ms
Manslaughter answered 22/7, 2014 at 10:29 Comment(2)
Hey, I was creating a baseline and saw this behaviour; it's not that I'm spending my time measuring just this example. It's a simplified version for this thread.Manslaughter
Then create a non-simplified version showing what you see. It needs to be complex enough that all the HotSpot work is insignificant relative to what you want to measure.Viole

JMH is a good tool but still not perfect.

Certainly, there is no speed difference between returning 0, 1, or any other integer. However, it makes a difference how the value is consumed by JMH and how that code is compiled by the HotSpot JIT.

To prevent the JIT from optimizing out calculations, JMH uses the special Blackhole class to consume values returned from a benchmark. Here is the one for integer values:

public final void consume(int i) {
    if (i == i1 & i == i2) {
        // SHOULD NEVER HAPPEN
        nullBait.i1 = i; // implicit null pointer exception
    }
}
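To see why this condition "should never happen", here is a minimal, self-contained sketch of the guard (the field values 1 and 2 are an assumption standing in for Blackhole's internal state; the point is only that they are distinct, so no single i can equal both):

```java
public class BlackholeGuardDemo {
    // Stand-ins for Blackhole's fields; distinct values are what matters.
    static int i1 = 1;
    static int i2 = 2;

    // Mirrors the condition in consume(int): note the non-short-circuit &,
    // which forces both comparisons to be evaluated.
    static boolean guard(int i) {
        return i == i1 & i == i2;
    }

    public static void main(String[] args) {
        // While i1 != i2, the guard is false for every input.
        for (int i = -2; i <= 2; i++) {
            System.out.println(i + " -> " + guard(i));
        }
    }
}
```

Because the guard can never fire, the JIT cannot eliminate the returned value as dead code, yet the benchmark never pays the cost of the body.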

In consume(), i is the value returned from a benchmark; in your case it is either 0 or 1. When i == 1, the never-happens condition looks like if (1 == i1 & 1 == i2), which is compiled as follows:

0x0000000002b4ffe5: mov    0xb0(%r13),%r10d   ;*getfield i1
0x0000000002b4ffec: mov    0xb4(%r13),%r8d    ;*getfield i2
0x0000000002b4fff3: cmp    $0x1,%r8d
0x0000000002b4fff7: je     0x0000000002b50091  ;*return

But when i == 0, the JIT tries to "optimize" the two comparisons with 0 using setne instructions. However, the resulting code is considerably longer:

0x0000000002a40b28: mov    0xb0(%rdi),%r10d   ;*getfield i1
0x0000000002a40b2f: mov    0xb4(%rdi),%r8d    ;*getfield i2
0x0000000002a40b36: test   %r10d,%r10d
0x0000000002a40b39: setne  %r10b
0x0000000002a40b3d: movzbl %r10b,%r10d
0x0000000002a40b41: test   %r8d,%r8d
0x0000000002a40b44: setne  %r11b
0x0000000002a40b48: movzbl %r11b,%r11d
0x0000000002a40b4c: xor    $0x1,%r10d
0x0000000002a40b50: xor    $0x1,%r11d
0x0000000002a40b54: and    %r11d,%r10d
0x0000000002a40b57: test   %r10d,%r10d
0x0000000002a40b5a: jne    0x0000000002a40c15  ;*return

That is, the slower return 0 is explained by the larger number of CPU instructions executed in Blackhole.consume().

Note to JMH developers: I would suggest rewriting Blackhole.consume like this:

if (i == l1) {
    // SHOULD NEVER HAPPEN
    nullBait.i1 = i; // implicit null pointer exception
}

where volatile long l1 = Long.MIN_VALUE. In this case the condition is still always false, but it compiles identically for all return values.
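The invariant behind this suggestion can be checked with a short sketch (l1 here is a stand-in for the proposed field): comparing an int against a long widens the int to long first, and no int value can ever equal Long.MIN_VALUE, which lies far below Integer.MIN_VALUE.

```java
public class WideningGuardDemo {
    // Stand-in for the proposed field in the suggested rewrite.
    static volatile long l1 = Long.MIN_VALUE;

    static boolean guard(int i) {
        // i undergoes widening conversion to long before the comparison,
        // so this can never be true for any int input.
        return i == l1;
    }

    public static void main(String[] args) {
        int[] probes = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int i : probes) {
            System.out.println(i + " -> " + guard(i));
        }
    }
}
```

Since the comparison is against a constant-like volatile far outside the int range, the generated code should not depend on whether the benchmark returns 0, 1, or any other value.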

Harwell answered 22/7, 2014 at 22:35 Comment(5)
@AlekseyShipilev must be interested in that :)Harwell
this explains a lot. Thank you!Manslaughter
@apangin: That's an interesting idea, however: a) it does not scale to other data types, and we'd like to keep the consume-s consistent across the consumed types; b) it goes for widening conversion, which is bad for 32-bit platforms (think ARM).Limonene
The real takeaway from this example is that nanobenchmarks require validation on assembly level, which is convenient now with JMH's -prof perfasm :)Limonene
@apangin: if JMH isn't perfect, then which one is?Vivyan
