Why is returning a Java object reference so much slower than returning a primitive

We are working on a latency-sensitive application and have been microbenchmarking all kinds of methods (using JMH). After microbenchmarking a lookup method and being satisfied with the results, I implemented the final version, only to find that it was 3 times slower than what I had just benchmarked.

The culprit was that the implemented method was returning an enum object instead of an int. Here is a simplified version of the benchmark code:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class ReturnEnumObjectVersusPrimitiveBenchmark {

    enum Category {
        CATEGORY1,
        CATEGORY2,
    }

    @Param({"3", "2", "1"})
    String value;

    int param;

    @Setup
    public void setUp() {
        param = Integer.parseInt(value);
    }

    @Benchmark
    public int benchmarkReturnOrdinal() {
        if (param < 2) {
            return Category.CATEGORY1.ordinal();
        }
        return Category.CATEGORY2.ordinal();
    }

    @Benchmark
    public Category benchmarkReturnReference() {
        if (param < 2) {
            return Category.CATEGORY1;
        }
        return Category.CATEGORY2;
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ReturnEnumObjectVersusPrimitiveBenchmark.class.getName())
                .warmupIterations(5)
                .measurementIterations(4)
                .forks(1)
                .build();
        new Runner(opt).run();
    }

}

The benchmark results for the above:

# VM invoker: C:\Program Files\Java\jdk1.7.0_40\jre\bin\java.exe
# VM options: -Dfile.encoding=UTF-8

Benchmark                   (value)   Mode  Samples     Score     Error   Units
benchmarkReturnOrdinal            3  thrpt        4  1059.898 ±  71.749  ops/us
benchmarkReturnOrdinal            2  thrpt        4  1051.122 ±  61.238  ops/us
benchmarkReturnOrdinal            1  thrpt        4  1064.067 ±  90.057  ops/us
benchmarkReturnReference          3  thrpt        4   353.197 ±  25.946  ops/us
benchmarkReturnReference          2  thrpt        4   350.902 ±  19.487  ops/us
benchmarkReturnReference          1  thrpt        4   339.578 ± 144.093  ops/us

Just changing the return type of the method changed the performance by a factor of almost 3.

I thought that the sole difference between returning an enum object and returning an integer is that one returns a 64-bit value (a reference) and the other returns a 32-bit value. One of my colleagues guessed that returning the enum adds extra overhead because of the need to track the reference for potential GC. (But given that enum objects are static final references, it seems strange that this would be necessary.)

What is the explanation for the performance difference?


UPDATE

I shared the maven project here so that anyone can clone it and run the benchmark. If anyone has the time/interest, it would be helpful to see whether others can replicate the same results. (I've replicated it on 2 different machines, Windows 64 and Linux 64, both using flavors of Oracle Java 1.7 JVMs.) @ZhekaKozlov says he did not see any difference between the methods.

To run (after cloning the repository):

mvn clean install
java -jar .\target\microbenchmarks.jar function.ReturnEnumObjectVersusPrimitiveBenchmark -i 5 -wi 5 -f 1
Branch asked 6/4, 2015 at 14:1. Comments (1):
Comments are not for extended discussion; this conversation has been moved to chat. – Telles
Answer (score: 159)

TL;DR: You should not put BLIND trust into anything.

First things first: it is important to verify the experimental data before jumping to conclusions from them. Just claiming something is 3x faster/slower is odd, because you really need to follow up on the reason for the performance difference, not just trust the numbers. This is especially important for nano-benchmarks like yours.

Second, experimenters should clearly understand what they control and what they don't. In your particular example, you are returning the value from the @Benchmark methods, but can you be reasonably sure the callers outside will do the same thing for the primitive and the reference? If you ask yourself this question, you'll realize you are basically measuring the test infrastructure.

Down to the point. On my machine (i5-4210U, Linux x86_64, JDK 8u40), the test yields:

Benchmark                    (value)   Mode  Samples  Score   Error   Units
...benchmarkReturnOrdinal          3  thrpt        5  0.876 ± 0.023  ops/ns
...benchmarkReturnOrdinal          2  thrpt        5  0.876 ± 0.009  ops/ns
...benchmarkReturnOrdinal          1  thrpt        5  0.832 ± 0.048  ops/ns
...benchmarkReturnReference        3  thrpt        5  0.292 ± 0.006  ops/ns
...benchmarkReturnReference        2  thrpt        5  0.286 ± 0.024  ops/ns
...benchmarkReturnReference        1  thrpt        5  0.293 ± 0.008  ops/ns

Okay, so the reference tests appear 3x slower. But wait, this uses an old JMH (1.1.1); let's update to the latest (1.7.1):

Benchmark                    (value)   Mode  Cnt  Score   Error   Units
...benchmarkReturnOrdinal          3  thrpt    5  0.326 ± 0.010  ops/ns
...benchmarkReturnOrdinal          2  thrpt    5  0.329 ± 0.004  ops/ns
...benchmarkReturnOrdinal          1  thrpt    5  0.329 ± 0.004  ops/ns
...benchmarkReturnReference        3  thrpt    5  0.288 ± 0.005  ops/ns
...benchmarkReturnReference        2  thrpt    5  0.288 ± 0.005  ops/ns
...benchmarkReturnReference        1  thrpt    5  0.288 ± 0.002  ops/ns

Oops, now they are only barely slower. BTW, this also tells us the test is infrastructure-bound. Okay, can we see what really happens?

If you build the benchmarks and look at what exactly calls your @Benchmark methods, you'll see something like:

public void benchmarkReturnOrdinal_thrpt_jmhStub(InfraControl control, RawResults result, ReturnEnumObjectVersusPrimitiveBenchmark_jmh l_returnenumobjectversusprimitivebenchmark0_0, Blackhole_jmh l_blackhole1_1) throws Throwable {
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        l_blackhole1_1.consume(l_returnenumobjectversusprimitivebenchmark0_0.benchmarkReturnOrdinal());
        operations++;
    } while(!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.measuredOps = operations;
}

That l_blackhole1_1 has a consume method, which "consumes" the values (see Blackhole for rationale). Blackhole.consume has overloads for references and primitives, and that alone is enough to justify the performance difference.
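
For reference, Blackhole exposes one consume overload per primitive type plus one for objects, and the generated stub binds to whichever overload matches the static return type of the @Benchmark method. A sketch of the two signatures in play here (the signatures are real, from org.openjdk.jmh.infra.Blackhole; the bodies below are placeholder comments, since the real implementations are carefully tuned and differ between JMH versions):

// from org.openjdk.jmh.infra.Blackhole (bodies elided)
public final void consume(int i)      { /* keep the int observably alive */ }
public final void consume(Object obj) { /* keep the reference observably alive; may let it escape */ }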

There is a rationale for why these methods look different: they are trying to be as fast as possible for their argument types. They do not necessarily exhibit the same performance characteristics, even though we try to match them, hence the more symmetric result with the newer JMH. You can even run with -prof perfasm to see the generated code for your tests and see why the performance differs, but that's beside the point here.
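
For completeness, hooking up that profiler is just an extra flag on the benchmark run (this assumes a Linux machine with perf available, plus the hsdis disassembler library so the JIT-generated assembly can be printed):

java -jar target/microbenchmarks.jar function.ReturnEnumObjectVersusPrimitiveBenchmark -prof perfasm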

If you really want to understand how returning a primitive and/or a reference differs performance-wise, you would need to enter the big scary grey zone of nuanced performance benchmarking. E.g. something like this test:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
public class PrimVsRef {

    @Benchmark
    public void prim() {
        doPrim();
    }

    @Benchmark
    public void ref() {
        doRef();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private int doPrim() {
        return 42;
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private Object doRef() {
        return this;
    }

}

...which yields the same result for primitives and references:

Benchmark       Mode  Cnt  Score   Error  Units
PrimVsRef.prim  avgt   25  2.637 ± 0.017  ns/op
PrimVsRef.ref   avgt   25  2.634 ± 0.005  ns/op

As I said above, these tests require following up on the reasons for the results. In this case, the generated code for both is almost the same, and that explains the result.

prim:

                  [Verified Entry Point]
 12.69%    1.81%    0x00007f5724aec100: mov    %eax,-0x14000(%rsp)
  0.90%    0.74%    0x00007f5724aec107: push   %rbp
  0.01%    0.01%    0x00007f5724aec108: sub    $0x30,%rsp         
 12.23%   16.00%    0x00007f5724aec10c: mov    $0x2a,%eax   ; load "42"
  0.95%    0.97%    0x00007f5724aec111: add    $0x30,%rsp
           0.02%    0x00007f5724aec115: pop    %rbp
 37.94%   54.70%    0x00007f5724aec116: test   %eax,0x10d1aee4(%rip)        
  0.04%    0.02%    0x00007f5724aec11c: retq  

ref:

                  [Verified Entry Point]
 13.52%    1.45%    0x00007f1887e66700: mov    %eax,-0x14000(%rsp)
  0.60%    0.37%    0x00007f1887e66707: push   %rbp
           0.02%    0x00007f1887e66708: sub    $0x30,%rsp         
 13.63%   16.91%    0x00007f1887e6670c: mov    %rsi,%rax     ; load "this"
  0.50%    0.49%    0x00007f1887e6670f: add    $0x30,%rsp
  0.01%             0x00007f1887e66713: pop    %rbp
 39.18%   57.65%    0x00007f1887e66714: test   %eax,0xe3e78e6(%rip)
  0.02%             0x00007f1887e6671a: retq   

[sarcasm] See how easy it is! [/sarcasm]

The pattern is: the simpler the question, the more you have to work out to make a plausible and reliable answer.

Amaurosis answered 6/4, 2015 at 18:9. Comments (18):
Competent answer. So how do you recommend performing a valid benchmark? I'd suggest building a loop and an unoptimizable consumer into the benchmark method itself. That way the test framework performance disappears in the noise. – Lustful

Uh, let me first answer generically. I recommend starting from something, looking at what it finally does, and fixing it until it does what you want. – Amaurosis

Now, the more concrete answer. JMH generates the loop itself, and it calls into the "unoptimizeable" Blackhole.consume for the user. You can probably pull it into the @Benchmark method, and use a non-inlineable method to sink the results into, but that only works until you meet a smarter optimizer... While we can reconsider what JMH does under the covers when that happens, the user hacks would inevitably lag behind. Then, users who put more trust in their holy code than in a holy benchmarking framework will burn in hell! – Amaurosis

Thanks for the detailed explanation. The 1.7.1 version of JMH does indeed show virtually the same times. If I had started with that version, I could have avoided the whole question. As I wrote, I started by comparing different lookup functions, which were not "infrastructure-bound", and then when I rechecked the function I actually wrote against the previous tests, I saw this big difference. The purpose of asking the question was to find out what the difference was. (Trust me, I started trying to decompile, etc., to see what was happening.) – Branch

May they burn. Btw, why don't you implement consume as if (volatileRead() != volatileRead()) volatileStore(argument);? Seems simpler and faster. The branch will be ~100% predicted to be not taken, and independent volatile loads from L1 on x86 are pretty much cost-free. – Lustful

@usr: This is briefly discussed in the Blackhole implementation notes in the Blackhole source itself. In short, you should allow objects to escape on warmup to compile the slow path that does the stores. Blackhole design is much more complicated than anyone (including me) normally expects; there is no "simple" answer there. – Amaurosis

@SamGoldberg: Understood, not blaming you. I'm actually surprised how the thread blew up with unrelated answers before addressing an obvious one. I guess it's far too easy to assume that if you don't see something (e.g. what happens in the code calling @Benchmark), it costs nothing :) – Amaurosis

I read the source now and that comment doesn't make sense to me. If the slow path is never taken, it doesn't matter whether it is compiled or not. It should be statically live but runtime-dead. That's all that is needed. If you fear that the volatile is being optimized away - start a timer that once every hour reads and writes the field and interacts with the environment, such as time or files. – Lustful

If the slow path is not taken, we might as well skip allocating the object and work with exploded fields. When we "detect" the write, deoptimize, allocate the object, and continue writing it. Again, designing Blackhole is way beyond "doesn't matter", "that's all that is needed", etc. Many simple implementations have fallen :) – Amaurosis

Aleksey: actually, what was the difference between the 1.1.1 Blackhole code and the 1.7.1 code? – Branch

@SamGoldberg: I think you can find the answers in the Mercurial history: hg.openjdk.java.net/code-tools/jmh/log/96d8047fbf9a/jmh-core/… We were trying to align the performance characteristics of the BH methods, but succeeded only so much. – Amaurosis

@AlekseyShipilev well, HotSpot is new territory for me. The .NET JIT does not do anything clever (or well). It is primitive. Let's just exchange our JITs so each of us can be happy :) You get a reliable slow JIT and I get HotSpot. Maybe you should talk to the HotSpot people and have them build a blackhole JIT intrinsic for you. Benchmarking is an important use case. – Lustful

@Lustful With HotSpot, you can use -Xint, and most of your benchmarking troubles will go away. (Please don't.) "Power companies hate me for that." (c) – Amaurosis

@AlekseyShipilev with .NET, -Xint is active all the time! ;-) (Kidding - .NET never interprets, which is another weakness.) – Lustful

Great answer. Personally, I'd definitely just look at the assembly for the given tests, because these kinds of micro-micro tests are horribly easy to get wrong even with JMH, as this question nicely shows. If you're that worried about performance, then understanding assembly is an incredibly useful tool. @Lustful If I ever have enough time I actually want to implement something like JMH for .NET (I have about a third done), and it's oh so very, very much easier there compared to Java/HotSpot ;) – Rickierickman

@Rickierickman I've had the same idea (re: JMH for .NET), see github.com/minibench/minibench.warren for the current status. There's an example of a test here: github.com/minibench/minibench.warren/blob/master/… – Fasto

@AlekseyShipilev can I ask you to respond to #42013158? It is quite hard for me to understand, because it is hard to find someone who understands the JMM fully. – Ike

@AlekseyShipilev, please take a look at my question: #46019308. Thanks in advance :) – Splatter
Answer (score: 5)

To clear up a misconception about references and memory that some have fallen into (@Mzf), let's dive into the Java Virtual Machine Specification. But before going there, one thing must be clarified: an object can never be retrieved from memory as a whole, only its fields can. In fact, there is no opcode that would perform such an extensive operation.
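
To make that concrete, here is a sketch of what a field read looks like at the bytecode level, using a hypothetical int field named count (the constant-pool index is illustrative). Only the reference and the single field value ever cross the operand stack, never the object as a whole:

// int getCount() { return this.count; }  -- hypothetical accessor
aload_0        // push the `this` reference onto the operand stack
getfield #2    // Field count:I - read one int field through the reference
ireturn        // return the primitive value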

The specification defines reference as a stack type (so that it may be a result of, or an argument to, instructions performing operations on the stack) of the first category - the category of types taking a single stack word (nominally 32 bits). See table 2.3, A list of Java Stack Types.

Furthermore, if the method invocation completes normally according to the specification, a value popped from the top of the stack is pushed onto the stack of the method's invoker (section 2.6.4).

Your question is what causes the difference in execution times. The foreword to Chapter 2 answers:

Implementation details that are not part of the Java Virtual Machine's specification would unnecessarily constrain the creativity of implementors. For example, the memory layout of run-time data areas, the garbage-collection algorithm used, and any internal optimization of the Java Virtual Machine instructions (for example, translating them into machine code) are left to the discretion of the implementor.

In other words, because no performance penalty for using a reference is stated in the document, for logical reasons (it's ultimately just a stack word, as int or float are), you're left searching the source code of your implementation, or may never find out at all.

That said, we shouldn't always blame the implementation; there are some clues to take when looking for answers. Java defines separate instructions for manipulating numbers and references. Reference-manipulating instructions start with a (e.g. astore, aload, or areturn) and are the only instructions allowed to work with references. In particular, you may be interested in looking at areturn's implementation.
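
To see which return instruction each of the original benchmark methods compiles to, you can run javap -c against the compiled class. The output below is abbreviated, and the constant-pool indices are illustrative, but the shape is what you would expect: one method ends in ireturn, the other in areturn:

javap -c ReturnEnumObjectVersusPrimitiveBenchmark

public int benchmarkReturnOrdinal();
    ...
    getstatic     #3    // Field Category.CATEGORY2:LCategory;
    invokevirtual #4    // Method Category.ordinal:()I
    ireturn             // return the primitive int

public Category benchmarkReturnReference();
    ...
    getstatic     #3    // Field Category.CATEGORY2:LCategory;
    areturn             // return the reference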

Millihenry answered 6/4, 2015 at 15:4. Comments (3):
Do references necessarily take up 32 bits? – Luckett

It is not immediately obvious how this answers the question. – Monsoon

Talking about Java bytecode when wondering about the performance of a Java program always misses the point. "Look at the implementation of areturn" doesn't make any sense - that's not how modern compilers work (even the HotSpot interpreter doesn't really interpret one instruction at a time any more, for performance reasons). – Rickierickman
