I am benchmarking the following Java code using JMH:
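import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

interface MyInterface {
    public int test(int i);
}

class A implements MyInterface {
    public int test(int i) {
        return (int) Math.sin(Math.cos(i));
    }
}

@State(Scope.Thread)
public class MyBenchmark {
    public MyInterface inter;

    @Setup(Level.Trial)
    public void init() {
        inter = new A();
    }

    @Benchmark
    public void testMethod(Blackhole sink) {
        int[] res = new int[2];
        // Same receiver, same argument: the second call is the one expected to be reused.
        res[0] = inter.test(1);
        res[1] = inter.test(1);
        sink.consume(res);
    }
}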
Using
mvn package && java -XX:-UseCompressedOops -XX:CompileCommand='print, *.testMethod' -jar target/benchmarks.jar -wi 10 -i 1 -f 1
I was able to get the assembly, and if we focus on the one from C2 (as shown below), we can see that both cos and sin are called twice.
ImmutableOopMap{}pc offsets: 796 812 828 Compiled method (c2) 402 563 4 org.sample.MyBenchmark::testMethod (42 bytes)
total in heap [0x00007efd3d74fb90,0x00007efd3d7503a0] = 2064
relocation [0x00007efd3d74fcd0,0x00007efd3d74fd08] = 56
constants [0x00007efd3d74fd20,0x00007efd3d74fd40] = 32
main code [0x00007efd3d74fd40,0x00007efd3d750040] = 768
stub code [0x00007efd3d750040,0x00007efd3d750068] = 40
oops [0x00007efd3d750068,0x00007efd3d750070] = 8
metadata [0x00007efd3d750070,0x00007efd3d750080] = 16
scopes data [0x00007efd3d750080,0x00007efd3d750108] = 136
scopes pcs [0x00007efd3d750108,0x00007efd3d750358] = 592
dependencies [0x00007efd3d750358,0x00007efd3d750360] = 8
handler table [0x00007efd3d750360,0x00007efd3d750390] = 48
nul chk table [0x00007efd3d750390,0x00007efd3d7503a0] = 16
----------------------------------------------------------------------
org/sample/MyBenchmark.testMethod(Lorg/openjdk/jmh/infra/Blackhole;)V [0x00007efd3d74fd40, 0x00007efd3d750068] 808 bytes
[Constants]
0x00007efd3d74fd20 (offset: 0): 0x00000000 0x3ff0000000000000
0x00007efd3d74fd24 (offset: 4): 0x3ff00000
0x00007efd3d74fd28 (offset: 8): 0xf4f4f4f4 0xf4f4f4f4f4f4f4f4
0x00007efd3d74fd2c (offset: 12): 0xf4f4f4f4
0x00007efd3d74fd30 (offset: 16): 0xf4f4f4f4 0xf4f4f4f4f4f4f4f4
0x00007efd3d74fd34 (offset: 20): 0xf4f4f4f4
0x00007efd3d74fd38 (offset: 24): 0xf4f4f4f4 0xf4f4f4f4f4f4f4f4
0x00007efd3d74fd3c (offset: 28): 0xf4f4f4f4
Argument 0 is unknown.RIP: 0x7efd3d74fd40 Code size: 0x00000328
[Entry Point]
# {method} {0x00007efd35857f08} 'testMethod' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'org/sample/MyBenchmark'
# this: rsi:rsi = 'org/sample/MyBenchmark'
# parm0: rdx:rdx = 'org/openjdk/jmh/infra/Blackhole'
# [sp+0x30] (sp of caller)
0x00007efd3d74fd40: cmp 0x8(%rsi),%rax ; {no_reloc}
0x00007efd3d74fd44: jne 0x7efd35c99c60 ; {runtime_call ic_miss_stub}
0x00007efd3d74fd4a: nop
0x00007efd3d74fd4c: nopl 0x0(%rax)
[Verified Entry Point]
0x00007efd3d74fd50: mov %eax,0xfffffffffffec000(%rsp)
0x00007efd3d74fd57: push %rbp
0x00007efd3d74fd58: sub $0x20,%rsp ;*synchronization entry
; - org.sample.MyBenchmark::testMethod@-1 (line 64)
0x00007efd3d74fd5c: mov %rdx,(%rsp)
0x00007efd3d74fd60: mov %rsi,%rbp
0x00007efd3d74fd63: mov 0x60(%r15),%rbx
0x00007efd3d74fd67: mov %rbx,%r10
0x00007efd3d74fd6a: add $0x1a8,%r10
0x00007efd3d74fd71: cmp 0x70(%r15),%r10
0x00007efd3d74fd75: jnb 0x7efd3d74ffcc
0x00007efd3d74fd7b: mov %r10,0x60(%r15)
0x00007efd3d74fd7f: prefetchnta 0xc0(%r10)
0x00007efd3d74fd87: movq $0x1,(%rbx)
0x00007efd3d74fd8e: prefetchnta 0x100(%r10)
0x00007efd3d74fd96: mov %rbx,%rdi
0x00007efd3d74fd99: add $0x18,%rdi
0x00007efd3d74fd9d: prefetchnta 0x140(%r10)
0x00007efd3d74fda5: prefetchnta 0x180(%r10)
0x00007efd3d74fdad: movabs $0x7efd350d9b38,%r10 ; {metadata({type array int})}
0x00007efd3d74fdb7: mov %r10,0x8(%rbx)
0x00007efd3d74fdbb: movl $0x64,0x10(%rbx)
0x00007efd3d74fdc2: mov $0x32,%ecx
0x00007efd3d74fdc7: xor %rax,%rax
0x00007efd3d74fdca: shl $0x3,%rcx
0x00007efd3d74fdce: rep stosb (%rdi) ;*newarray {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@4 (line 65)
0x00007efd3d74fdd1: mov 0x10(%rbp),%r10 ;*getfield inter {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@20 (line 67)
0x00007efd3d74fdd5: mov 0x8(%r10),%r10 ; implicit exception: dispatches to 0x00007efd3d74fffd
0x00007efd3d74fdd9: movabs $0x7efd3587f8c8,%r11 ; {metadata('org/sample/A')}
0x00007efd3d74fde3: cmp %r11,%r10
0x00007efd3d74fde6: jne 0x7efd3d74fffd ;*synchronization entry
; - org.sample.A::test@-1 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fdec: vmovsd 0xffffff2c(%rip),%xmm0 ; {section_word}
0x00007efd3d74fdf4: vmovq %xmm0,%r13
0x00007efd3d74fdf9: movabs $0x7efd35c53b33,%r10
0x00007efd3d74fe03: callq %r10 ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@2 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe06: movabs $0x7efd35c5349c,%r10
0x00007efd3d74fe10: callq %r10 ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@5 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe13: vcvttsd2si %xmm0,%r11d
0x00007efd3d74fe17: cmp $0x80000000,%r11d
0x00007efd3d74fe1e: jne 0x7efd3d74fe30
0x00007efd3d74fe20: sub $0x8,%rsp
0x00007efd3d74fe24: vmovsd %xmm0,(%rsp)
0x00007efd3d74fe29: callq 0x7efd35ca745b ; {runtime_call StubRoutines (2)}
0x00007efd3d74fe2e: pop %r11
0x00007efd3d74fe30: mov %r11d,0x18(%rbx) ;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@29 (line 67)
0x00007efd3d74fe34: mov $0x1,%ebp
0x00007efd3d74fe39: jmp 0x7efd3d74fe43
0x00007efd3d74fe3b: nopl 0x0(%rax,%rax)
0x00007efd3d74fe40: mov %r11d,%ebp ;*synchronization entry
; - org.sample.A::test@-1 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe43: vmovq %r13,%xmm0
0x00007efd3d74fe48: movabs $0x7efd35c53b33,%r10
0x00007efd3d74fe52: callq %r10 ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@2 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe55: movabs $0x7efd35c5349c,%r10
0x00007efd3d74fe5f: callq %r10 ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@5 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe62: vcvttsd2si %xmm0,%r11d
0x00007efd3d74fe66: cmp $0x80000000,%r11d
0x00007efd3d74fe6d: jne 0x7efd3d74fe7f
0x00007efd3d74fe6f: sub $0x8,%rsp
0x00007efd3d74fe73: vmovsd %xmm0,(%rsp)
0x00007efd3d74fe78: callq 0x7efd35ca745b ; {runtime_call StubRoutines (2)}
0x00007efd3d74fe7d: pop %r11
0x00007efd3d74fe7f: mov %r11d,0x18(%rbx,%rbp,4) ;*synchronization entry
; - org.sample.A::test@-1 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe84: vmovq %r13,%xmm0
0x00007efd3d74fe89: movabs $0x7efd35c53b33,%r10
0x00007efd3d74fe93: callq %r10 ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@2 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fe96: movabs $0x7efd35c5349c,%r10
0x00007efd3d74fea0: callq %r10 ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@5 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fea3: vcvttsd2si %xmm0,%r11d
0x00007efd3d74fea7: cmp $0x80000000,%r11d
0x00007efd3d74feae: jne 0x7efd3d74fec0
0x00007efd3d74feb0: sub $0x8,%rsp
0x00007efd3d74feb4: vmovsd %xmm0,(%rsp)
0x00007efd3d74feb9: callq 0x7efd35ca745b ; {runtime_call StubRoutines (2)}
0x00007efd3d74febe: pop %r11
0x00007efd3d74fec0: mov %r11d,0x1c(%rbx,%rbp,4) ;*synchronization entry
; - org.sample.A::test@-1 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fec5: vmovq %r13,%xmm0
0x00007efd3d74feca: movabs $0x7efd35c53b33,%r10
0x00007efd3d74fed4: callq %r10 ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@2 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fed7: movabs $0x7efd35c5349c,%r10
0x00007efd3d74fee1: callq %r10 ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@5 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74fee4: vcvttsd2si %xmm0,%r11d
0x00007efd3d74fee8: cmp $0x80000000,%r11d
0x00007efd3d74feef: jne 0x7efd3d74ff01
0x00007efd3d74fef1: sub $0x8,%rsp
0x00007efd3d74fef5: vmovsd %xmm0,(%rsp)
0x00007efd3d74fefa: callq 0x7efd35ca745b ; {runtime_call StubRoutines (2)}
0x00007efd3d74feff: pop %r11
0x00007efd3d74ff01: mov %r11d,0x20(%rbx,%rbp,4) ;*synchronization entry
; - org.sample.A::test@-1 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74ff06: vmovq %r13,%xmm0
0x00007efd3d74ff0b: movabs $0x7efd35c53b33,%r10
0x00007efd3d74ff15: callq %r10 ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@2 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74ff18: movabs $0x7efd35c5349c,%r10
0x00007efd3d74ff22: callq %r10 ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@5 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74ff25: vcvttsd2si %xmm0,%r11d
0x00007efd3d74ff29: cmp $0x80000000,%r11d
0x00007efd3d74ff30: jne 0x7efd3d74ff42
0x00007efd3d74ff32: sub $0x8,%rsp
0x00007efd3d74ff36: vmovsd %xmm0,(%rsp)
0x00007efd3d74ff3b: callq 0x7efd35ca745b ; {runtime_call StubRoutines (2)}
0x00007efd3d74ff40: pop %r11
0x00007efd3d74ff42: mov %r11d,0x24(%rbx,%rbp,4) ;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@29 (line 67)
0x00007efd3d74ff47: mov %ebp,%r11d
0x00007efd3d74ff4a: add $0x4,%r11d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@30 (line 66)
0x00007efd3d74ff4e: cmp $0x61,%r11d
0x00007efd3d74ff52: jl 0x7efd3d74fe40 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@13 (line 66)
0x00007efd3d74ff58: cmp $0x64,%r11d
0x00007efd3d74ff5c: jnl 0x7efd3d74ffac
0x00007efd3d74ff5e: add $0x4,%ebp ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@30 (line 66)
0x00007efd3d74ff61: nop ;*synchronization entry
; - org.sample.A::test@-1 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74ff64: vmovq %r13,%xmm0
0x00007efd3d74ff69: movabs $0x7efd35c53b33,%r10
0x00007efd3d74ff73: callq %r10 ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@2 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74ff76: movabs $0x7efd35c5349c,%r10
0x00007efd3d74ff80: callq %r10 ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.A::test@5 (line 49)
; - org.sample.MyBenchmark::testMethod@24 (line 67)
0x00007efd3d74ff83: vcvttsd2si %xmm0,%r10d
0x00007efd3d74ff87: cmp $0x80000000,%r10d
0x00007efd3d74ff8e: jne 0x7efd3d74ffa0
0x00007efd3d74ff90: sub $0x8,%rsp
0x00007efd3d74ff94: vmovsd %xmm0,(%rsp)
0x00007efd3d74ff99: callq 0x7efd35ca745b ; {runtime_call StubRoutines (2)}
0x00007efd3d74ff9e: pop %r10
0x00007efd3d74ffa0: mov %r10d,0x18(%rbx,%rbp,4) ;*iastore {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@29 (line 67)
0x00007efd3d74ffa5: incl %ebp ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@30 (line 66)
0x00007efd3d74ffa7: cmp $0x64,%ebp
0x00007efd3d74ffaa: jl 0x7efd3d74ff64
0x00007efd3d74ffac: mov (%rsp),%rsi
0x00007efd3d74ffb0: test %rsi,%rsi
0x00007efd3d74ffb3: je 0x7efd3d74ffe8 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@13 (line 66)
0x00007efd3d74ffb5: mov %rbx,%rdx
0x00007efd3d74ffb8: nop
0x00007efd3d74ffbb: callq 0x7efd362c50e0 ; ImmutableOopMap{}
;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@38 (line 69)
; {optimized virtual_call}
0x00007efd3d74ffc0: add $0x20,%rsp
0x00007efd3d74ffc4: pop %rbp
0x00007efd3d74ffc5: test %eax,0x18f98035(%rip) ; {poll_return}
0x00007efd3d74ffcb: retq
0x00007efd3d74ffcc: mov $0x64,%edx
0x00007efd3d74ffd1: movabs $0x7efd350d9b38,%rsi ; {metadata({type array int})}
0x00007efd3d74ffdb: callq 0x7efd35d5fd60 ; ImmutableOopMap{rbp=Oop [0]=Oop }
;*newarray {reexecute=0 rethrow=0 return_oop=1}
; - org.sample.MyBenchmark::testMethod@4 (line 65)
; {runtime_call _new_array_Java}
0x00007efd3d74ffe0: mov %rax,%rbx
0x00007efd3d74ffe3: jmpq 0x7efd3d74fdd1
0x00007efd3d74ffe8: mov $0xfffffff6,%esi
0x00007efd3d74ffed: mov %rbx,%rbp
0x00007efd3d74fff0: nop
0x00007efd3d74fff3: callq 0x7efd35c9b560 ; ImmutableOopMap{rbp=Oop }
;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@38 (line 69)
; {runtime_call UncommonTrapBlob}
0x00007efd3d74fff8: callq 0x7efd55167aa0 ; {runtime_call}
0x00007efd3d74fffd: mov $0xffffff86,%esi
0x00007efd3d750002: mov %rbx,0x8(%rsp)
0x00007efd3d750007: callq 0x7efd35c9b560 ; ImmutableOopMap{rbp=Oop [0]=Oop [8]=Oop }
;*aload_3 {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@16 (line 67)
; {runtime_call UncommonTrapBlob}
0x00007efd3d75000c: callq 0x7efd55167aa0 ;*newarray {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@4 (line 65)
; {runtime_call}
0x00007efd3d750011: mov %rax,%rsi
0x00007efd3d750014: jmp 0x7efd3d750019
0x00007efd3d750016: mov %rax,%rsi ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
; - org.sample.MyBenchmark::testMethod@38 (line 69)
0x00007efd3d750019: add $0x20,%rsp
0x00007efd3d75001d: pop %rbp
0x00007efd3d75001e: jmpq 0x7efd35d64160 ; {runtime_call _rethrow_Java}
0x00007efd3d750023: hlt
0x00007efd3d750024: hlt
0x00007efd3d750025: hlt
0x00007efd3d750026: hlt
0x00007efd3d750027: hlt
0x00007efd3d750028: hlt
0x00007efd3d750029: hlt
0x00007efd3d75002a: hlt
0x00007efd3d75002b: hlt
0x00007efd3d75002c: hlt
0x00007efd3d75002d: hlt
0x00007efd3d75002e: hlt
0x00007efd3d75002f: hlt
0x00007efd3d750030: hlt
0x00007efd3d750031: hlt
0x00007efd3d750032: hlt
0x00007efd3d750033: hlt
0x00007efd3d750034: hlt
0x00007efd3d750035: hlt
0x00007efd3d750036: hlt
0x00007efd3d750037: hlt
0x00007efd3d750038: hlt
0x00007efd3d750039: hlt
0x00007efd3d75003a: hlt
0x00007efd3d75003b: hlt
0x00007efd3d75003c: hlt
0x00007efd3d75003d: hlt
0x00007efd3d75003e: hlt
0x00007efd3d75003f: hlt
I was expecting the result of inter.test to be cached, or something similar, so that inter.test (and therefore sin and cos) is called only once. Are there any options I can use to make the JVM (JIT) do so? Or what is preventing the JVM (JIT) from seeing that the method is pure?
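For comparison, the reuse I was expecting can be written out by hand at the source level. This is only a minimal sketch without the JMH harness; the names ManualReuse and computeOnce are hypothetical and not part of the benchmark above:

```java
// Hypothetical stand-alone class: the benchmark body with the reuse done manually.
public class ManualReuse {
    interface MyInterface {
        int test(int i);
    }

    static final class A implements MyInterface {
        public int test(int i) {
            return (int) Math.sin(Math.cos(i));
        }
    }

    // Calls test() exactly once and stores the value in both slots,
    // which is what a common-subexpression elimination would amount to.
    static int[] computeOnce(MyInterface inter) {
        int cached = inter.test(1);
        return new int[] { cached, cached };
    }

    public static void main(String[] args) {
        int[] res = computeOnce(new A());
        System.out.println(res[0] + " " + res[1]);
    }
}
```

This only shows the source-level equivalent of the optimization; it says nothing about whether C2 can be made to perform it automatically.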
ENV:
$ java -version
openjdk version "9-internal"
OpenJDK Runtime Environment (build 9-internal+0-2016-04-14-195246.buildd.src)
OpenJDK 64-Bit Server VM (build 9-internal+0-2016-04-14-195246.buildd.src, mixed mode)
# jmh version
<jmh.version>1.19</jmh.version>
constness applied and type checking, and other nice stuff, so the compiler is well aware of what it is doing. In Java you need to hand-hold the JVM when it comes to performance, but overall I have had a horrible experience with performance-related coding in Java, and yours just goes along the lines of what I have always seen. The only sane approach for me is to accept that Java is what it is: a simple programming language, easy to learn to an intermediate level, and commercially successful. I can't find anything else good about it, and since C++11, C++ always wins even on source elegance. – Vicechairman

Math.sin() and Math.cos() are pure functions (no side effects), so it can CSE them. Of course, if Java supports unmasked FP exceptions, math functions can have side effects. Even 1.0 will raise the FP "Inexact" exception. Still, I'm surprised Java9 doesn't do constant-propagation through sin() and cos(). Maybe run it longer until it re-JITs with more optimization? – Adamant

ns/op? Can you show the entire method? For the past several versions the JDK has used intrinsics for Math.sin and Math.cos, which boil down to an inlined fast path using fsin and fcos and a slower path, depending on whether the argument needs reduction and stuff like that. So your excerpt isn't enough to conclude that the call %r10 line is actually being executed. – Organogenesis

ns/op number is probably not very interesting, for it's highly hardware-specific. I have updated the question to include the complete assembly. Even if fsin is used, it's used twice. I was expecting that test() is called once, and the result is reused for the second call. – Orleanist

fsin performance has been roughly constant on x86 hardware for about a decade. Anyways, as I mentioned, the code excerpt you linked above probably isn't even being executed. It's the "slow path", but if you look around you'll probably find the fast path where it just loads the final result directly from a memory location (since your input is constant). – Organogenesis

call %r10) in one case, but in the other emits fast-path code that calls fsin and fcos inline (whether that's even faster is up for debate). In fact, for some cases where the argument is a constant and less than PI / 4, it doesn't call any trig method at all and just loads the final answer from memory (from the constant pool maintained by the JIT). – Organogenesis

Matcher::strict_fp_requires_explicit_rounding seems to be true, and it goes down that path, which means that values < Math.PI / 4 are handled much differently (inlined, faster) than values > Math.PI / 4. You can verify this by trying, for example, Math.PI / 4 - 0.01 and Math.PI / 4 + 0.01. For me the difference is 5x. – Organogenesis

fsin output using oracle jdk 8 and 9. The original output is obtained from openjdk-9, as shown in the ENV section. fsin is surely the fast path of sin, but I am more curious about why the JIT doesn't cache the result of the test() method call so that only a single instance of sin or fsin appears in the assembly. – Orleanist
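The Math.PI / 4 comparison suggested in the comments can be tried with a rough timing loop. This is only a sketch and not a JMH benchmark: the class name TrigThresholdCheck, the helper sumSin, the iteration count, and the tiny per-iteration perturbation (added so the argument is not a compile-time constant) are all assumptions of this example, and the measured ratio will depend on the JDK build and hardware.

```java
// Hypothetical, crude timing harness; use JMH for anything beyond a quick check.
public class TrigThresholdCheck {
    // Sums Math.sin over n slightly different arguments near x,
    // returning the sum so the work cannot be discarded as dead code.
    static double sumSin(double x, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++) {
            acc += Math.sin(x + i * 1e-12);
        }
        return acc;
    }

    // Times one run of sumSin in nanoseconds after a warm-up phase.
    static long timeNanos(double x, int n) {
        double sink = 0.0;
        for (int w = 0; w < 5; w++) {
            sink += sumSin(x, n); // warm-up so the JIT has compiled the loop
        }
        long start = System.nanoTime();
        sink += sumSin(x, n);
        long elapsed = System.nanoTime() - start;
        if (Double.isNaN(sink)) {
            System.out.println("unexpected NaN"); // keeps sink observable
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        long below = timeNanos(Math.PI / 4 - 0.01, n);
        long above = timeNanos(Math.PI / 4 + 0.01, n);
        System.out.println("below PI/4: " + below + " ns");
        System.out.println("above PI/4: " + above + " ns");
    }
}
```

With n = 10,000,000 the perturbation stays below 1e-5, so every argument remains on its side of Math.PI / 4.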