JVM JIT method recalculate for pure methods

Benchmarking the following Java code using jmh:

interface MyInterface {
    public int test(int i);
}

class A implements MyInterface {
    public int test(int i) {
        return (int)Math.sin(Math.cos(i));
    }
}

@State(Scope.Thread)
public class MyBenchmark {
    public MyInterface inter;

    @Setup(Level.Trial)
    public void init() {
        inter = new A();
    }

    @Benchmark
    public void testMethod(Blackhole sink) {
        int[] res = new int[2];
        res[0] = inter.test(1);
        res[1] = inter.test(1);
        sink.consume(res);
    }
}

Using mvn package && java -XX:-UseCompressedOops -XX:CompileCommand='print, *.testMethod' -jar target/benchmarks.jar -wi 10 -i 1 -f 1, I was able to get the assembly, and if we focus on the one from C2 (as shown below), we can see that both cos and sin are called twice.

ImmutableOopMap{}pc offsets: 796 812 828 Compiled method (c2)     402  563       4       org.sample.MyBenchmark::testMethod (42 bytes)
 total in heap  [0x00007efd3d74fb90,0x00007efd3d7503a0] = 2064
 relocation     [0x00007efd3d74fcd0,0x00007efd3d74fd08] = 56
 constants      [0x00007efd3d74fd20,0x00007efd3d74fd40] = 32
 main code      [0x00007efd3d74fd40,0x00007efd3d750040] = 768
 stub code      [0x00007efd3d750040,0x00007efd3d750068] = 40
 oops           [0x00007efd3d750068,0x00007efd3d750070] = 8
 metadata       [0x00007efd3d750070,0x00007efd3d750080] = 16
 scopes data    [0x00007efd3d750080,0x00007efd3d750108] = 136
 scopes pcs     [0x00007efd3d750108,0x00007efd3d750358] = 592
 dependencies   [0x00007efd3d750358,0x00007efd3d750360] = 8
 handler table  [0x00007efd3d750360,0x00007efd3d750390] = 48
 nul chk table  [0x00007efd3d750390,0x00007efd3d7503a0] = 16
----------------------------------------------------------------------
org/sample/MyBenchmark.testMethod(Lorg/openjdk/jmh/infra/Blackhole;)V  [0x00007efd3d74fd40, 0x00007efd3d750068]  808 bytes
[Constants]
  0x00007efd3d74fd20 (offset:    0): 0x00000000   0x3ff0000000000000
  0x00007efd3d74fd24 (offset:    4): 0x3ff00000
  0x00007efd3d74fd28 (offset:    8): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007efd3d74fd2c (offset:   12): 0xf4f4f4f4
  0x00007efd3d74fd30 (offset:   16): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007efd3d74fd34 (offset:   20): 0xf4f4f4f4
  0x00007efd3d74fd38 (offset:   24): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007efd3d74fd3c (offset:   28): 0xf4f4f4f4
Argument 0 is unknown.RIP: 0x7efd3d74fd40 Code size: 0x00000328
[Entry Point]
  # {method} {0x00007efd35857f08} 'testMethod' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'org/sample/MyBenchmark'
  # this:     rsi:rsi   = 'org/sample/MyBenchmark'
  # parm0:    rdx:rdx   = 'org/openjdk/jmh/infra/Blackhole'
  #           [sp+0x30]  (sp of caller)
  0x00007efd3d74fd40: cmp     0x8(%rsi),%rax    ;   {no_reloc}
  0x00007efd3d74fd44: jne     0x7efd35c99c60    ;   {runtime_call ic_miss_stub}
  0x00007efd3d74fd4a: nop
  0x00007efd3d74fd4c: nopl    0x0(%rax)
[Verified Entry Point]
  0x00007efd3d74fd50: mov     %eax,0xfffffffffffec000(%rsp)
  0x00007efd3d74fd57: push    %rbp
  0x00007efd3d74fd58: sub     $0x20,%rsp        ;*synchronization entry
                                                ; - org.sample.MyBenchmark::testMethod@-1 (line 64)

  0x00007efd3d74fd5c: mov     %rdx,(%rsp)
  0x00007efd3d74fd60: mov     %rsi,%rbp
  0x00007efd3d74fd63: mov     0x60(%r15),%rbx
  0x00007efd3d74fd67: mov     %rbx,%r10
  0x00007efd3d74fd6a: add     $0x1a8,%r10
  0x00007efd3d74fd71: cmp     0x70(%r15),%r10
  0x00007efd3d74fd75: jnb     0x7efd3d74ffcc
  0x00007efd3d74fd7b: mov     %r10,0x60(%r15)
  0x00007efd3d74fd7f: prefetchnta 0xc0(%r10)
  0x00007efd3d74fd87: movq    $0x1,(%rbx)
  0x00007efd3d74fd8e: prefetchnta 0x100(%r10)
  0x00007efd3d74fd96: mov     %rbx,%rdi
  0x00007efd3d74fd99: add     $0x18,%rdi
  0x00007efd3d74fd9d: prefetchnta 0x140(%r10)
  0x00007efd3d74fda5: prefetchnta 0x180(%r10)
  0x00007efd3d74fdad: movabs  $0x7efd350d9b38,%r10  ;   {metadata({type array int})}
  0x00007efd3d74fdb7: mov     %r10,0x8(%rbx)
  0x00007efd3d74fdbb: movl    $0x64,0x10(%rbx)
  0x00007efd3d74fdc2: mov     $0x32,%ecx
  0x00007efd3d74fdc7: xor     %rax,%rax
  0x00007efd3d74fdca: shl     $0x3,%rcx
  0x00007efd3d74fdce: rep stosb (%rdi)          ;*newarray {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@4 (line 65)

  0x00007efd3d74fdd1: mov     0x10(%rbp),%r10   ;*getfield inter {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@20 (line 67)

  0x00007efd3d74fdd5: mov     0x8(%r10),%r10    ; implicit exception: dispatches to 0x00007efd3d74fffd
  0x00007efd3d74fdd9: movabs  $0x7efd3587f8c8,%r11  ;   {metadata('org/sample/A')}
  0x00007efd3d74fde3: cmp     %r11,%r10
  0x00007efd3d74fde6: jne     0x7efd3d74fffd    ;*synchronization entry
                                                ; - org.sample.A::test@-1 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fdec: vmovsd  0xffffff2c(%rip),%xmm0  ;   {section_word}
  0x00007efd3d74fdf4: vmovq   %xmm0,%r13
  0x00007efd3d74fdf9: movabs  $0x7efd35c53b33,%r10
  0x00007efd3d74fe03: callq   %r10              ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@2 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe06: movabs  $0x7efd35c5349c,%r10
  0x00007efd3d74fe10: callq   %r10              ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@5 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe13: vcvttsd2si %xmm0,%r11d
  0x00007efd3d74fe17: cmp     $0x80000000,%r11d
  0x00007efd3d74fe1e: jne     0x7efd3d74fe30
  0x00007efd3d74fe20: sub     $0x8,%rsp
  0x00007efd3d74fe24: vmovsd  %xmm0,(%rsp)
  0x00007efd3d74fe29: callq   0x7efd35ca745b    ;   {runtime_call StubRoutines (2)}
  0x00007efd3d74fe2e: pop     %r11
  0x00007efd3d74fe30: mov     %r11d,0x18(%rbx)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@29 (line 67)

  0x00007efd3d74fe34: mov     $0x1,%ebp
  0x00007efd3d74fe39: jmp     0x7efd3d74fe43
  0x00007efd3d74fe3b: nopl    0x0(%rax,%rax)
  0x00007efd3d74fe40: mov     %r11d,%ebp        ;*synchronization entry
                                                ; - org.sample.A::test@-1 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe43: vmovq   %r13,%xmm0
  0x00007efd3d74fe48: movabs  $0x7efd35c53b33,%r10
  0x00007efd3d74fe52: callq   %r10              ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@2 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe55: movabs  $0x7efd35c5349c,%r10
  0x00007efd3d74fe5f: callq   %r10              ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@5 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe62: vcvttsd2si %xmm0,%r11d
  0x00007efd3d74fe66: cmp     $0x80000000,%r11d
  0x00007efd3d74fe6d: jne     0x7efd3d74fe7f
  0x00007efd3d74fe6f: sub     $0x8,%rsp
  0x00007efd3d74fe73: vmovsd  %xmm0,(%rsp)
  0x00007efd3d74fe78: callq   0x7efd35ca745b    ;   {runtime_call StubRoutines (2)}
  0x00007efd3d74fe7d: pop     %r11
  0x00007efd3d74fe7f: mov     %r11d,0x18(%rbx,%rbp,4)  ;*synchronization entry
                                                ; - org.sample.A::test@-1 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe84: vmovq   %r13,%xmm0
  0x00007efd3d74fe89: movabs  $0x7efd35c53b33,%r10
  0x00007efd3d74fe93: callq   %r10              ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@2 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fe96: movabs  $0x7efd35c5349c,%r10
  0x00007efd3d74fea0: callq   %r10              ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@5 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fea3: vcvttsd2si %xmm0,%r11d
  0x00007efd3d74fea7: cmp     $0x80000000,%r11d
  0x00007efd3d74feae: jne     0x7efd3d74fec0
  0x00007efd3d74feb0: sub     $0x8,%rsp
  0x00007efd3d74feb4: vmovsd  %xmm0,(%rsp)
  0x00007efd3d74feb9: callq   0x7efd35ca745b    ;   {runtime_call StubRoutines (2)}
  0x00007efd3d74febe: pop     %r11
  0x00007efd3d74fec0: mov     %r11d,0x1c(%rbx,%rbp,4)  ;*synchronization entry
                                                ; - org.sample.A::test@-1 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fec5: vmovq   %r13,%xmm0
  0x00007efd3d74feca: movabs  $0x7efd35c53b33,%r10
  0x00007efd3d74fed4: callq   %r10              ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@2 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fed7: movabs  $0x7efd35c5349c,%r10
  0x00007efd3d74fee1: callq   %r10              ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@5 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74fee4: vcvttsd2si %xmm0,%r11d
  0x00007efd3d74fee8: cmp     $0x80000000,%r11d
  0x00007efd3d74feef: jne     0x7efd3d74ff01
  0x00007efd3d74fef1: sub     $0x8,%rsp
  0x00007efd3d74fef5: vmovsd  %xmm0,(%rsp)
  0x00007efd3d74fefa: callq   0x7efd35ca745b    ;   {runtime_call StubRoutines (2)}
  0x00007efd3d74feff: pop     %r11
  0x00007efd3d74ff01: mov     %r11d,0x20(%rbx,%rbp,4)  ;*synchronization entry
                                                ; - org.sample.A::test@-1 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74ff06: vmovq   %r13,%xmm0
  0x00007efd3d74ff0b: movabs  $0x7efd35c53b33,%r10
  0x00007efd3d74ff15: callq   %r10              ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@2 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74ff18: movabs  $0x7efd35c5349c,%r10
  0x00007efd3d74ff22: callq   %r10              ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@5 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74ff25: vcvttsd2si %xmm0,%r11d
  0x00007efd3d74ff29: cmp     $0x80000000,%r11d
  0x00007efd3d74ff30: jne     0x7efd3d74ff42
  0x00007efd3d74ff32: sub     $0x8,%rsp
  0x00007efd3d74ff36: vmovsd  %xmm0,(%rsp)
  0x00007efd3d74ff3b: callq   0x7efd35ca745b    ;   {runtime_call StubRoutines (2)}
  0x00007efd3d74ff40: pop     %r11
  0x00007efd3d74ff42: mov     %r11d,0x24(%rbx,%rbp,4)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@29 (line 67)

  0x00007efd3d74ff47: mov     %ebp,%r11d
  0x00007efd3d74ff4a: add     $0x4,%r11d        ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@30 (line 66)

  0x00007efd3d74ff4e: cmp     $0x61,%r11d
  0x00007efd3d74ff52: jl      0x7efd3d74fe40    ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@13 (line 66)

  0x00007efd3d74ff58: cmp     $0x64,%r11d
  0x00007efd3d74ff5c: jnl     0x7efd3d74ffac
  0x00007efd3d74ff5e: add     $0x4,%ebp         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@30 (line 66)

  0x00007efd3d74ff61: nop                       ;*synchronization entry
                                                ; - org.sample.A::test@-1 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74ff64: vmovq   %r13,%xmm0
  0x00007efd3d74ff69: movabs  $0x7efd35c53b33,%r10
  0x00007efd3d74ff73: callq   %r10              ;*invokestatic cos {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@2 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74ff76: movabs  $0x7efd35c5349c,%r10
  0x00007efd3d74ff80: callq   %r10              ;*invokestatic sin {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.A::test@5 (line 49)
                                                ; - org.sample.MyBenchmark::testMethod@24 (line 67)

  0x00007efd3d74ff83: vcvttsd2si %xmm0,%r10d
  0x00007efd3d74ff87: cmp     $0x80000000,%r10d
  0x00007efd3d74ff8e: jne     0x7efd3d74ffa0
  0x00007efd3d74ff90: sub     $0x8,%rsp
  0x00007efd3d74ff94: vmovsd  %xmm0,(%rsp)
  0x00007efd3d74ff99: callq   0x7efd35ca745b    ;   {runtime_call StubRoutines (2)}
  0x00007efd3d74ff9e: pop     %r10
  0x00007efd3d74ffa0: mov     %r10d,0x18(%rbx,%rbp,4)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@29 (line 67)

  0x00007efd3d74ffa5: incl    %ebp              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@30 (line 66)

  0x00007efd3d74ffa7: cmp     $0x64,%ebp
  0x00007efd3d74ffaa: jl      0x7efd3d74ff64
  0x00007efd3d74ffac: mov     (%rsp),%rsi
  0x00007efd3d74ffb0: test    %rsi,%rsi
  0x00007efd3d74ffb3: je      0x7efd3d74ffe8    ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@13 (line 66)

  0x00007efd3d74ffb5: mov     %rbx,%rdx
  0x00007efd3d74ffb8: nop
  0x00007efd3d74ffbb: callq   0x7efd362c50e0    ; ImmutableOopMap{}
                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@38 (line 69)
                                                ;   {optimized virtual_call}
  0x00007efd3d74ffc0: add     $0x20,%rsp
  0x00007efd3d74ffc4: pop     %rbp
  0x00007efd3d74ffc5: test    %eax,0x18f98035(%rip)  ;   {poll_return}
  0x00007efd3d74ffcb: retq
  0x00007efd3d74ffcc: mov     $0x64,%edx
  0x00007efd3d74ffd1: movabs  $0x7efd350d9b38,%rsi  ;   {metadata({type array int})}
  0x00007efd3d74ffdb: callq   0x7efd35d5fd60    ; ImmutableOopMap{rbp=Oop [0]=Oop }
                                                ;*newarray {reexecute=0 rethrow=0 return_oop=1}
                                                ; - org.sample.MyBenchmark::testMethod@4 (line 65)
                                                ;   {runtime_call _new_array_Java}
  0x00007efd3d74ffe0: mov     %rax,%rbx
  0x00007efd3d74ffe3: jmpq    0x7efd3d74fdd1
  0x00007efd3d74ffe8: mov     $0xfffffff6,%esi
  0x00007efd3d74ffed: mov     %rbx,%rbp
  0x00007efd3d74fff0: nop
  0x00007efd3d74fff3: callq   0x7efd35c9b560    ; ImmutableOopMap{rbp=Oop }
                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@38 (line 69)
                                                ;   {runtime_call UncommonTrapBlob}
  0x00007efd3d74fff8: callq   0x7efd55167aa0    ;   {runtime_call}
  0x00007efd3d74fffd: mov     $0xffffff86,%esi
  0x00007efd3d750002: mov     %rbx,0x8(%rsp)
  0x00007efd3d750007: callq   0x7efd35c9b560    ; ImmutableOopMap{rbp=Oop [0]=Oop [8]=Oop }
                                                ;*aload_3 {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@16 (line 67)
                                                ;   {runtime_call UncommonTrapBlob}
  0x00007efd3d75000c: callq   0x7efd55167aa0    ;*newarray {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@4 (line 65)
                                                ;   {runtime_call}
  0x00007efd3d750011: mov     %rax,%rsi
  0x00007efd3d750014: jmp     0x7efd3d750019
  0x00007efd3d750016: mov     %rax,%rsi         ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.sample.MyBenchmark::testMethod@38 (line 69)

  0x00007efd3d750019: add     $0x20,%rsp
  0x00007efd3d75001d: pop     %rbp
  0x00007efd3d75001e: jmpq    0x7efd35d64160    ;   {runtime_call _rethrow_Java}
  0x00007efd3d750023: hlt
  0x00007efd3d750024: hlt
  0x00007efd3d750025: hlt
  0x00007efd3d750026: hlt
  0x00007efd3d750027: hlt
  0x00007efd3d750028: hlt
  0x00007efd3d750029: hlt
  0x00007efd3d75002a: hlt
  0x00007efd3d75002b: hlt
  0x00007efd3d75002c: hlt
  0x00007efd3d75002d: hlt
  0x00007efd3d75002e: hlt
  0x00007efd3d75002f: hlt
  0x00007efd3d750030: hlt
  0x00007efd3d750031: hlt
  0x00007efd3d750032: hlt
  0x00007efd3d750033: hlt
  0x00007efd3d750034: hlt
  0x00007efd3d750035: hlt
  0x00007efd3d750036: hlt
  0x00007efd3d750037: hlt
  0x00007efd3d750038: hlt
  0x00007efd3d750039: hlt
  0x00007efd3d75003a: hlt
  0x00007efd3d75003b: hlt
  0x00007efd3d75003c: hlt
  0x00007efd3d75003d: hlt
  0x00007efd3d75003e: hlt
  0x00007efd3d75003f: hlt

I was expecting the result of inter.test to be cached, or something similar, so that inter.test (and therefore sin and cos) is called only once. Are there any options I can use to make the JVM (JIT) do so? Or what is preventing the JVM (JIT) from seeing that the method is pure?

ENV:

$ java -version
openjdk version "9-internal"
OpenJDK Runtime Environment (build 9-internal+0-2016-04-14-195246.buildd.src)
OpenJDK 64-Bit Server VM (build 9-internal+0-2016-04-14-195246.buildd.src, mixed mode)
# jmh version
<jmh.version>1.19</jmh.version>
Orleanist answered 31/12, 2017 at 18:2 Comment(14)
For me it seems to just completely inline those calls. Which versions of Java and JMH are you using? - Provocation
With Java 9 I see the same code. I don't believe the call can be omitted if it's not inlined, since it could have a side effect. - Provocation
You can rewrite it in C++ with constness applied and type checking, and other nice stuff, so the compiler is well aware of what it is doing. In Java you need to hand-hold the JVM when it comes to performance, but overall I have had horrible experiences with performance-related coding in Java; yours just goes along the lines of what I have always seen. The only sane approach for me is to accept that Java is what it is: a simple programming language, easy to learn to an intermediate level, and commercially successful. I can't find anything else good about it, and since C++11, C++ always wins even on source elegance. - Vicechairman
@JornVernee: I think the OP is hoping the JVM knows that Math.sin() and Math.cos() are pure functions (no side effects), so it can CSE them. Of course, if Java supports unmasked FP exceptions, math functions can have side effects. Even 1.0 will raise the FP "Inexact" exception. Still, I'm surprised Java 9 doesn't do constant propagation through sin() and cos(). Maybe run it longer until it re-JITs with more optimization? - Adamant
What performance did you get in ns/op? Can you show the entire method? For the past several versions the JDK has used intrinsics for Math.sin and Math.cos, which boil down to an inlined fast path using fsin and fcos and a slower path, depending on whether the argument needs reduction and things like that. So your excerpt isn't enough to conclude that the call %r10 line is actually being executed. - Organogenesis
@Vicechairman Rewriting in C++ would get the assembly (performance) I expected, but my constraint is to use pure Java. - Orleanist
@PeterCordes It's already compiled by C2 (the final tier), and the method is completely deterministic. I don't know what could change to cause a re-JIT. - Orleanist
@Organogenesis The actual ns/op number is probably not very interesting, since it's highly hardware-specific. I have updated the question to include the complete assembly. Even if fsin is used, it's used twice. I was expecting that test() is called once, and the result is reused for the second call. - Orleanist
@AlbertNetymk - it's interesting since it tells you whether you are going down the fast or the slow path. It's not that hardware-dependent at all unless you have really weird hardware. For example, fsin performance has been roughly constant on x86 hardware for about a decade. Anyway, as I mentioned, the code excerpt you linked above probably isn't even being executed. It's the "slow path", but if you look around you'll probably find the fast path where it just loads the final result directly from a memory location (since your input is constant). - Organogenesis
The code doesn't compile the same for me in Java 8 - it has up-front tests to check the range of the arguments, and then it calls into the slow path (call %r10) in one case, but in the other emits fast-path code that calls fsin and fcos inline (whether that's even faster is up for debate). In fact, for some cases where the argument is a constant and less than PI / 4, it doesn't call any trig method at all and just loads the final answer from memory (from the constant pool maintained by the JIT). - Organogenesis
You can see the JIT code generation logic for these functions here. In particular, on x86 the Matcher::strict_fp_requires_explicit_rounding seems to be true, and it goes down that path, which means that values < Math.PI / 4 are handled much differently (inlined, faster) than values > Math.PI / 4. You can verify this by trying for example Math.PI / 4 - 0.01 and Math.PI / 4 + 0.01. For me the difference is 5x. - Organogenesis
@Organogenesis Thanks for the elaboration. I could reproduce the fsin output using Oracle JDK 8 and 9. The original output was obtained from openjdk-9, as shown in the ENV section. fsin is surely the fast path of sin, but I am more curious about why the JIT doesn't cache the result of the test() method call so that only a single instance of sin or fsin appears in the assembly. - Orleanist
Well, it wouldn't "cache" it per se, but the optimizing compiler could eliminate the common sub-expression. My finding is that in JDK 8 and early versions of JDK 9 it does, but only for the "fast path", not for the slow path. I'm writing a fuller answer. - Organogenesis
"but my constraint is to use pure Java" - sure, that's certainly a valid approach. I'm just warning you not to spend too much effort on performance, since basically you are tuning an ordinary family car and wondering why it doesn't perform like a race car... well, it never will; it has been built that way from the start. The last time I had some partially good reasons to care about the performance of my Java source, after finishing that task I regretted not switching to a native C++ library at the very beginning. Counting the dev-hours spent afterwards, it looked to even out, but the output would still have been 10x. - Vicechairman

As far as I know, HotSpot cannot optimize redundant calls to pure methods (i.e., calls to pure methods with identical arguments), except indirectly via inlining.

That is, if redundant calls to a pure method are all inlined at the call site, the redundancy is detected indirectly in the inlined code by usual optimizations such as CSE and GVN, and in this way the cost of the extra calls usually disappears. If the methods are not inlined, however, I don't think the JVM flags them as "pure" and thus cannot eliminate them (unlike, for example, many native compilers which can).
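
If you need the second call to disappear regardless of what the JIT decides, the portable workaround is to remove the redundancy by hand. Below is a minimal sketch of that idea; the class name ManualCse, the CALLS counter and the method names are hypothetical and exist only to make the number of evaluations visible. The body of test mirrors A.test from the question.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ManualCse {
    // Hypothetical counter: records how often the pure method really runs.
    public static final AtomicInteger CALLS = new AtomicInteger();

    // Same computation as A.test in the question, plus the counter.
    public static int test(int i) {
        CALLS.incrementAndGet();
        return (int) Math.sin(Math.cos(i));
    }

    // The shape used in the question: two calls with the same argument.
    public static int[] redundant() {
        int[] res = new int[2];
        res[0] = test(1);
        res[1] = test(1);
        return res;
    }

    // Hand-applied common subexpression elimination: evaluate once, reuse.
    public static int[] hoisted() {
        int v = test(1);
        return new int[] { v, v };
    }

    public static void main(String[] args) {
        CALLS.set(0);
        redundant();
        System.out.println("redundant: " + CALLS.get() + " evaluations");
        CALLS.set(0);
        hoisted();
        System.out.println("hoisted:   " + CALLS.get() + " evaluations");
    }
}
```

Both methods return the same array contents; only the number of evaluations differs. Whether the hoisted form is measurably faster in a given benchmark still depends on which path the intrinsic takes, as discussed below.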

Still, given that inlining can remove redundant calls, the question remains: why aren't the redundant Math.sin and Math.cos calls inlined and eventually optimized away?

As it turns out, Math.sin and Math.cos, like several other methods in Math and elsewhere in the JDK, are handled specially as intrinsic functions. Below you'll find a detailed look at what happens in Java 8 and some versions of Java 9. The disassembly you showed is from a later version of Java 9, which handles this differently; that is covered at the end.

The way the trig methods are handled in the JVM is ... complicated. In principle, Math.sin and Math.cos are inlined as intrinsic methods using native FP instructions on x86, but there are caveats.

There are a lot of extraneous factors in your benchmark that make it harder to analyze, such as the array allocation, the call to Blackhole.consume, the use of both Math.sin and Math.cos, passing a constant (which can cause some trig instructions to be optimized away completely), the use of an interface (MyInterface) and an implementation of that interface (A), etc.

Instead, let's strip that cruft out and reduce it to a much simpler version, that just calls Math.sin(x) three times with an identical argument, and returns the sum:

private double i = Math.PI / 4 - 0.01;

@Benchmark
public double testMethod() {
    double res0 = Math.sin(i);
    double res1 = Math.sin(i);
    double res2 = Math.sin(i);
    return res0 + res1 + res2;
}

Running this with JMH args -bm avgt -tu ns -wi 5 -f 1 -i 5 I get about 40 ns/op, which is at the lower end of the range for a single fsin call on modern x86 hardware. Let's take a peek at the assembly:

[Constants]
  0x00007ff2e4dbbd20 (offset:    0): 0x54442d18   0x3fe921fb54442d18
  0x00007ff2e4dbbd24 (offset:    4): 0x3fe921fb
  0x00007ff2e4dbbd28 (offset:    8): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007ff2e4dbbd2c (offset:   12): 0xf4f4f4f4
  0x00007ff2e4dbbd30 (offset:   16): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007ff2e4dbbd34 (offset:   20): 0xf4f4f4f4
  0x00007ff2e4dbbd38 (offset:   24): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007ff2e4dbbd3c (offset:   28): 0xf4f4f4f4
  (snip)
[Verified Entry Point]
  0x00007ff2e4dbbd50: sub     $0x28,%rsp
  0x00007ff2e4dbbd57: mov     %rbp,0x20(%rsp)   ;*synchronization entry
                                                ; - stackoverflow.TrigBench::testMethod@-1 (line 38)

  0x00007ff2e4dbbd5c: vmovsd  0x10(%rsi),%xmm2  ;*getfield i
                                                ; - stackoverflow.TrigBench::testMethod@1 (line 38)

  0x00007ff2e4dbbd61: vmovapd %xmm2,%xmm1
  0x00007ff2e4dbbd65: sub     $0x8,%rsp
  0x00007ff2e4dbbd69: vmovsd  %xmm1,(%rsp)
  0x00007ff2e4dbbd6e: fldl    (%rsp)
  0x00007ff2e4dbbd71: fsin
  0x00007ff2e4dbbd73: fstpl   (%rsp)
  0x00007ff2e4dbbd76: vmovsd  (%rsp),%xmm1
  0x00007ff2e4dbbd7b: add     $0x8,%rsp         ;*invokestatic sin
                                                ; - stackoverflow.TrigBench::testMethod@20 (line 40)

  0x00007ff2e4dbbd7f: vmovsd  0xffffff99(%rip),%xmm3  ;   {section_word}
  0x00007ff2e4dbbd87: vandpd  0xffe68411(%rip),%xmm2,%xmm0
                                                ;   {external_word}
  0x00007ff2e4dbbd8f: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbd93: jnb     0x7ff2e4dbbe4c
  0x00007ff2e4dbbd99: vmovq   %xmm3,%r13
  0x00007ff2e4dbbd9e: vmovq   %xmm1,%rbp
  0x00007ff2e4dbbda3: vmovq   %xmm2,%rbx
  0x00007ff2e4dbbda8: vmovapd %xmm2,%xmm0
  0x00007ff2e4dbbdac: movabs  $0x7ff2f9abaeec,%r10
  0x00007ff2e4dbbdb6: callq   %r10
  0x00007ff2e4dbbdb9: vmovq   %xmm0,%r14
  0x00007ff2e4dbbdbe: vmovq   %rbx,%xmm2
  0x00007ff2e4dbbdc3: vmovq   %rbp,%xmm1
  0x00007ff2e4dbbdc8: vmovq   %r13,%xmm3
  0x00007ff2e4dbbdcd: vandpd  0xffe683cb(%rip),%xmm2,%xmm0
                                                ;*invokestatic sin
                                                ; - stackoverflow.TrigBench::testMethod@4 (line 38)
                                                ;   {external_word}
  0x00007ff2e4dbbdd5: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbdd9: jnb     0x7ff2e4dbbe56
  0x00007ff2e4dbbddb: vmovq   %xmm3,%r13
  0x00007ff2e4dbbde0: vmovq   %xmm1,%rbp
  0x00007ff2e4dbbde5: vmovq   %xmm2,%rbx
  0x00007ff2e4dbbdea: vmovapd %xmm2,%xmm0
  0x00007ff2e4dbbdee: movabs  $0x7ff2f9abaeec,%r10
  0x00007ff2e4dbbdf8: callq   %r10
  0x00007ff2e4dbbdfb: vmovsd  %xmm0,(%rsp)
  0x00007ff2e4dbbe00: vmovq   %rbx,%xmm2
  0x00007ff2e4dbbe05: vmovq   %rbp,%xmm1
  0x00007ff2e4dbbe0a: vmovq   %r13,%xmm3        ;*invokestatic sin
                                                ; - stackoverflow.TrigBench::testMethod@12 (line 39)

  0x00007ff2e4dbbe0f: vandpd  0xffe68389(%rip),%xmm2,%xmm0
                                                ;*invokestatic sin
                                                ; - stackoverflow.TrigBench::testMethod@4 (line 38)
                                                ;   {external_word}
  0x00007ff2e4dbbe17: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbe1b: jnb     0x7ff2e4dbbe32
  0x00007ff2e4dbbe1d: vmovapd %xmm2,%xmm0
  0x00007ff2e4dbbe21: movabs  $0x7ff2f9abaeec,%r10
  0x00007ff2e4dbbe2b: callq   %r10
  0x00007ff2e4dbbe2e: vmovapd %xmm0,%xmm1       ;*invokestatic sin
                                                ; - stackoverflow.TrigBench::testMethod@20 (line 40)

  0x00007ff2e4dbbe32: vmovq   %r14,%xmm0
  0x00007ff2e4dbbe37: vaddsd  (%rsp),%xmm0,%xmm0
  0x00007ff2e4dbbe3c: vaddsd  %xmm0,%xmm1,%xmm0  ;*dadd
                                                ; - stackoverflow.TrigBench::testMethod@30 (line 41)

  0x00007ff2e4dbbe40: add     $0x20,%rsp
  0x00007ff2e4dbbe44: pop     %rbp
  0x00007ff2e4dbbe45: test    %eax,0x15f461b5(%rip)  ;   {poll_return}
  0x00007ff2e4dbbe4b: retq
  0x00007ff2e4dbbe4c: vmovq   %xmm1,%r14
  0x00007ff2e4dbbe51: jmpq    0x7ff2e4dbbdcd
  0x00007ff2e4dbbe56: vmovsd  %xmm1,(%rsp)
  0x00007ff2e4dbbe5b: jmp     0x7ff2e4dbbe0f

Right up front, we see that the generated code loads the field i onto the x87 FP stack [1] and uses an fsin instruction to calculate Math.sin(i).


The next part is also interesting:

  0x00007ff2e4dbbd7f: vmovsd  0xffffff99(%rip),%xmm3  ;   {section_word}
  0x00007ff2e4dbbd87: vandpd  0xffe68411(%rip),%xmm2,%xmm0
                                                ;   {external_word}
  0x00007ff2e4dbbd8f: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbd93: jnb     0x7ff2e4dbbe4c

The first instruction loads the constant 0x3fe921fb54442d18, which is 0.785398..., also known as pi / 4. The second applies vandpd to the value i and some other constant (presumably a sign-bit mask, which would yield the absolute value of i). We then compare pi / 4 with the result of the vandpd and jump somewhere if the latter is less than or equal to the former.
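
The bit pattern is easy to check from Java itself. The following sketch decodes the constant from the [Constants] section and models the AND-then-compare sequence. Note that the mask 0x7fffffffffffffff is an assumption: the dump only shows an external_word reference for the second constant, and a sign-clearing mask is simply the most plausible reading. The class and method names are hypothetical.

```java
public class TrigConstants {
    // Reinterpret a raw 64-bit pattern from the [Constants] section as a double.
    public static double decode(long bits) {
        return Double.longBitsToDouble(bits);
    }

    // Assumed meaning of the AND: clear the sign bit, i.e. compute |x|.
    public static double absViaMask(double x) {
        long cleared = Double.doubleToRawLongBits(x) & 0x7fffffffffffffffL;
        return Double.longBitsToDouble(cleared);
    }

    // Models the compare-and-jump: the jump is taken when pi/4 >= |x|.
    public static boolean jumpTaken(double x) {
        return decode(0x3fe921fb54442d18L) >= absViaMask(x);
    }

    public static void main(String[] args) {
        System.out.println(decode(0x3fe921fb54442d18L) == Math.PI / 4);
        System.out.println(jumpTaken(Math.PI / 4 - 0.01));
        System.out.println(jumpTaken(Math.PI / 4 + 0.01));
    }
}
```

The same decoding applied to the constant in the question's dump, 0x3ff0000000000000, gives 1.0, which is the argument passed to inter.test.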

Huh? If you follow the jump, there is a series of (redundant) vandpd and vucomisd instructions against the same values (and using the same constant for the vandpd), which fairly quickly leads to this sequence:

  0x00007ff2e4dbbe32: vmovq   %r14,%xmm0
  0x00007ff2e4dbbe37: vaddsd  (%rsp),%xmm0,%xmm0
  0x00007ff2e4dbbe3c: vaddsd  %xmm0,%xmm1,%xmm0  ;*dadd
  ...
  0x00007ff2e4dbbe4b: retq

That simply triples the value returned from the fsin call (which has been stashed away in r14 and [rsp] during the various jumps) and returns.

So we see here that the two redundant calls to Math.sin(i) have been eliminated when the jumps are taken, although the resulting code still explicitly adds all three values together as if they were distinct, and executes a bunch of redundant AND and compare instructions.

If we don't take the jump, we get the same callq %r10 behavior you show in your disassembly.

What's going on here?


We will find enlightenment if we dig into the inline_trig call in library_call.cpp in the HotSpot JVM source. Near the start of this method, we see this (some code omitted for brevity):

  // Rounding required?  Check for argument reduction!
  if (Matcher::strict_fp_requires_explicit_rounding) {
    // (snip)

    // Pseudocode for sin:
    // if (x <= Math.PI / 4.0) {
    //   if (x >= -Math.PI / 4.0) return  fsin(x);
    //   if (x >= -Math.PI / 2.0) return -fcos(x + Math.PI / 2.0);
    // } else {
    //   if (x <=  Math.PI / 2.0) return  fcos(x - Math.PI / 2.0);
    // }
    // return StrictMath.sin(x);

    // (snip)

    // Actually, sticking in an 80-bit Intel value into C2 will be tough; it
    // requires a special machine instruction to load it.  Instead we'll try
    // the 'easy' case.  If we really need the extra range +/- PI/2 we'll
    // probably do the math inside the SIN encoding.

    // Make the merge point
    RegionNode* r = new RegionNode(3);
    Node* phi = new PhiNode(r, Type::DOUBLE);

    // Flatten arg so we need only 1 test
    Node *abs = _gvn.transform(new AbsDNode(arg));
    // Node for PI/4 constant
    Node *pi4 = makecon(TypeD::make(pi_4));
    // Check PI/4 : abs(arg)
    Node *cmp = _gvn.transform(new CmpDNode(pi4,abs));
    // Check: If PI/4 < abs(arg) then go slow
    Node *bol = _gvn.transform(new BoolNode( cmp, BoolTest::lt ));
    // Branch either way
    IfNode *iff = create_and_xform_if(control(),bol, PROB_STATIC_FREQUENT, COUNT_UNKNOWN);
    set_control(opt_iff(r,iff));

    // Set fast path result
    phi->init_req(2, n);

    // Slow path - non-blocking leaf call
    Node* call = NULL;
    switch (id) {
    case vmIntrinsics::_dsin:
      call = make_runtime_call(RC_LEAF, OptoRuntime::Math_D_D_Type(),
                               CAST_FROM_FN_PTR(address, SharedRuntime::dsin),
                               "Sin", NULL, arg, top());
      break;

      break;
    }

Basically, there is a fast path and a slow path for the trig methods: if the absolute value of the argument of sin is larger than Math.PI / 4, we use the slow path. The check involves a Math.abs call, which is what the mysterious vandpd 0xffe68411(%rip),%xmm2,%xmm0 was doing: it was clearing the top (sign) bit, which is a quick way to do abs for floating point values in SSE or AVX registers.
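The same check can be written out in Java. This is a rough sketch of the logic only, not what HotSpot emits; TrigPathSketch, absByMask and takesFastPath are hypothetical names:

```java
class TrigPathSketch {
    // Every bit set except the IEEE 754 sign bit.
    static final long ABS_MASK = 0x7fffffffffffffffL;

    // abs() by clearing the sign bit, the same trick the vandpd performs.
    static double absByMask(double x) {
        return Double.longBitsToDouble(Double.doubleToRawLongBits(x) & ABS_MASK);
    }

    // Fast path when abs(arg) <= PI/4; otherwise (PI/4 < abs(arg)) go slow.
    static boolean takesFastPath(double arg) {
        return absByMask(arg) <= Math.PI / 4;
    }

    public static void main(String[] args) {
        System.out.println(absByMask(-1.5));                    // 1.5
        System.out.println(takesFastPath(Math.PI / 4 - 0.01));  // true
        System.out.println(takesFastPath(Math.PI / 4 + 0.01));  // false
    }
}
```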

Now the rest of the code makes sense too: most of the code we see is the three fast paths after optimization: the two redundant fsin calls have been eliminated, but the surrounding checks haven't. This is probably just a limitation of the optimizer: either the optimizer just isn't strong enough to eliminate everything, or the expansion of these intrinsic methods happens after the optimization phase that would have combined them².

On the slow path, we do the make_runtime_call call thing, which shows up as a callq %r10. This is a so-called stub method call which internally will implement sin, including the "argument reduction" concern mentioned in the comments. On my system, the slow path is not necessarily much slower than the fast path: if you change the - to a + in the initialization of i:

private double i = Math.PI / 4 - 0.01;

you invoke the slow path, which for a single Math.sin(i) call takes ~50 ns versus 40 ns for the fast path³. The problem occurs with the optimization of the three redundant Math.sin(i) calls. As we see from the above source, the callq %r10 occurs three times (and by tracing through the execution path we see that they are all taken once the first jump falls through). This means the runtime is about 150 ns for the three calls, or almost 4x the fast path case.

Evidently, the JDK cannot combine the runtime_call nodes in this case, even though they are for identical arguments. Most likely the runtime_call nodes in the internal representation are relatively opaque and not subject to CSE and other optimizations that would help. These calls are largely used for intrinsic expansion and some internal JVM methods, and aren't really going to be key targets for this type of optimization, so this approach seems reasonable.

Recent Java 9

All of this changed in Java 9 with this change.

The "fast path" where fsin was directly inlined was removed. My use of quotes around "fast path" here is deliberate: there is certainly reason to believe that SSE or AVX-aware software sin methods could be faster than the x87 fsin, which hasn't gotten much love in over a decade. Indeed, this change is replacing the fsin calls "using Intel LIBM implementation" (here is the algorithm in its full glory for those that are interested).

Great, so maybe it's faster now (maybe - the OP didn't provide numbers, even after requested, so we don't know) - but the side effect is that without inlining, we always explicitly make a call for each Math.sin and Math.cos that appears in the source: no CSE occurs.
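If the repeated calls matter in practice, the common-subexpression elimination can be done by hand in the source. A minimal sketch, assuming a method body that sums three identical Math.sin(i) calls as discussed above (ManualCse, repeated and hoisted are hypothetical names, and no timing is claimed here):

```java
class ManualCse {
    // Three separate calls: where the JIT does not combine them,
    // each one can end up as its own call into the sin stub.
    static double repeated(double i) {
        return Math.sin(i) + Math.sin(i) + Math.sin(i);
    }

    // Hoisted by hand: one call, result reused from a local.
    static double hoisted(double i) {
        double s = Math.sin(i);
        return s + s + s;
    }

    public static void main(String[] args) {
        double i = Math.PI / 4 + 0.01;
        System.out.println(repeated(i));
        System.out.println(hoisted(i));
    }
}
```

Whether this actually helps depends on the JVM version and should be measured, for example with jmh.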

You could probably file this as a hotspot bug, especially since it can be positioned as a regression - although I suspect the use cases where known-identical arguments are repeatedly passed to trig functions are very slim. Even legitimate performance bugs, clearly explained and documented, often languish for years (unless of course you have a paid support contract with Oracle - then the languishing is somewhat less).


¹ Actually in quite a silly, roundabout way: it starts in memory at [rsi + 0x10] and then it loads it from there into xmm2, then does a reg-reg move into xmm1 and stores it back to memory at the top of the stack (vmovsd %xmm1,(%rsp)), then finally loads it into the x87 FP stack with fldl (%rsp). Of course, it could have just loaded it directly from its original location at [rsi + 0x10] with a single fld! This probably adds 5 cycles or more to the total latency.

² It should be noted though that the fsin instruction dominates the runtime here, so the extra stuff doesn't really add anything to the runtime: if you reduce the method to a single return Math.sin(i); line the runtime is about the same at 40 ns.

³ At least for arguments close to Math.PI / 4. Outside that range, the timing varies - being very fast for values close to pi / 2 (about 40 ns - as fast as the "fast path") and generally around 65 ns for very large values, which probably do the reduction via division/mod.

Organogenesis answered 1/1, 2018 at 4:12
