I'm trying to find a concise example that shows auto-vectorization in Java on an x86-64 system.
I've implemented the code below, which computes y[i] = y[i] + x[i] in a for loop. This code can benefit from auto-vectorization, so I think Java should compile it at runtime using SSE or AVX instructions to speed it up.
However, I couldn't find the vectorized instructions in the resulting native machine code.
VecOpMicroBenchmark.java should benefit from auto-vectorization:
/**
 * Run with this command to show native assembly:<br/>
 * java -XX:+UnlockDiagnosticVMOptions
 * -XX:CompileCommand=print,VecOpMicroBenchmark.profile VecOpMicroBenchmark
 */
public class VecOpMicroBenchmark {

    private static final int LENGTH = 1024;

    private static long profile(float[] x, float[] y) {
        long t = System.nanoTime();

        for (int i = 0; i < LENGTH; i++) {
            y[i] = y[i] + x[i]; // line 14
        }

        t = System.nanoTime() - t;
        return t;
    }

    public static void main(String[] args) throws Exception {
        float[] x = new float[LENGTH];
        float[] y = new float[LENGTH];

        // to let the JIT compiler do its work, repeatedly invoke
        // the method under test and then do a little nap
        long minDuration = Long.MAX_VALUE;
        for (int i = 0; i < 1000; i++) {
            long duration = profile(x, y);
            minDuration = Math.min(minDuration, duration);
        }
        Thread.sleep(10);

        System.out.println("\n\nduration: " + minDuration + "ns");
    }
}
To find out if it gets vectorized, I did the following:
- open Eclipse and create the above file
- right-click the file and, from the dropdown menu, choose Run > Java Application (ignore the output for now)
- in the Eclipse menu, click Run > Run Configurations...
- in the opened window, find VecOpMicroBenchmark, click it and choose the Arguments tab
- in the Arguments tab, under VM arguments, put in this:
-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,VecOpMicroBenchmark.profile
- get libhsdis and copy (possibly rename) the file hsdis-amd64.so (.dll for Windows) to the java/lib directory. In my case, this was /usr/lib/jvm/java-11-openjdk-amd64/lib.
- run VecOpMicroBenchmark again
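Since the VM arguments are entered in a dialog, it can be worth confirming that they actually reached the JVM. The following is only a minimal sketch: the class name ShowVmArgs and its helper methods are made up for illustration. It relies on the standard RuntimeMXBean.getInputArguments() call and would have to be run with the same run configuration.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper class, not part of the benchmark above.
class ShowVmArgs {

    /** Returns the arguments passed to the JVM itself (not those passed to main). */
    static List<String> vmArgs() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** True if at least one VM argument starts with the given prefix. */
    static boolean hasArgStartingWith(List<String> args, String prefix) {
        return args.stream().anyMatch(a -> a.startsWith(prefix));
    }

    public static void main(String[] args) {
        List<String> vm = vmArgs();
        System.out.println("VM arguments: " + vm);
        System.out.println("CompileCommand set: "
                + hasArgStartingWith(vm, "-XX:CompileCommand="));
    }
}
```

If "CompileCommand set" prints false, the run configuration being launched is probably not the one that was edited.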
It should now print lots of information to the console, part of it being the disassembled native machine code that was produced by the JIT compiler. If you see lots of messages but no assembly instructions like mov, push, add, etc., then you may find the following message somewhere in the output:
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
This means that Java couldn't find the file hsdis-amd64.so: it's not in the right directory or it doesn't have the right name.
hsdis-amd64.so is the disassembler required for showing the resulting native machine code. After the JIT compiler compiles the Java bytecode to native machine code, hsdis-amd64.so is used to disassemble that machine code into human-readable form. You can find more information on how to get and install it at "How to see JIT-compiled code in JVM?".
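To rule out a misplaced or misnamed file, a quick check like the following may help. It is a sketch under assumptions: the class name HsdisCheck is hypothetical, the file name is the Linux x86-64 one used in this post, and it only looks in the lib directory of the running JVM's java.home, which is where the file was copied above. It has to be run with the same JVM that runs the benchmark.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the benchmark above.
class HsdisCheck {

    // Assumption: Linux x86-64 file name, as used in this post.
    static final String HSDIS_FILE = "hsdis-amd64.so";

    /** Builds the candidate path <javaHome>/lib/hsdis-amd64.so. */
    static Path candidate(String javaHome) {
        return Paths.get(javaHome, "lib", HSDIS_FILE);
    }

    public static void main(String[] args) {
        // java.home points at the installation of the JVM running this class.
        Path p = candidate(System.getProperty("java.home"));
        System.out.println(p + (Files.isReadable(p) ? " found" : " NOT found"));
    }
}
```

If this reports NOT found while the file exists elsewhere, Eclipse may be launching a different JDK than the one the file was copied into.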
After finding assembly instructions in the output, I skimmed through it (too much to post all of it here) and looked for line 14. I found this:
0x00007fac90ee9859: nopl 0x0(%rax)
0x00007fac90ee9860: cmp 0xc(%rdx),%esi ; implicit exception: dispatches to 0x00007fac90ee997f
0x00007fac90ee9863: jnb 0x7fac90ee9989
0x00007fac90ee9869: movsxd %esi,%rbx
0x00007fac90ee986c: vmovss 0x10(%rdx,%rbx,4),%xmm0 ;*faload {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile@16 (line 14)
0x00007fac90ee9872: cmp 0xc(%rdi),%esi ; implicit exception: dispatches to 0x00007fac90ee9997
0x00007fac90ee9875: jnb 0x7fac90ee99a1
0x00007fac90ee987b: movsxd %esi,%rbx
0x00007fac90ee987e: vmovss 0x10(%rdi,%rbx,4),%xmm1 ;*faload {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile@20 (line 14)
0x00007fac90ee9884: vaddss %xmm1,%xmm0,%xmm0
0x00007fac90ee9888: movsxd %esi,%rbx
0x00007fac90ee988b: vmovss %xmm0,0x10(%rdx,%rbx,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile@22 (line 14)
So it's using the AVX instruction vaddss. But, if I'm correct here, vaddss means "add scalar single-precision floating-point values", and this only adds one float value to another one (here, scalar means just one, whereas single means 32 bit, i.e. float and not double).
What I expect here is vaddps, which means "add packed single-precision floating-point values" and which is a true SIMD instruction (SIMD = single instruction, multiple data = vectorized instruction). Here, packed means multiple floats packed together in one register.
About the ..ss and ..ps, see http://www.songho.ca/misc/sse/sse.html :
SSE defines two types of operations; scalar and packed. Scalar operation only operates on the least-significant data element (bit 0~31), and packed operation computes all four elements in parallel. SSE instructions have a suffix -ss for scalar operations (Single Scalar) and -ps for packed operations (Parallel Scalar).
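To make the scalar/packed distinction concrete, here is a plain-Java model of the two instructions on a 128-bit register holding four floats. It is purely illustrative: the class and method names are invented, and nothing here makes the JIT emit either instruction. It only mimics what each one computes.

```java
import java.util.Arrays;

// Hypothetical illustration class; models register contents as float[4].
class ScalarVsPacked {

    /** Models vaddss: only lane 0 is added; lanes 1..3 are copied from the first operand. */
    static float[] addScalar(float[] a, float[] b) {
        float[] r = a.clone();
        r[0] = a[0] + b[0];
        return r;
    }

    /** Models vaddps on a 128-bit register: all four lanes are added. */
    static float[] addPacked(float[] a, float[] b) {
        float[] r = new float[4];
        for (int lane = 0; lane < 4; lane++) {
            r[lane] = a[lane] + b[lane];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        System.out.println(Arrays.toString(addScalar(a, b))); // [11.0, 2.0, 3.0, 4.0]
        System.out.println(Arrays.toString(addPacked(a, b))); // [11.0, 22.0, 33.0, 44.0]
    }
}
```

In these terms, the loop in profile would need 1024 vaddss-style additions but only 256 vaddps-style additions on 128-bit registers, which is the speed-up I was hoping to see.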
Question:
Is my Java example incorrect, or why is there no SIMD instruction in the output?
LENGTH. Apparently each element access checks that the index is within the bounds of the respective array and it throws an exception if not. This may very well disable vectorization. – Manfullea
for shift-and-add into a new destination. – Hubey