How can I write code to hint to the JVM to use vector operations?

Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Preface: I am trying to do this in pure java (no JNI to C++, no GPGPU work, etc...). I have profiled and the bulk of the processing time is coming from the math operations in this method (which is probably 95% floating point math and 5% integer math). I've already reduced all Math.xxx() calls to an approximation that's good enough so most of the math is now floating point multiplies with a few adds.

I have some code that deals with audio processing. I've been making tweaks and have already come across great gains. Now I'm looking into manual loop unrolling to see if there's any benefit (at least with a manual unroll of 2, I am seeing approximately a 25% improvement). While trying my hand at a manual unroll of 4 (which is starting to get very complicated since I am unrolling both loops of a nested loop) I am wondering if there's anything I can do to hint to the jvm that at runtime it can use vector operations (e.g. SSE2, AVX, etc...). Each sample of the audio can be calculated completely independently of other samples, which is why I've been able to see a 25% improvement already (reducing the amount of dependencies on floating point calculations).
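
For concreteness, the unroll-by-2 version has roughly this shape (placeholder names and a FIR-style inner loop stand in for my real code; the actual math is different but has the same structure):

    // Sketch only: samples/coeffs and the FIR-style body are placeholders.
    // The point is that acc0 and acc1 never read each other, so their
    // floating point dependency chains can execute in parallel.
    static float[] processUnrolled2(float[] samples, float[] coeffs) {
        int n = samples.length - coeffs.length;
        float[] out = new float[n];
        int i = 0;
        for (; i + 1 < n; i += 2) {
            float acc0 = 0f, acc1 = 0f;
            for (int k = 0; k < coeffs.length; k++) {
                acc0 += samples[i + k]     * coeffs[k];
                acc1 += samples[i + 1 + k] * coeffs[k];
            }
            out[i]     = acc0;
            out[i + 1] = acc1;
        }
        for (; i < n; i++) { // tail iteration when n is odd
            float acc = 0f;
            for (int k = 0; k < coeffs.length; k++) {
                acc += samples[i + k] * coeffs[k];
            }
            out[i] = acc;
        }
        return out;
    }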

For example, I have 4 floats, one for each of the 4 unrolls of the loop to hold a partially computed value. Does how I declare and use these floats matter? If I make it a float[4] does that hint to the jvm that they are unrelated to each other vs having float,float,float,float or even a class of 4 public floats? Is there something I can do without meaning to that will kill my chance at code being vectorized?
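
To make that concrete, these are the two accumulator styles I'm asking about (again a placeholder sketch, not my actual kernel):

    // Style A: four independent scalar locals hold the partial results.
    static float[] partialsAsScalars(float[] samples, float[] coeffs, int i) {
        float a0 = 0f, a1 = 0f, a2 = 0f, a3 = 0f;
        for (int k = 0; k < coeffs.length; k++) {
            a0 += samples[i + k]     * coeffs[k];
            a1 += samples[i + 1 + k] * coeffs[k];
            a2 += samples[i + 2 + k] * coeffs[k];
            a3 += samples[i + 3 + k] * coeffs[k];
        }
        return new float[] { a0, a1, a2, a3 };
    }

    // Style B: the same four partial results kept in a float[4] instead.
    static float[] partialsAsArray(float[] samples, float[] coeffs, int i) {
        float[] a = new float[4];
        for (int k = 0; k < coeffs.length; k++) {
            a[0] += samples[i + k]     * coeffs[k];
            a[1] += samples[i + 1 + k] * coeffs[k];
            a[2] += samples[i + 2 + k] * coeffs[k];
            a[3] += samples[i + 3 + k] * coeffs[k];
        }
        return a;
    }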

I've come across articles online about writing code "normally" because the compiler/jvm knows the common patterns and how to optimize them and deviating from the patterns can mean less optimization. At least in this case however, I wouldn't have expected unrolling the loops by 2 to have improved performance by as much as it did so I'm wondering if there's anything else I can do (or at least not do) to help my chances. I know that the compiler/jvm are only going to get better so I also want to be wary of doing things that will hurt me in the future.

Edit for the curious: unrolling by 4 increased performance by another ~25% over unrolling by 2, so I really think vector operations would help in my case if the jvm supported it (or perhaps already is using them).

Thanks!

Haemolysis answered 3/5, 2014 at 22:12 Comment(8)
1. I don't think that unrolling the outer loop makes any sense if the inner loop is repeated many times. 2. The JVM itself does a lot of unrolling, but sometimes it's unable to take advantage of it. This question shows a nearly 4 times improvement in a trivial case. 3. Writing clear and simple code is the right thing 99.9% of the time, but if you know what you're doing and try hard and are prepared for the maintenance costs then you can do better than the JIT.Tonicity
@Tonicity Thanks for the link. The inner loop is executed usually between 10 and 300 times depending upon some user choices (usually on the lower end around 30-40). The outer loop however is executed tens or hundreds of thousands of times. I tried unrolling only the inner loop, but it increased the execution time. I tried only unrolling the outer loop and it did decrease execution time, but only by a little bit. I guess the CPU can really pick out the dependency chains easier when both are unrolled.Haemolysis
My idea was: When the inner loop executes 10 times, then each instruction inside it weighs 10 times as much as an instruction outside; therefore I'd ignore the outside. I can only guess, but I'd bet that the JVM unrolls anyway (sometimes even too much) and the slowdown is just a JIT glitch (where it optimizes two equivalent chunks differently and the resulting timings differ a lot). I'm afraid looking at the generated assembler is the way to go.Tonicity
You explicitly excluded GPGPU (and I wonder why...), but maybe you should nevertheless have a short look at code.google.com/p/aparapi : It will use either OpenCL for the computation (and this can be either a GPU or the CPU!), or Java Thread Pools when no OpenCL is available.Habergeon
@Habergeon I am excluding GPGPU options because while yes, this is pretty much a perfect candidate for GPU work, it negates the huge benefit of "write once run anywhere" that java has going for it. I'd like to make this as performant as possible in pure java. The goal isn't "I need this to be as time efficient as possible" (or I'd write it in C and use CUDA/OpenCL), but "how can I make this as time efficient as possible in java?" And one of the questions I had about that was whether there was a way to hint to the jvm that vector optimizations will help. Appreciate the link though! I hadn't heard of it.Haemolysis
@Habergeon Ah I see that aparapi DOES use the "write once run anywhere" paradigm - I thought you would need to write multiple kernels for different GPUs at first. I'll take another look. I'm already using thread pools on a high level (as there are many of these operations that need to be done), so it might take some refactoring where each thread gets a sample to calculate rather than a chunk. Thanks!Haemolysis
Note that Aparapi is under heavy development - which is good, but I haven't managed to follow the development closely recently, so I'm not sure how "production ready" it is. But it's certainly by far the most mature approach for compiling Java to the GPU. It's even considered a "tracer bullet" for openjdk.java.net/projects/sumatra , and the lead developer of Aparapi is also part of the Sumatra crew.Habergeon
Also note that JIT compilation is controlled by JVM options set at launch time (vendor specific). You may consider learning these, and even comparing IBM to Oracle to Azul to GCJ to see if one does the job better than the rest.Frigg

How can I ... audio processing ... pure java (no JNI to C++, no GPGPU work, etc...) ... use vector operations (e.g. SSE2, AVX, etc...)

Java is a high-level language (one instruction in Java generates many hardware instructions) that is, by design (e.g. garbage-collected memory management), not well suited for tasks that manipulate high data volumes in real time.

There are usually special pieces of hardware optimized for a particular role (e.g. image processing or speech recognition) that often achieve parallelism through several simplified processing pipelines.

There are also special programming languages for this sort of task, mainly hardware description languages and assembly language.

Even C++ (considered a fast language) will not automagically use some super optimized hardware operations for you. It may just inline one of several hand-crafted assembly language methods in certain places.

So my answer is that there is "probably no way" to instruct the JVM to use a particular hardware optimization for your code (e.g. SSE), and even if there were, the Java runtime would still have too many other factors that slow your code down.

Use a low-level language designed for this task and link it to Java for the high-level logic.

EDIT: adding some more info based on comments

If you are convinced that a high-level "write once run anywhere" language runtime should also do lots of low-level optimizations for you and automagically turn your high-level code into optimized low-level code, then... the way the JIT compiler optimizes depends on the implementation of the Java Virtual Machine, and there are many of them.

In the case of the Oracle JVM (HotSpot) you can start looking for your answer by downloading the source code; the text SSE2 appears in the following files:

  • openjdk/hotspot/src/cpu/x86/vm/assembler_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/assembler_x86.hpp
  • openjdk/hotspot/src/cpu/x86/vm/c1_LIRGenerator_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/c1_Runtime1_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/sharedRuntime_x86_32.cpp
  • openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.hpp
  • openjdk/hotspot/src/cpu/x86/vm/x86_32.ad
  • openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp
  • openjdk/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
  • openjdk/hotspot/src/share/vm/c1/c1_LinearScan.cpp
  • openjdk/hotspot/src/share/vm/runtime/globals.hpp

They're in C++ and assembly language, so you will have to learn some low-level languages to read them anyway.

I would not hunt that deep even with a +500 bounty. IMHO the question is wrong, based on wrong assumptions.

Tenfold answered 4/5, 2014 at 12:21 Comment(2)
I know that Java is a high-level language and I know that you can use JNI to call into C/C++, but I said that it wasn't an option. Java is "write once run anywhere" and using JNI completely negates that huge benefit. Also, there are a few compilers for C++ that WILL "automagically" use super optimized operations - it's called automatic vectorization and some compilers (e.g. the Intel compiler) are pretty good at it (though it obviously isn't perfect). I appreciate your response and you taking the time to link to several different topics, but it does not really answer my question.Haemolysis
Also, I know that the compiled code won't be directly compiled into e.g. SSE instructions, but since this is a hot spot in the code path it's a great candidate for JIT compilation, and I was wondering if there's anything I can do such that the jvm will see that this section should be jitted and that it can utilize vector operations.Haemolysis

SuperWord optimizations on Hotspot are limited and quite fragile. Limited since they are generally behind what a C/C++ compiler offers, and fragile since they depend on particular loop shapes (and are only supported for certain CPUs).
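
As a rough illustration (a sketch of the kind of shape SuperWord tends to handle, not a guarantee that any given HotSpot build will vectorize it): a simple counted int loop with unit stride, element-wise array work, and no calls or branches in the body is the friendliest case.

    // A loop shape the SuperWord pass is most likely to recognize:
    // int induction variable, unit stride, straight-line body, no calls.
    static void scaleAdd(float[] dst, float[] a, float[] b, float s) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] * s + b[i];
        }
    }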

I understand you want to write once, run anywhere. It sounds like you already have a pure Java solution. You might want to consider optional, platform-specific implementations for known popular platforms to supplement that pure Java one, making it "fast in some places" (which is probably already true anyway).

It's hard to give you more concrete feedback without seeing some code. I suggest you take the loop in question and present it in a JMH benchmark. That makes it easy to analyze and discuss.
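
Something along these lines would do (a bare-bones sketch; the sizes, state and kernel body are placeholders to be replaced with your real loop):

    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Warmup(iterations = 5)
    @Measurement(iterations = 5)
    @Fork(1)
    public class AudioKernelBench {

        @Param({"256", "4096"})
        int size;

        float[] samples;
        float[] coeffs;

        @Setup
        public void setup() {
            Random r = new Random(42);
            samples = new float[size];
            coeffs = new float[32];
            for (int i = 0; i < samples.length; i++) samples[i] = r.nextFloat();
            for (int i = 0; i < coeffs.length; i++) coeffs[i] = r.nextFloat();
        }

        @Benchmark
        public float[] kernel() {
            // Placeholder kernel: replace with the actual loop under discussion.
            float[] out = new float[samples.length - coeffs.length];
            for (int i = 0; i < out.length; i++) {
                float acc = 0f;
                for (int k = 0; k < coeffs.length; k++) {
                    acc += samples[i + k] * coeffs[k];
                }
                out[i] = acc;
            }
            return out;
        }
    }

Returning the result array keeps the JIT from dead-code-eliminating the work; from there you can compare scores across variants, and JMH's perfasm profiler (-prof perfasm, where available) can show whether packed SSE/AVX instructions actually appear in the hot loop.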

Thacker answered 14/11, 2016 at 7:37 Comment(0)
