Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

9

100

Let's say the bottleneck of my Java program really is some tight loops to compute a bunch of vector dot products. Yes I've profiled, yes it's the bottleneck, yes it's significant, yes that's just how the algorithm is, yes I've run Proguard to optimize the byte code, etc.

The work is, essentially, dot products. As in, I have two float[50] arrays and I need to compute the sum of pairwise products. I know processor instruction sets exist to perform these kinds of operations quickly and in bulk, like SSE or MMX.
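For concreteness, the loop in question looks something like this (the method name and array sizes are illustrative):

```java
// Scalar dot product of two equal-length float arrays -- the loop I'm
// hoping the JIT will compile down to SIMD instructions.
static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```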

Yes, I can probably access these by writing some native code via JNI. But the JNI call itself turns out to be pretty expensive.

I know you can't guarantee what a JIT will or won't compile. Has anyone ever heard of a JIT generating code that uses these instructions? And if so, is there anything about the Java code that helps make it compilable this way?

Probably a "no"; worth asking.

Eskil answered 28/5, 2012 at 12:48 Comment(8)
The easiest way to find out is probably to get the most modern JIT you can find and have it output the generated assembly with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:+LogCompilation. You'll need a program that runs the vectorizable method enough times to make it "hot."Pessimist
Or have a look at the source. download.java.net/openjdk/jdk7Fructiferous
"Coming soon" to a jdk near you: mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2012-July/…Kendakendal
Actually, according to this blog, JNI can be rather fast if used "correctly".Yawl
A relevant blog post on this can be found here: psy-lob-saw.blogspot.com/2015/04/… with the general message that vectorization can happen, and does happen. Apart from vectorizing specific cases (Arrays.fill()/equals(char[])/arrayCopy) the JVM auto-vectorizes using Superword Level Parallelization. The relevant code is in superword.cpp and the paper it's based on is here: groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdfDivisible
@NitsanWakart Thank you for the blog. However, do you have information on when the SuperWord autovectorization kicks in for more typical code? What ISAs are supported? Etc.Fash
@AleksandrDubinsky SuperWord coverage is expanded all the time. AFAIK at the moment it will optimize only simple array accessing loops of the type: "for(int i=0;i<a.length;i++) a[i] = f(b[i]);", or code which can be reduced to such a loop. I'm not sure what you mean by typical code...Divisible
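A sketch of the distinction drawn in the comment above (both loop shapes are illustrative; whether a particular JVM build actually vectorizes either one has to be confirmed with -XX:+PrintAssembly):

```java
// A simple, unit-stride array loop of the shape SuperWord can handle:
// straight-line body, no method calls, counter incremented by 1.
static void scale(int[] dst, int[] src) {
    for (int i = 0; i < dst.length; i++) {
        dst[i] = src[i] * 2;
    }
}

// A loop SuperWord typically cannot handle: the non-unit stride and the
// data-dependent branch break the "pack adjacent operations" pattern.
static void negateEveryOtherPositive(int[] a) {
    for (int i = 0; i < a.length; i += 2) {
        if (a[i] > 0) a[i] = -a[i];
    }
}
```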
medium.com/itnext/… - Answer is: 'yes', and the JIT got pretty clever...Proteiform
46

So, basically, you want your code to run faster. JNI is the answer. I know you said it didn't work for you, but let me show you that you are wrong.

Here's Dot.java:

import java.nio.FloatBuffer;
import org.bytedeco.javacpp.*;
import org.bytedeco.javacpp.annotation.*;

@Platform(include = "Dot.h", compiler = "fastfpu")
public class Dot {
    static { Loader.load(); }

    static float[] a = new float[50], b = new float[50];
    static float dot() {
        float sum = 0;
        for (int i = 0; i < 50; i++) {
            sum += a[i]*b[i];
        }
        return sum;
    }
    static native @MemberGetter FloatPointer ac();
    static native @MemberGetter FloatPointer bc();
    static native @NoException float dotc();

    public static void main(String[] args) {
        FloatBuffer ab = ac().capacity(50).asBuffer();
        FloatBuffer bb = bc().capacity(50).asBuffer();

        for (int i = 0; i < 10000000; i++) {
            a[i%50] = b[i%50] = dot();
            float sum = dotc();
            ab.put(i%50, sum);
            bb.put(i%50, sum);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 10000000; i++) {
            a[i%50] = b[i%50] = dot();
        }
        long t2 = System.nanoTime();
        for (int i = 0; i < 10000000; i++) {
            float sum = dotc();
            ab.put(i%50, sum);
            bb.put(i%50, sum);
        }
        long t3 = System.nanoTime();
        System.out.println("dot(): " + (t2 - t1)/10000000 + " ns");
        System.out.println("dotc(): "  + (t3 - t2)/10000000 + " ns");
    }
}

and Dot.h:

float ac[50], bc[50];

inline float dotc() {
    float sum = 0;
    for (int i = 0; i < 50; i++) {
        sum += ac[i]*bc[i];
    }
    return sum;
}

We can compile and run that with JavaCPP using this command:

$ java -jar javacpp.jar Dot.java -exec

With an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Fedora 30, GCC 9.1.1, and OpenJDK 8 or 11, I get this kind of output:

dot(): 39 ns
dotc(): 16 ns

Or roughly 2.4 times faster. We need to use direct NIO buffers instead of arrays, but HotSpot can access direct NIO buffers as fast as arrays. On the other hand, manually unrolling the loop does not provide a measurable boost in performance in this case.
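The direct-buffer access pattern the answer relies on can be sketched with just the standard library (method names and sizes here are illustrative, not part of JavaCPP):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Direct buffers live outside the Java heap, so native code (like dotc()
// above) can read them without copying, while HotSpot compiles indexed
// get/put on them down to plain memory accesses.
static FloatBuffer directFloats(int n) {
    return ByteBuffer.allocateDirect(n * 4)
                     .order(ByteOrder.nativeOrder())
                     .asFloatBuffer();
}

static float dotDirect(FloatBuffer a, FloatBuffer b, int n) {
    float sum = 0f;
    for (int i = 0; i < n; i++) {
        sum += a.get(i) * b.get(i);
    }
    return sum;
}
```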

Epiphysis answered 30/5, 2012 at 2:6 Comment(21)
Did you use OpenJDK or Oracle HotSpot? Contrary to popular belief, they are not the same.Kendakendal
@exabrial This is what "java -version" returns on this machine right now: java version "1.6.0_22" OpenJDK Runtime Environment (IcedTea6 1.10.6) (fedora-63.1.10.6.fc15-x86_64) OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)Epiphysis
Nice one -- some of the advanced tricks here may be the key to speed. I'll have to investigate whether I can use them.Eskil
@Sean What kind of restrictions are you facing?Epiphysis
Nothing particularly restrictive -- I just don't know what JavaCPP entails. If it's as simple as it appears, no problem.Eskil
@Sean I like to think that JavaCPP is simple, that's why I made it and use it, but it's still C++ JNI code that gets compiled. If for any reason JavaCPP isn't acceptable, please refer to the code it generates, and treat it as part of this answer! :)Epiphysis
@Samuel Audet You're not running the full version of Java, just the open source parts. There differences are minor, and it won't likely fix your issue, but it does perform better overall oracle.com/technetwork/java/javase/downloadsKendakendal
@exabrial If you reread the question again, you will understand that the problem is that no version of HotSpot compiles Java to vector instructions like GCC does for C++. That's why the JNI version is faster: It's because the C++ loop gets vectorized.Epiphysis
@SamuelAudet I fully understand that and I also acknowledge that any openJDK JVM is unlikely to make that optimization. However, since you're trying to squeeze every last drop of performance out of the system, the Oracle JVM does perform faster than the OpenJDK one.Kendakendal
That loop likely has a carried loop dependency. You may get a further speedup by unrolling the loop two or more times.Varney
@raxman Thanks for pointing this out. Unfortunately, it looks like staying in Java becomes the most efficient solution :)Epiphysis
In your code you did not use SSE or MMX or NEON instructions. In that case, it may be favorable to use JNI even for such small vectors. For larger vectors, JNI is the favorite.Freestyle
@Freestyle GCC vectorizes the code with SSE, yes, but for such small data, JNI call overhead is unfortunately too large.Epiphysis
@exabrial All of my testing shows OpenJDK 6 was pretty far behind Oracle 6 and OpenJDK 7. If you care about performance, I wouldn't test with less than OpenJDK 7 or Oracle 6 (and 8 might not be perfectly optimized yet).Boiling
On my A6-7310 with JDK 13, I get: dot(): 69 ns / dotc(): 95 ns. Java wins!Net
@StefanReich Could you try annotating dotc() with @NoException? Newer versions of JavaCPP try to catch C++ exceptions by default...Epiphysis
@SamuelAudet Same numbers... 70 / 91 this time.Net
@StefanReich I've updated the results with more recent software (OpenJDK 8 or 11 and GCC 9.1.1) and for me dotc() got faster, but dot() didn't.Epiphysis
FWIW, even with OpenJDK 13 Early-Access 33 the results are the same for me. If there's some experimental mode that we should be activating, please let me know! ThanksEpiphysis
@SamuelAudet It might be some anomaly in my notebook. On my servers, the results are in favor of C as usual. I've made a JVM 13 debug build and will look at the assembly code.Net
@StefanReich I guess, it's the lack of AVX2 instruction set or inability of the compiler to detect AVX availability at all.Pembrook
43

To address some of the scepticism expressed by others here, I suggest anyone who wants to prove it to themselves (or to others) use the following method:

  • Create a JMH project
  • Write a small snippet of vectorizable math.
  • Run their benchmark, flipping between -XX:-UseSuperWord and -XX:+UseSuperWord (the default)
  • If no difference in performance is observed, your code probably didn't get vectorized
  • To make sure, run your benchmark such that it prints out the assembly. On Linux you can use the perfasm profiler ('-prof perfasm') to have a look and see whether the instructions you expect get generated.

Example:

@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE) // makes looking at the assembly easier
public void inc() {
    for (int i = 0; i < a.length; i++)
        a[i]++; // a is an int[]; I benchmarked with size 32K
}

The results with and without the flag (on a recent Haswell laptop, Oracle JDK 8u60):

-XX:+UseSuperWord : 475.073 ± 44.579 ns/op (nanoseconds per op)
-XX:-UseSuperWord : 3376.364 ± 233.211 ns/op

The assembly for the hot loop is a bit much to format and stick in here, but here's a snippet (hsdis.so is failing to format some of the AVX2 vector instructions, so I ran with -XX:UseAVX=1): -XX:+UseSuperWord (with '-prof perfasm:intelSyntax=true')

  9.15%   10.90%  │││ │↗    0x00007fc09d1ece60: vmovdqu xmm1,XMMWORD PTR [r10+r9*4+0x18]
 10.63%    9.78%  │││ ││    0x00007fc09d1ece67: vpaddd xmm1,xmm1,xmm0
 12.47%   12.67%  │││ ││    0x00007fc09d1ece6b: movsxd r11,r9d
  8.54%    7.82%  │││ ││    0x00007fc09d1ece6e: vmovdqu xmm2,XMMWORD PTR [r10+r11*4+0x28]
                  │││ ││                                                  ;*iaload
                  │││ ││                                                  ; - psy.lob.saw.VectorMath::inc@17 (line 45)
 10.68%   10.36%  │││ ││    0x00007fc09d1ece75: vmovdqu XMMWORD PTR [r10+r9*4+0x18],xmm1
 10.65%   10.44%  │││ ││    0x00007fc09d1ece7c: vpaddd xmm1,xmm2,xmm0
 10.11%   11.94%  │││ ││    0x00007fc09d1ece80: vmovdqu XMMWORD PTR [r10+r11*4+0x28],xmm1
                  │││ ││                                                  ;*iastore
                  │││ ││                                                  ; - psy.lob.saw.VectorMath::inc@20 (line 45)
 11.19%   12.65%  │││ ││    0x00007fc09d1ece87: add    r9d,0x8            ;*iinc
                  │││ ││                                                  ; - psy.lob.saw.VectorMath::inc@21 (line 44)
  8.38%    9.50%  │││ ││    0x00007fc09d1ece8b: cmp    r9d,ecx
                  │││ │╰    0x00007fc09d1ece8e: jl     0x00007fc09d1ece60  ;*if_icmpge

Have fun storming the castle!

Divisible answered 17/6, 2013 at 8:0 Comment(8)
From the same paper: "the JITed disassembler output suggests that it is not actually that efficient in terms of calling the most optimal SIMD instructions and their scheduling. A quick hunt through the JVM JIT compiler (Hotspot) source code suggests that this is due to the non-existence of packed SIMD instruction codes." The SSE registers are being used in scalar mode.Fash
@AleksandrDubinsky some cases are covered, some aren't. Do you have a concrete case you are interested in?Divisible
Let's flip the question and ask whether the JVM will autovectorize any arithmetic operations? Can you provide an example? I do have a loop that I had to pull out and rewrite using intrinsics recently. However, rather than hope for autovectorization, I would like to see support for explicit vectorization/intrinsics (similar to agner.org/optimize/vectorclass.pdf). Even better would be to write a good Java backend for Aparapi (although the leadership of that project has some wrong goals). Do you work on the JVM?Fash
@AleksandrDubinsky This is becoming a long discussion for the comments... you can contact me on e-mail nitsanw AT yahoo DOTCOM and I'll try help from thereDivisible
@AleksandrDubinsky I hope the expanded answer helps, if not perhaps an email would. Also note that "rewrite using intrinsics" implies you changed the JVM code to add new intrinsics, is that what you mean? I'm guessing you meant replacing your Java code with calls into a native implementation via JNIDivisible
Thank you. This should now be the official answer. I think you should remove the reference to the paper, since it is outdated and doesn't demonstrate vectorization.Fash
the (currently) relevant JVM source code: hg.openjdk.java.net/jdk9/jdk9/hotspot/file/30e1996bd55d/src/…Nasho
The link to the paper is now dead.Fash
26

In HotSpot versions beginning with Java 7u40, the server compiler provides support for auto-vectorisation; see JDK-6340864.

However, this seems to be true only for "simple loops" - at least for the moment. For example, accumulating an array cannot be vectorised yet; see JDK-7192383.
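To make the limitation concrete, this is the shape of reduction loop that JDK-7192383 tracks (array contents are illustrative). At the time, it stayed scalar even though element-wise loops over the same array vectorised:

```java
// An accumulation ("reduction") over an array: each iteration depends on
// the previous iteration's sum, which the SuperWord pass of that era did
// not reassociate into SIMD lanes.
static float accumulate(float[] a) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
        sum += a[i];
    }
    return sum;
}
```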

Allanallana answered 28/11, 2013 at 13:30 Comment(7)
Vectorization is there in JDK6 as well for some cases, though the targeted SIMD instruction set is not as wide.Divisible
Compiler vectorization support in HotSpot was improved a lot lately (June 2017) due to contributions by Intel. Performance-wise the yet unreleased jdk9 (b163 and later) currently wins over jdk8 due to bug-fixes enabling AVX2. Loops must fulfill a few constraints for auto-vectorization to work, e.g. use: int counter, constant counter increment, one termination condition with loop-invariant variables, loop body without method calls(?), no manual loop unfolding! Details are available in: cr.openjdk.java.net/~vlivanov/talks/…Allanallana
Vectorized fused-multiple-add (FMA) support does not look good currently (as of June 2017): it's either vectorization or scalar FMA(?). However, Oracle has apparently just accepted Intel's contribution to the HotSpot that enables FMA vectorization using AVX-512. To the delight of auto-vectorization fans and those lucky ones to have access to AVX-512 hardware, this may (with some luck) appear in one of the next jdk9 EA builds (beyond b175).Allanallana
A link to support the previous statement (RFR(M): 8181616: FMA Vectorization on x86): mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2017-June/…Allanallana
A small benchmark demonstrating acceleration by a factor of 4 on integers through loop vectorization using AVX2 instructions: prestodb.rocks/code/simdAllanallana
Vedran, why do you write comments to your answer instead of editing your answer? Edit your answer. The slides you linked to are very good and contain a lot of details.Fash
FMA does not require AVX-512. HotSpot will use FMA3, which is available on AVX2 CPUs since Haswell. It did not make it into Java 9, but with Oracle's plan to release Java every 6 months, we might get to use it soon enough.Fash
6

Here is a nice article about experimenting with Java and SIMD instructions, written by my friend: http://prestodb.rocks/code/simd/

Its general outcome is that you can expect the JIT to use some SSE operations in 1.8 (and some more in 1.9), though you should not expect much, and you need to be careful.

Moneychanger answered 28/6, 2017 at 20:9 Comment(1)
It would help if you summarized some key insights of the article you linked to.Fash
4

You could write an OpenCL kernel to do the computing and run it from Java using http://www.jocl.org/.

Code can be run on the CPU and/or GPU, and the OpenCL language also supports vector types, so you should be able to take explicit advantage of, e.g., SSE3/4 instructions.

Thiele answered 28/2, 2013 at 15:2 Comment(0)
4

Have a look at Performance comparison between Java and JNI for optimal implementation of computational micro-kernels. It shows that the Java HotSpot VM server compiler supports auto-vectorization using Super-word Level Parallelism, which is limited to simple cases of inside-the-loop parallelism. The article will also give you some guidance on whether your data size is large enough to justify going the JNI route.

Shouldst answered 5/12, 2015 at 23:11 Comment(0)
3

I'm guessing you wrote this question before you found out about netlib-java ;-) It provides exactly the native API you require, with machine-optimised implementations, and does not have any cost at the native boundary, thanks to memory pinning.

Misesteem answered 1/12, 2015 at 16:57 Comment(1)
Yeah, long time ago. I was more hoping to hear that this is automagically translated to vectorized instructions. But clearly it's not that hard to make it happen manually.Eskil
3

Java 16 introduced the Vector API (JEP 417, JEP 414, JEP 338). It is currently "incubating" (i.e., in beta), although anyone can use it. It will probably become GA in Java 19 or 20.

It's a little verbose, but is meant to be reliable and portable.

The following code can be rewritten:

void scalarComputation(float[] a, float[] b, float[] c) {
   assert a.length == b.length && b.length == c.length;
   for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
   }
}

Using the Vector API:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

void vectorComputation(float[] a, float[] b, float[] c) {
    assert a.length == b.length && b.length == c.length;
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    for (; i < upperBound; i += SPECIES.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i);
    }
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

Newer builds (i.e., Java 18) are trying to get rid of that last for loop using predicate instructions, but support for that is still supposedly spotty.
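Applied to the original question's dot product, the same pattern looks roughly like this (a sketch, not official JEP sample code; the method name is mine, and it assumes you run with --add-modules jdk.incubator.vector on releases where the API is still incubating):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

static float vectorDot(float[] a, float[] b) {
    assert a.length == b.length;
    var acc = FloatVector.zero(SPECIES);              // one partial sum per lane
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    for (; i < upperBound; i += SPECIES.length()) {
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        acc = va.fma(vb, acc);                        // lanewise acc += a * b
    }
    float sum = acc.reduceLanes(VectorOperators.ADD); // fold the lanes together
    for (; i < a.length; i++) {                       // scalar tail
        sum += a[i] * b[i];
    }
    return sum;
}
```

Note that the lanewise FMA reassociates the additions, so the result can differ from the strictly sequential scalar loop in the last bits of the float.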

Fash answered 21/12, 2021 at 17:5 Comment(0)
-6

I don't believe most, if any, VMs are smart enough for this sort of optimisation. To be fair, most optimisations are much simpler, such as shifting instead of multiplying when multiplying by a power of two. The Mono project introduced its own vector and other methods with native backings to help performance.

Former answered 30/5, 2012 at 2:40 Comment(1)
Currently, no Java hotspot compiler does this, but it isn't much harder than things that they do do. They do use SIMD instructions to copy multiple array values at once. You just have to write some more pattern matching and code generation code, which is pretty straightforward after doing some loop unrolling. I think the people at Sun just got lazy, but it looks like it will now happen at Oracle (yay Vladimir! this should help our code a lot!): mail.openjdk.java.net/pipermail/hotspot-compiler-dev/…Glossy
