How can I code Java to allow SSE use and bounds-check elimination (or other advanced optimizations)?

Asked 29/8, 2009 at 21:28 Answered 8/12, 2009 at 5:38

Solved java performance optimization jvm-hotspot bounds-check-elimination

The Situation:

I'm optimizing a pure-java implementation of the LZF compression algorithm, which involves a lot of byte[] access and basic int mathematics for hashing and comparison. Performance really matters, because the goal of the compression is to reduce I/O requirements. I am not posting code because it isn't cleaned up yet, and may be restructured heavily.

The Questions:

How can I write my code to allow it to JIT-compile to a form using faster SSE operations?
How can I structure it so the compiler can easily eliminate array bounds checks?
Are there any broad references about the relative speed of specific math operations (how many increments/decrements does it take to equal a normal add/subtract, how fast is shift-or vs. an array access)?
How can I work on optimizing branching -- is it better to have numerous conditional statements with short bodies, or a few long ones, or short ones with nested conditions?
With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?

What I've already done:

Before I get attacked for premature optimization: the basic algorithm is already excellent, but the Java implementation is less than 2/3 the speed of equivalent C. I've already replaced copying loops with System.arraycopy, worked on optimizing loops and eliminated un-needed operations.

I make heavy use of bit twiddling and packing bytes into ints for performance, as well as shifting & masking.

For legal reasons, I can't look at implementations in similar libraries, and existing libraries have too restrictive license terms to use.

Requirements for a GOOD (accepted) answer:

Unacceptable answers: "this is faster" without an explanation of how much AND why, OR hasn't been tested with a JIT compiler.
Borderline answers: have not been tested with anything before Hotspot 1.4
Basic answers: will provide a general rule and explanation of why it is faster at the compiler level, and roughly how much faster
Good answers: include a couple of samples of code to demonstrate
Excellent answers: have benchmarks with both JRE 1.5 and 1.6
PERFECT answer: Is by someone who worked on the HotSpot compiler, and can fully explain or reference the conditions for an optimization to be used, and how much faster it typically is. Might include java code and sample assembly code generated by HotSpot.

Also: if anyone has links detailing the guts of Hotspot optimization and branching performance, those are welcome. I know enough about bytecode that a site analyzing performance at a bytecode rather than sourcecode level would be helpful.

(Edit) Partial Answer: Bounds-Check Ellimination:

This is taken from supplied link to the HotSpot internals wiki at: https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination

HotSpot will eliminate bounds checks in all for loops with the following conditions:

Array is loop invariant (not reallocated within the loop)
Index variable has a constant stride (increases/decreases by constant amount, in only one spot if possible)
Array is indexed by a linear function of the variable.

Example: int val = array[index*2 + 5]

OR: int val = array[index+9]

NOT: int val = array[Math.min(var,index)+7]

Early version of code:

This is a sample version. Do not steal it, because it is an unreleased version of code for the H2 database project. The final version will be open source. This is an optimization upon the code here: H2 CompressLZF code

Logically, this is identical to the development version, but that one uses a for(...) loop to step through input, and an if/else loop for different logic between literal and backreference modes. It reduces array access and checks between modes.

public int compressNewer(final byte[] in, final int inLen, final byte[] out, int outPos){
        int inPos = 0;
        // initialize the hash table
        if (cachedHashTable == null) {
            cachedHashTable = new int[HASH_SIZE];
        } else {
            System.arraycopy(EMPTY, 0, cachedHashTable, 0, HASH_SIZE);
        }
        int[] hashTab = cachedHashTable;
        // number of literals in current run
        int literals = 0;
        int future = first(in, inPos);
        final int endPos = inLen-4;

        // Loop through data until all of it has been compressed
        while (inPos < endPos) {
                future = (future << 8) | in[inPos+2] & 255;
//                hash = next(hash,in,inPos);
                int off = hash(future);
                // ref = possible index of matching group in data
                int ref = hashTab[off];
                hashTab[off] = inPos;
                off = inPos - ref - 1; //dropped for speed

                // has match if bytes at ref match bytes in future, etc
                // note: using ref++ rather than ref+1, ref+2, etc is about 15% faster
                boolean hasMatch = (ref > 0 && off <= MAX_OFF && (in[ref++] == (byte) (future >> 16) && in[ref++] == (byte)(future >> 8) && in[ref] == (byte)future));

                ref -=2; // ...EVEN when I have to recover it
                // write out literals, if max literals reached, OR has a match
                if ((hasMatch && literals != 0) || (literals == MAX_LITERAL)) {
                    out[outPos++] = (byte) (literals - 1);
                    System.arraycopy(in, inPos - literals, out, outPos, literals);
                    outPos += literals;
                    literals = 0;
                }

                //literal copying split because this improved performance by 5%

                if (hasMatch) { // grow match as much as possible
                    int maxLen = inLen - inPos - 2;
                    maxLen = maxLen > MAX_REF ? MAX_REF : maxLen;
                    int len = 3;
                    // grow match length as possible...
                    while (len < maxLen && in[ref + len] == in[inPos + len]) {
                        len++;
                    }
                    len -= 2;

                    // short matches write length to first byte, longer write to 2nd too
                    if (len < 7) {
                        out[outPos++] = (byte) ((off >> 8) + (len << 5));
                    } else {
                        out[outPos++] = (byte) ((off >> 8) + (7 << 5));
                        out[outPos++] = (byte) (len - 7);
                    }
                    out[outPos++] = (byte) off;
                    inPos += len;

                    //OPTIMIZATION: don't store hashtable entry for last byte of match and next byte
                    // rebuild neighborhood for hashing, but don't store location for this 3-byte group
                    // improves compress performance by ~10% or more, sacrificing ~2% compression...
                    future = ((in[inPos+1] & 255) << 16) | ((in[inPos + 2] & 255) << 8) | (in[inPos + 3] & 255);
                    inPos += 2;
                } else { //grow literals
                    literals++;
                    inPos++;
                } 
        }
        
        // write out remaining literals
        literals += inLen-inPos;
        inPos = inLen-literals;
        if(literals >= MAX_LITERAL){
            out[outPos++] = (byte)(MAX_LITERAL-1);
            System.arraycopy(in, inPos, out, outPos, MAX_LITERAL);
            outPos += MAX_LITERAL;
            inPos += MAX_LITERAL;
            literals -= MAX_LITERAL;
        }
        if (literals != 0) {
            out[outPos++] = (byte) (literals - 1);
            System.arraycopy(in, inPos, out, outPos, literals);
            outPos += literals;
        }
        return outPos; 
    }

Final edit:

I've marked the best answer so far as accepted, since the deadline is nearly up. Since I took so long before deciding to post code, I will continue to upvote and respond to comments where possible. Apologies if the code is messy: this represented code in development, not polished up for committing.

Obstruct answered 29/8, 2009 at 21:28 Comment(5)

some notes. you may want to create the cachedHashTable on construction (of the class if it is static) If this is per thread then this should be made clear in the api (I.E. the user is responsible for creating a specific instance of the compressor and that holds the cached table. The branch prediction will liekly get it right but you may get some benefit from this getting into the later generations faster as well as simplifying the code. – Keelson 15/9, 2009 at 11:13

a nasty hack - but given that you're doing something nasty with it anyway. you don't need to decrement ref by 2 unless it is actually used, and it is only used ion one place after (within an if statement) so simply try: while (len < maxLen && in[ref + len - 2] and remove the ref -= 2 – Keelson 15/9, 2009 at 11:17

Yuck, yeah, that is kind of nasty. I've got a couple other nasty hacks I can do too. If you think the code is bad here, you should see the original, though! – Obstruct 15/9, 2009 at 16:1

I've got a version that pre-initializes the hashtable, and uses a range check on the possible backref. But, with short in/out buffers (8192 kB is common, when compressing streams), there is a benefit to being able to quickly eliminate possible backrefs when the ref is 0. – Obstruct 15/9, 2009 at 16:6

+1 I have been scouring docs on several free JVMs and cannot find any clear mention of how and when SIMD (SSE etc) are used by the JVM/JIT. Only that they do indeed use them since 1.4.2. Compiler / JVM developers definitely gun down any idea of "providing hints" or "optimizing for" via code structure. There were some Snorcle JVM patches in 6u2 referencing "SIMD optimizations" in bug 6536652; if you follow the related bug links, it appears that the JVM will do some rudimentary loop parallelization but details of the actual implementation and its use are scarce. – Steroid 11/11, 2010 at 0:35

Not a full answer, I simply don't have time to do the detailed benchmarks your question needs but hopefully useful.

Know your enemy

You are targeting a combination of the JVM (in essence the JIT) and the underlying CPU/Memory subsystem. Thus "This is faster on JVM X" is not likely to be valid in all cases as you move into more aggressive optimisations.

If your target market/application will largely run on a particular architecture you should consider investing in tools specific to it. * If your performance on x86 is the critical factor then intel's VTune is excellent for drilling down into the sort of jit output analysis you describe. * The differences between 64 bit and 32 bit JITs can be considerable, especially on x86 platforms where calling conventions can change and enregistering opportunities are very different.

Get the right tools

You would likely want to get a sampling profiler. The overhead of instrumentation (and the associated knock on on things like inlining, cache pollution and code size inflation) for your specific needs would be far too great. The intel VTune analyser can actually be used for Java though the integration is not so tight as others.
If you are using the sun JVM and are happy only knowing what the latest/greatest version is doing then the options available to investigate the output of the JIT are considerable if you know a bit of assembly. This article details some interesting analysis using this functionality

Learn from other implementations

The Change history change history indicates that previous inline assembly was in fact counter productive and that allowing the compiler to take total control of the output (with tweaks in code rather than directives in assembly) yielded better results.

Some specifics

Since LZF is, in an efficient unmanaged implementation on modern desktop CPUS, largely memory bandwidth limited (hence it being compered to the speed of an unoptimised memcpy) you will need you code to remain entirely within level 1 cache.
As such any static fields you cannot make into constants should be placed within the same class as these values will often be placed within the same area of memory devoted to the vtables and meta data associated with classes.

Object allocations which cannot be trapped by Escape Analysis (only in 1.6 onwards) will need to be avoided.

The c code makes aggressive use of loop unrolling. However the performance of this on older (1.4 era) VM's is heavily dependant on the mode the JVM is in. Apparently latter sun jvm versions are more aggressive at inlining and unrolling, especially in server mode.

The prefetch instrctions generated by the JIT can make all the difference on code like this which is near memory bound.

"It's coming straight for us"

Your target is moving, and will continue to. Again Marc Lehmann's previous experience:

default HLOG size is now 15 (cpu caches have increased)

Even minor updates to the jvm can involve significant compiler changes

6544668 Don't vecorized array operations that can't be aligned at runtime. 6536652 Implement some superword (SIMD) optimizations 6531696 don't use immediate 16-bits value store to memory on Intel cpus 6468290 Divide and allocate out of eden on a per cpu basis

Captain Obvious

Measure, Measure, Measure. IF you can get your library to include (in a separate dll) a simple and easy to execute benchmark that logs the relevant information (vm version, cpu, OS, command line switches etc) and makes this simple to send back to you you will increase your coverage, best of all you'll cover those people using it that care.

Keelson answered 11/9, 2009 at 22:0 Comment(9)

Good answer (best so far)! Do you know any sites that provide info about the cycles required for specific assembly commands? Or advice about how bytecode maps to x86 assembly? I don't know that much about assembly. VTune is a little beyond my budget right now (heh), although I'll keep it in mind for the future. I'm working on a version with more predictable array access, and less array fetches. There is zero object allocation, but the branches in code are more complex. Are constant additions (var2 = var1+2, for example) likely to be optimized heavily? – Obstruct 14/9, 2009 at 15:46

cycles per instruction are no longer a meaningful metirc since it is heavily dependent on the interaction with other instructions and dependencies, cache availability, and OOE along with pipeline depth. The jit mapping is way more complex that "X byte code => Y x86". sequential array access is good but the branchiness may need to provide > linear speed up for this to pay off significantly. Any compie time constants are folded by the compiler. – Keelson 14/9, 2009 at 16:32

Addition of a variable constant may not make a significant difference except the compiler may be able to effectively share this value in a register if it is used often. Free profilers are available for java. You should at least look at some of them, code.google.com/p/oktech-profiler is open source, free and has a sampling mode. – Keelson 14/9, 2009 at 16:33

I figured the bytecode mapping might be complex -- trying to simplify dependencies here where possible, but not sure how well it will work. Good to know that the additions do not pose problems -- they're just constants offsets relative to changing variables. I've done profiling, but it can't help further -- roughly 99% of runtime lives in the compression loop(s) and System.arraycopy to copy literal runs to output. Can you give any more info about how to approach optimizing for the cache and maximal pipeline depth, from the end of java source? – Obstruct 14/9, 2009 at 19:14

optimizing for cache sizes is simplest buy taking any and all buffers used and tweaking their sizes in relation to the size and associativity of your test machines and investigating the effects. (it's more complex but this can give you a start). pipeline usage is likely to be entirely dependent on the jit unless you can alter the algorithm to do a sort of vectorisation whereby you take something like A1B1C1A2B2C2. where C depends on B which depends on A and do A1A2B1B2C1C2 instead arraycopy doesn't sound great, why do you have to work in scratch space, can you not work in the output buffer? – Keelson 14/9, 2009 at 22:33

So, I can't do anything besides optimize buffers to encourage cache use? The only buffers involved are input and output buffers. Bytes are processed in sequence from input, and either used to increase the length of an incompressible run of literals, or a backreference to previous bytes (compression). Once either run has to end, it is written out (using arraycopy for literal runs) to the output buffer. ---------------------------------- I think this can be vectorized by packing to int/long, but this would make the branching extremely complex and add un-used array accesses. – Obstruct 14/9, 2009 at 23:5

I've posted an early version of the code (not the most recent one, which uses a FOR loop to step through the input buffer predictably. That one is still in debugging. – Obstruct 14/9, 2009 at 23:29

The arraycopy for literal runs is likely not something you can beat unless the common literal length is small (I would think at least less than 16bytes, perhaps less). Encouraging good cache use is about making the data you access close, It is likely in your case that the main issue is how you structure your 'look back' strutures based on how they are interogatted – Keelson 15/9, 2009 at 11:3

Yeah, the arraycopy is actually pretty optimal in modern JVMs. The latest HotSpot releases use hand-tuned assembly here, and are about 2x as fast copying 32 items. I don't think I can optimize look-back structures much, without adding copying to a cache-able small buffer (additional copy cost), or something. I'll see with the restructured loop. – Obstruct 15/9, 2009 at 16:0

As far as bounds check elimination is concerned, I believe the new JDK will already include an improved algorithm that eliminates it, whenever it's possible. These are the two main papers on this subject:

V. Mikheev, S. Fedoseev, V. Sukharev, N. Lipsky. 2002 Effective Enhancement of Loop Versioning in Java. Link. This paper is from the guys at Excelsior, who implemented the technique in their Jet JVM.
Würthinger, Thomas, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot Client Compiler. PPPJ. Link. Slightly based on the above paper, this is the implementation that I believe will be included in the next JDK. The achieved speedups are also presented.

There is also this blog entry, which discusses one of the papers superficially, and also presents some benchmarking results, not only for arrays but also for arithmetic in the new JDK. The comments of the blog entry are also very interesting, since the authors of the above papers present some very interesting comments and discuss arguments. Also, there are some pointers to other similar blog posts on this subject.

Escarp answered 14/9, 2009 at 15:35 Comment(1)

This IS very interesting, and yet another reason JDK/JRE 1.7 will be nice, if it ever gets released. Between this, the asynchronous I/O APIs (halfway between stream I/O and NIO), and the new concurrency APIs, it should bring java performance much closer to optimized C. Here, have an upvote for posting something useful (even if not an answer). – Obstruct 14/9, 2009 at 19:51

It's rather unlikely that you need to help the JIT compiler much with optimizing a straightforward number crunching algorithm like LZW. ShuggyCoUk mentioned this, but I think it deserves extra attention:

The cache-friendliness of your code will be a big factor.

You have to reduce the size of your woking set and improve data access locality as much as possible. You mention "packing bytes into ints for performance". This sounds like using ints to hold byte values in order to have them word-aligned. Don't do that! The increased data set size will outweigh any gains (I once converted some ECC number crunching code from int[] to byte[] and got a 2x speed-up).

On the off chance that you don't know this: if you need to treat some data as both bytes and ints, you don't have to shift and |-mask it - use ByteBuffer.asIntBuffer() and related methods.

With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?

Better do the benchmark yourself. When I did it way back when in Java 1.3 times, it was somewhere around 2000 elements.

Ceiba answered 14/9, 2009 at 20:10 Comment(4)

This is LZF, not LZW... so speed is essential. I've edited the original post to include the most recent stable version of the compression code, and a link to the earliest version (less optimized). Byte-to-int packing is used for the look-ahead, which is used in the hashtable of positions for 3-byte groups, and for checking if a candidate backref is usable. – Obstruct 14/9, 2009 at 23:27

Also: I am absolutely positive the figure is less than 32 elements now, because it as about 2x as fast using arraycopy vs. loop with this many elements. Then again, the loop wasn't fully optimal. – Obstruct 15/9, 2009 at 4:44

I believe the jit was changed a while back to allow the array copy to be special cased. I can't pull up a reference anywhere apart from ibm.com/developerworks/java/library/j-devrtj2.html which isn't clear when it changed. Of course this is per JVM as well – Keelson 15/9, 2009 at 11:9

ah yes - here's some more discussion on this: mail-archive.com/[email protected]/msg10172.html – Keelson 15/9, 2009 at 11:10

Lots of answers so far, but couple of additional things:

Measure, measure, measure. As much as most Java developers warn against micro-benchmarking, make sure you compare performance between changes. Optimizations that do not result in measurable improvements are generally not worth keeping (of course, sometimes it's combination of things, and that gets trickier)
Tight loops matter as much with Java as with C (and ditto with variable allocations -- although you don't directly control it, HotSpot will eventually have to do it). I manage to practically double the speed of UTF-8 decoding by rearranging tight loop for handling single-byte case (7-bit ascii) as tight(er) inner loop, leaving other cases out.
Do not underestimate cost of allocating and/or clearing large arrays -- if you want lzf encoding/decoding to be faster for small/medium chunks too (not just megabyte sized), keep in mind that ALL allocations of byte[]/int[] are somewhat costly; not because of GC, but because JVM MUST clear the space.

H2 implementation has also been optimized quite a bit (for example: it does not clear the hash array any more, this often makes sense); and I actually helped modify it for use in another Java project. My contribution was mostly just changing it do be more optimal for non-streaming case, but that did not touch the tight encode/decode loops.

Lonesome answered 8/12, 2009 at 5:38 Comment(1)

I'm well aware of the recent H2 optimizations, and am partially responsible: several of the optimizations are actually based on a draft version of my code I sent around the time I posted this. This includes the re-use of the "hash" variable when checking for a possible backref, the cleaner loop structure, and not initializing the hashtable. Out of curiosity, what were your modifications? As a teaser: look for a couple higher-compression options and faster decompression in the H2 code when stuff gets out of code review. – Obstruct 8/12, 2009 at 6:1

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++