unexplained 10%+ performance boost from simply adding a method argument (slimmer jit code)

Asked 10/4, 2018 at 12:52 Answered 30/4, 2018 at 13:23

(note: proper answer must go beyond reproduction).

After millions of invocations, quicksort1 is definitely faster than quicksort2, which have identical code aside from this 1 extra arg.

The code is at the end of the post. Spoiler: I also found the jit code is fatter by 224 bytes even if it should be actually simpler (like the byte code size tells; see very last update below).

Even after trying to factor out this effect with some microbenchmark harness (JMH), the performance difference is still there.

I'm asking: WHY is there a such a difference in native code generated and what is it doing?

By adding an argument to a method, it makes it faster...! I know about gc/jit/warmup/etc effects. You can run code as-is, or with larger/smaller iteration counts. Actually, you should even comment out one then the other perf test and run each in distinct jvm instance just to prove it's not an interference between each other.

The bytecode doesn't show much difference, aside the obvious getstatic for sleft/sright but also a strange 'iload 4' instead of "iload_3" (and istore 4/istore_3)

What the heck is going on? Is the iload_3/istore_3 really slower than iload 4/istore 4? And that much slower that even the added getstatic call is still not making it slower? I can guess that static fields are unused so the jit might just skip it.

Anyway, there is no ambiguity on my side as it is always reproducable, and I'm looking for the explanation as to why the javac/jit did what they did, and why the performance is affected so much. These are identical recursive algo with same data, same memory churn, etc... I couldn't make an more isolated change if I wanted to, to show a significant replicable runtime difference.

Env:

java version "1.8.0_161" 
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
(also tried and reproduced on java9)
on a 4 core i5 laptop 8GB ram.
windows 10 with the meltdown/specter patch.

With -verbose:gc -XX:+PrintCompilation, there is no gc and jit compilation has stabilised in C2 (tier 4).

With n=20000:

main]: qs1: 1561.3336199999999 ms (res=null)
main]: qs2: 1749.748416 ms (res=null)

main]: qs1: 1422.0767509999998 ms (res=null)
main]: qs2: 1700.4858689999999 ms (res=null)

main]: qs1: 1519.5280269999998 ms (res=null)
main]: qs2: 1786.2206899999999 ms (res=null)

main]: qs1: 1450.0802979999999 ms (res=null)
main]: qs2: 1675.223256 ms (res=null)

main]: qs1: 1452.373318 ms (res=null)
main]: qs2: 1634.581156 ms (res=null)

BTW, beautiful java9 seems to make both faster but still 10-15% off from each other.:

[0.039s][info][gc] Using G1
main]: qs1: 1287.062819 ms (res=null)
main]: qs2: 1451.041391 ms (res=null)

main]: qs1: 1240.800305 ms (res=null)
main]: qs2: 1391.2404299999998 ms (res=null)

main]: qs1: 1257.1707159999999 ms (res=null)
main]: qs2: 1433.84716 ms (res=null)

main]: qs1: 1233.7582109999998 ms (res=null)
main]: qs2: 1394.7195849999998 ms (res=null)

main]: qs1: 1250.885867 ms (res=null)
main]: qs2: 1413.88437 ms (res=null)

main]: qs1: 1261.5805739999998 ms (res=null)
main]: qs2: 1458.974334 ms (res=null)

main]: qs1: 1237.039902 ms (res=null)
main]: qs2: 1394.823751 ms (res=null)

main]: qs1: 1255.14672 ms (res=null)
main]: qs2: 1400.693295 ms (res=null)

main]: qs1: 1293.009808 ms (res=null)
main]: qs2: 1432.430952 ms (res=null)

main]: qs1: 1262.839628 ms (res=null)
main]: qs2: 1421.376644 ms (res=null)

CODE (INCLUDING TESTS):

(Please don't pay attention to how bad this quicksort is; it's beside the question).

import java.util.Random;
import java.util.concurrent.Callable;

public class QuicksortTrimmed {

    static void p(Object msg) {
        System.out.println(Thread.currentThread().getName()+"]: "+msg);
    }

    static void perf(int N, String msg, Callable c) throws Exception {
        Object res = null;
        long t = System.nanoTime();
        for(int i=0; i<N; i++) {
            res = c.call();
        }
        Double d = 1e-6*(System.nanoTime() - t);
        p(msg+": "+d+" ms (res="+res+")");
    }

    static String und = "__________";//10
    static {
        und += und;//20
        und += und;//40
        und += und;//80
        und += und;//160
    }

    static String sleft = "//////////";//10
    static {
        sleft += sleft;//20
        sleft += sleft;//40
        sleft += sleft;//80
        sleft += sleft;//160
    }

    static String sright= "\\\\\\\\\\\\\\\\\\\\";//10
    static {
        sright += sright;//20
        sright += sright;//40
        sright += sright;//80
        sright += sright;//160
    }

    //============

    public static void main(String[] args) throws Exception {
        int N = 20000;
        int n = 1000;
        int bound = 10000;
        Random r = new Random(1);
        for(int i=0; i<5; i++) {
            testperf(N, r, n, bound);
            //System.gc();
        }
    }

    static void testperf(int N, Random r, int n, int bound) throws Exception {
        final int[] orig = r.ints(n, 0, bound).toArray();
        final int[] a = orig.clone();

        perf(N, "qs1", () -> {
            System.arraycopy(orig, 0, a, 0, orig.length);
            quicksort1(a, 0, a.length-1, und);
            return null;
        });

        perf(N, "qs2", () -> {
            System.arraycopy(orig, 0, a, 0, orig.length);
            quicksort2(a, 0, a.length-1);
            return null;
        });
        System.out.println();
    }


    private static void quicksort1(int[] a, final int _from, final int _to, String mode) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort1(a, _from, j-1, sleft);
            if(j+1 < _to)
                quicksort1(a, j+1, _to, sright);
        }
    }

    private static void quicksort2(int[] a, final int _from, final int _to) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort2(a, _from, j-1);
            if(j+1 < _to)
                quicksort2(a, j+1, _to);
        }
    }

}

UPDATE:

I made the JMH test and it still proves quicksort1 is faster than quicksort2.

# Run complete. Total time: 00:13:38

Benchmark                    Mode  Cnt      Score    Error  Units
MyBenchmark.testQuickSort1  thrpt  200  14861.437 ± 86.707  ops/s
MyBenchmark.testQuickSort2  thrpt  200  13097.209 ± 46.178  ops/s

Here is the JMH benchmark:

package org.sample;

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

public class MyBenchmark {
    static String und = "__________";//10
    static {
        und += und;//20
        und += und;//40
        und += und;//80
        und += und;//160
    }

    static String sleft = "//////////";//10
    static {
        sleft += sleft;//20
        sleft += sleft;//40
        sleft += sleft;//80
        sleft += sleft;//160
    }

    static String sright= "\\\\\\\\\\\\\\\\\\\\";//10
    static {
        sright += sright;//20
        sright += sright;//40
        sright += sright;//80
        sright += sright;//160
    }

    static final int n = 1000;
    static final int bound = 10000;
    static Random r = new Random(1);
    static final int[] orig = r.ints(n, 0, bound).toArray();

    @State(Scope.Thread)
    public static class ThrState {
        int[] a;

        @Setup(Level.Invocation)
        public void setup() {
            a = orig.clone();
        }
    }

    //============

    @Benchmark
    public void testQuickSort1(Blackhole bh, ThrState state) {
        int[] a = state.a;
        quicksort1(a, 0, a.length-1, und);
        bh.consume(a);
    }

    @Benchmark
    public void testQuickSort2(Blackhole bh, ThrState state) {
        int[] a = state.a;
        quicksort2(a, 0, a.length-1);
        bh.consume(a);
    }


    private static void quicksort1(int[] a, final int _from, final int _to, String mode) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort1(a, _from, j-1, sleft);
            if(j+1 < _to)
                quicksort1(a, j+1, _to, sright);
        }
    }

    private static void quicksort2(int[] a, final int _from, final int _to) {
        int len = _to - _from + 1;
        if(len==2) {
            if(a[_from] > a[_to]) {
                int tmp = a[_from];
                a[_from] = a[_to];
                a[_to] = tmp;
            }
        } else { //len>2
            int mid = _from+len/2;
            final int pivot = a[mid];
            a[mid] = a[_to];
            a[_to] = pivot; //the pivot value is the 1st high value

            int i = _from;
            int j = _to;

            while(i < j) {
                if(a[i] < pivot)
                    i++;
                else if(i < --j) { //j is the index of the leftmost high value 
                    int tmp = a[i];
                    a[i] = a[j];  //THIS IS HARMFUL: maybe a[j] was a high value too.
                    a[j] = tmp;
                }
            }

            //swap pivot in _to with other high value in j
            int tmp = a[j];
            a[j] = a[_to];
            a[_to] = tmp;

            if(_from < j-1)
                quicksort2(a, _from, j-1);
            if(j+1 < _to)
                quicksort2(a, j+1, _to);
        }
    }

}

UPDATE:

At this moment, I ran and captured a jit log for jitwatch (I used the 1.3.0 tag and built from https://github.com/AdoptOpenJDK/jitwatch/tree/1.3.0)

-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:"./gc.log"
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintCompilation
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+UnlockDiagnosticVMOptions  -XX:+LogCompilation -XX:+PrintInlining

There are no obvious "suggestions" from jitwatch, just the rgular too big to inline or too deep, for both quicksort1 and quicksort2 anyway.

The one important discovery is the bytecode and native code difference:

With extra method argument (quicksort1): byte code = 187 bytes native code = 1872 bytes

Without extra method argument (quicksort2): byte code = 178 bytes (smaller by 9 bytes) native code = 2096 bytes (bigger by 224 bytes!!!)

This is a strong evidence that the jit code is fatter and slower in quicksort2.

So the question remains: what was the C2 jit compiler thinking? what rule made it create faster native code when I add a method argument and 2 static reference to load and pass?

I finally got my hand on the assembly code, but as I expected, it's nearly impossible to diff and understand what's happening. I followed the most recent instruction I could find from https://mcmap.net/q/47244/-how-to-see-jit-compiled-code-in-jvm . I have 7MB xml log file (compressed to 675kB) you can get it and see for 7 days (expire ~may 4th 2018) at https://wetransfer.com/downloads/65fe0e94ab409d57cba1b95459064dd420180427150905/612dc9 in case you can make sense of it (in jitwatch of course!).

The added string param leads to more compact assembly code. The questions (still unanswered) is why? what is different in the assembly code? what is the rule or optimization that isn't used in the slower code?

Weimaraner answered 10/4, 2018 at 12:52 Comment(13)

#504603 – Brouhaha 10/4, 2018 at 12:53

I'm testing over 40'000 times each in-place sort with some cpu and young gc overhead cost for the 4 random array and their 40'000 clones. Still, the effect is persistent. Again, these are not 2 distinct algos, they are identical, aside from a method argument which somehow makes it faster. Any hypothesis will help. – Weimaraner 10/4, 2018 at 13:5

A microbenchmark that fails will help others to reproduce your problem (or to invalidate it). – Peggiepeggir 10/4, 2018 at 13:30

@Peggiepeggir This is a microbenchmark and it fails (always for me). It's not just the subject code; it's the whole test. – Weimaraner 10/4, 2018 at 20:23

Replace Double d = 1e-6*(System.nanoTime() - t); with double d = (System.nanoTime() - t)/1e6; and you won't get rounding errors. – Myoglobin 10/4, 2018 at 21:45

More precisely: Use a jmh-microbenchmark. If it still behaves anomal, everybody can reproduce your result on his machine. – Peggiepeggir 10/4, 2018 at 21:59

I could not reproduce this on 1.8.0_162, but on Java 9. Perhaps, this depends on fragile environmental conditions. – Pilchard 11/4, 2018 at 13:47

@PeterLawrey do you really think, the difference between 1e-6*someLongValue and someLongValue/1e6 has any practical relevance? I mean, with double precision, we’re talking about an error of less than 1e-12, i.e. less than 1e-6 nanoseconds here… – Pilchard 11/4, 2018 at 15:21

JMH test added, and reproduced the problem. ~14800 ops/s vs ~13000 ops/s – Weimaraner 15/4, 2018 at 14:20

Jitwatch revealed that the native code is 224 bytes bigger in quicksort2. Still no clue as to why. – Weimaraner 23/4, 2018 at 17:45

The JIT compiler can have bugs. – Un 26/4, 2018 at 21:45

Where is your @Warmup? – Un 26/4, 2018 at 21:47

@ThomasRS, the default is 20 (as I observed the output, 10 forks of 20x1second in throughput mode is the 200 count in the results.), thus about 260'000 runs minimum in my results. – Weimaraner 26/4, 2018 at 23:32

I think I have observed something odd in the assembly code.

First, I added blank lines so that the quicksort1 starts at line 100 and quicksort2 starts at line 200. A lot simpler to line up assembly code.

I also changed the string arg to a int arg, just to test and prove the type is not the issue.

After a tedious task of lining up asm code in excel, here is the xls file: https://wetransfer.com/downloads/e56fd98fe248cef98f5a242b2db64f6920180430130753/7b8f2b (availaible for 7 days). (I'm sorry if I'm not consistent in my coloring, I got fed up...)

The pattern I see is that there are more mov ops to prepare the quicksort2. If I understand correctly, the inlining of native code would be longer, and because of recursion, it degenerate a few level but enough to cause the slowdown. I don't understand well enough the ops to guess beyond that.

In other words, when the last quicksort stackframes from the recursion return point up maybe 3 or 5 levels (hard to tell) can be inlined, then it resorts to jump. However, these bytecode frames of quicksort2 using more native code for unclear reasons, add up to hundreds of additional ops.

At this point, I'm 50% into the answer. C2 creates slightly fatter code, but gets inflated due to inlining of recursive tail frames.

I think I'm going to file a bug to oracle... This has been quite a challenge, but in the end, it's very disappointing that unused java code leads to better performance!

Weimaraner answered 30/4, 2018 at 13:23 Comment(0)

Reproduction and Analysis

I was able to reproduce your results. Machine data:

Linux #143-Ubuntu x86_64 GNU/Linux
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

I rewrote your code a bit and I have done some additional testing. Your test time includes the System.arraycopy() call. I created a 400Mb array structure and saved it:

int[][][] data = new int[iterations][testCases][];
for (int iteration = 0; iteration < data.length; iteration++) {
    for (int testcase = 0; testcase < data[iteration].length; testcase++) {
        data[iteration][testcase] = random.ints(numberCount, 0, bound).toArray();
    }
}

FileOutputStream fos = new FileOutputStream("test_array.dat");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(data);

After that I have run this tests (warmup, teardown run as well):

{
    FileInputStream fis = new FileInputStream(fileName);
    ObjectInputStream iis = new ObjectInputStream(fis);
    int[][][] data = (int[][][]) iis.readObject();


    perf("qs2", () -> {
        for (int iteration = 0; iteration < data.length; iteration++) {
            for (int testCase = 0; testCase < data[iteration].length; testCase++) {
                quicksort2(data[iteration][testCase], 0, data[iteration][testCase].length - 1);
            }
        }
        return null;
    });
}
{
    FileInputStream fis = new FileInputStream(fileName);
    ObjectInputStream iis = new ObjectInputStream(fis);
    int[][][] data = (int[][][]) iis.readObject();


    perf("qs1", () -> {
        for (int iteration = 0; iteration < data.length; iteration++) {
            for (int testCase = 0; testCase < data[iteration].length; testCase++) {
                quicksort1(data[iteration][testCase], 0, data[iteration][testCase].length - 1, und);
            }
        }
        return null;
    });
}

In case I run the qs1 and qs2 together:

main]: qs1: 6646.219874 ms (res=null)
main]: qs2: 7418.376646 ms (res=null)

The result is not execution order dependent:

main]: qs2: 7526.215395 ms (res=null)
main]: qs1: 6624.261529 ms (res=null)

I have run the code in new JVM instances as well:

Instance one:

main]: qs1: 6592.699738 ms (res=null)

Instance two:

main]: qs2: 7456.326028 ms (res=null)

If you try it without the JIT:

-Djava.compiler=NONE

The results are as "expected" (the smaller bytecode is faster):

main]: qs1: 56547.589942 ms (res=null)
main]: qs2: 53585.909246 ms (res=null)

For better analysis I extracted the codes to two different classes.

I was using jclasslib for bytecode inspection. The method lengths for me:

Q1: 505
Q2: 480

This makes sense for the execution without the JIT:

53585.909246×505÷480 = 56376.842019229

Which is really close to 56547.589942.

Reason

For me in the compilation output (using -XX:+PrintCompilation) I have these lines

1940  257       2       QS1::sort (185 bytes)
1953  258 %     4       QS1::sort @ 73 (185 bytes)
1980  259       4       QS1::sort (185 bytes)
1991  257       2       QS1::sort (185 bytes)   made not entrant
9640  271       3       QS2::sort (178 bytes)
9641  272       4       QS2::sort (178 bytes)
9654  271       3       QS2::sort (178 bytes)   made not entrant

Where the % means on stack replacement (where the compiled code is running). According to this log the call with the extra String parameter gets optimized and the second is not. I was thinking on better branch prediction, but this should not be the case here (tried to add randomli generated Strings as parameters). The sample sizes (400Mb) mostly rule out caching. I thought on optimization treshold, but when I use these options -Xcomp -XX:+PrintCompilation -Xbatch the output is the following:

 6408 3254    b  3       QS1::sort (185 bytes)
 6409 3255    b  4       QS1::sort (185 bytes)
 6413 3254       3       QS1::sort (185 bytes)   made not entrant
14580 3269    b  3       QS2::sort (178 bytes)
14580 3270    b  4       QS2::sort (178 bytes)
14584 3269       3       QS2::sort (178 bytes)   made not entrant

This means that the metods are fored blocking compiled before called but the times remain the same:

main]: qs1: 6982.721328 ms (res=null)
main]: qs2: 7606.077812 ms (res=null)

The key to this I think is the String. In case I change the extra (unused) parameter to int it is consistently processed slightly slower (running with the previous optimization parameters):

main]: qs1: 7925.472909 ms (res=null)
main]: qs2: 7727.628422 ms (res=null)

My conclusion is that the optimization may be influenced by the extra parameters object type. Probably there is less eager optimization in case of primitives which makes sense to me, but I could not find exact source for that claim.

An additional interesting read.

Thoroughgoing answered 27/4, 2018 at 10:17 Comment(7)

Thanks. For the same cpu cost I chose arraycopy-per-call to keep memory low. I concur in interpreted mode the discrepandy essentially gone. The printcompilation % means a partial (loop) compilation, and yes it implies a stackframe replacement, but just once at the switch. I have not noticed asymetry like you did. The more important thing is the native size reported by jitwatch. I hope to find the assembly code of each to guess what is different, but that's evidence for the jit gurus out there. – Weimaraner 27/4, 2018 at 12:55

And of course getting a debug build of jdk (for -XX:+PrintOptoAssembly) is next to impossible, and all the doc for the hsdis library is so obsolete, I gave up... 3 times now. The jit guys must using something secret or have a completely different way to see generated assembly code... – Weimaraner 27/4, 2018 at 14:30

Got my hand on the assembly code. See bottom of edited question. – Weimaraner 27/4, 2018 at 15:14

Thanks for the reproduction of the issue. Unfortunately your conclusion is just re-stating my question. As is, it doesn't explain anything. – Weimaraner 28/4, 2018 at 11:17

Also, I tried your type change (from String to int) and it still has the performance discrepancy. I even set that int to 0 in all calls, in case it could affect the jit decision but it doesn't. – Weimaraner 30/4, 2018 at 12:4

@Weimaraner it’s not Stackoverflow’s fault. The bounty has been awarded automatically after the elapsed time due to the upvotes. So you have to blame the users who upvoted… – Pilchard 7/5, 2018 at 8:38

@Holger: users have not programmed this automatic award. Besides, it was supposed to begin at 50 points, but somehow I was forced to award 100 points. It's just too broken to ever want to use again. – Weimaraner 7/5, 2018 at 18:29