Very Strange Efficiency Quirks while Sorting

Asked 7/10, 2016 at 23:43 Answered 9/10, 2016 at 17:4

Solved java algorithm performance sorting insertion-sort

I am currently taking a Data Structures class and, as you may expect, one of the things we have to do is write some of the common sorts. In writing my insertion sort algorithm, I noticed in ran significantly faster than that of my instructor's (for 400000 data points it took my algorithm about 30 seconds and his about 90). I emailed him my code and the same results happened when they were both running on the same machine. We managed to waste more than 40 minutes slowly changing his sorting method into mine until it was exactly the same, word for word, except for one seemingly arbitrary thing. First, here is my insertion sort code:

public static int[] insertionSort(int[] A){

    //Check for illegal cases
    if (A == null || A.length == 0){

        throw new IllegalArgumentException("A is not populated");

    }

    for(int i = 0; i < A.length; i++){

        int j = i;

        while(j > 0 && A[j - 1] > A[j]){

            int temp = A[j];
            A[j] = A[j - 1];
            A[j - 1] = temp;

            j--;

        }

    }

    return A;

}

Now at this point his code was exactly the same as mine except for the lines where we swap A[j] and A[j - 1]. His code did the following:

int temp = A[j - 1];
A[j - 1] = A[j];
A[j] = temp;

We found that those 3 lines are the culprits. My code was running significantly faster because of this. Perplexed, we ran javap -c to get the byte code for a simple program that just had a main which contained an array declaration, a variable declaration for int j and the 3 lines of code for swapping as I wrote them and as he wrote them. Here is the byte code for my swapping method:

    Compiled from "me.java"
public class me {
  public me();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: sipush        10000
       3: newarray       int
       5: astore_1
       6: bipush        10
       8: istore_2
       9: aload_1
      10: iload_2
      11: iaload
      12: istore_3
      13: aload_1
      14: iload_2
      15: aload_1
      16: iload_2
      17: iconst_1
      18: isub
      19: iaload
      20: iastore
      21: aload_1
      22: iload_2
      23: iconst_1
      24: isub
      25: iload_3
      26: iastore
      27: return
}

And my instructor's method's byte code:

    Compiled from "instructor.java"
public class instructor {
  public instructor();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: sipush        10000
       3: newarray       int
       5: astore_1
       6: bipush        10
       8: istore_2
       9: aload_1
      10: iload_2
      11: iconst_1
      12: isub
      13: iaload
      14: istore_3
      15: aload_1
      16: iload_2
      17: iconst_1
      18: isub
      19: aload_1
      20: iload_2
      21: iaload
      22: iastore
      23: aload_1
      24: iload_2
      25: iload_3
      26: iastore
      27: return
}

I don't see any real difference between these byte codes. What might cause this strange behavior (my code still ran ~3 times faster than his and as to be expected this difference got more drastic as we feed the algorithms larger arrays)? Is this simply a strange quirk of Java. Furthermore, does this happen on your computer? For reference, this was run on a MacBook Pro mid 2014 and my code is exactly as it appears here and his code was deduced to exactly the code as it appears here except for those 3 lines.

[EDIT] Here are my test classes:

public class Tester1 {

    public static void main(String[] args){

        int[] A = new int[400000];

        for(int i = 0; i < A.length; i++){

            A[i] = (int) (Math.random() * Integer.MAX_VALUE);

        }

        double start = System.currentTimeMillis();
        insertionSort(A);
        System.out.println("My insertion sort took " + (System.currentTimeMillis() - start) + " milliseconds.");


    }

    public static int[] insertionSort(int[] A){

        //Check for illegal cases
        if (A == null || A.length == 0){

            throw new IllegalArgumentException("A is not populated");

        }

        for(int i = 0; i < A.length; i++){

            int j = i;

            while(j > 0 && A[j - 1] > A[j]){

                int temp = A[j];
                A[j] = A[j - 1];
                A[j - 1] = temp;

                j--;

            }

        }

        return A;

    }

}

And the second file:

public class Tester2 {

    public static void main(String[] args){

        int[] A = new int[400000];

        for(int i = 0; i < A.length; i++){

            A[i] = (int) (Math.random() * Integer.MAX_VALUE);

        }

        double start = System.currentTimeMillis();
        otherInsertion(A);
        System.out.println("Other insertion sort took " + (System.currentTimeMillis() - start) + " milliseconds.");


    }


    public static int[] otherInsertion(int[] A){

        //Check for illegal cases
        if (A == null || A.length == 0){

            throw new IllegalArgumentException("A is not populated");

        }

        for(int i = 0; i < A.length; i++){

            int j = i;

            while(j > 0 && A[j - 1] > A[j]){

                int temp = A[j - 1];
                A[j - 1] = A[j];
                A[j] = temp;

                j--;

            }

        }

        return A;

    }

}

And the outputs (with no arguments, just java Tester1 and java Tester2):

My insertion sort took 37680.0 milliseconds.
Other insertion sort took 86358.0 milliseconds.

These were run as 2 separate files in 2 different JVMs.

Allround answered 7/10, 2016 at 23:43 Comment(10)

Interesting. Can you post one of your test classes? – Germ 7/10, 2016 at 23:56

@JonKiparsky I did. Also, I had it swapped around which is efficient. Assigning temp to A[j - 1] seemed to be more efficient. Also, a no temp swap makes it even worse. – Allround 8/10, 2016 at 0:10

Odd, is there anything obvious in the corresponding native code? – Safelight 8/10, 2016 at 0:13

(Good thing you didn't start to invest time in earnest to investigate results of sub-second measurement runs.) With multi-second runs on a JIT-enabled VM, inspecting the static byte code is close to pointless - compiled out of context or not. – Araroba 8/10, 2016 at 0:13

For things like this it's best to use a microbenchmark tool like caliper. trajano.net/2014/08/… – Contempt 8/10, 2016 at 0:13

FWIW, I just ran this code on a Windows machine JDK 8u101, both sorts take approximately the same amount of time. This sounds suspiciously like a branch misprediction problem, since the "other insertion sort" does fewer switches between indices, but I'm not familiar enough with that to know for sure. I wouldn't expect that to be nearly this dramatic though. – Enjambment 8/10, 2016 at 0:14

I can't help but be reminded of Perlisism 13: If two people write exactly the same program, each should be put into microcode and then they certainly won't be the same. Alan J. Perlis was a wise man. Second, your code stores the value indexed [j-1] last, which is not read on the next iteration. On the contrary, your instructor's code does read what was just stored in the previous iteration, so perhaps it suffers from what's called a load-hit-store. – Heteromerous 8/10, 2016 at 0:27

Sorry, I'm mistaken. j counts backwards, so [j-1] is read on the next iteration. So load-hit-stores weren't it. FWIW, here's OpenJDK's machine code dump. – Heteromerous 8/10, 2016 at 1:8

Your code is faster about 15% in my environment (Windows, JDK1.8.0_102) – Postiche 8/10, 2016 at 1:45

@saka1029 Maybe the Mac is using a smarter implementation than Windows. I am noticing that the original code was faster as well. See my answer. – Contempt 8/10, 2016 at 2:28

This is the effect of loop unrolling optimization along with common subexpression elimination. Depending on the order of array access instructions, JIT can eliminate redundant loads in one case, but not in the other.

Let me explain in details. In both cases JIT unrolls 4 iterations of the inner loop.

E.g. for your case:

    while (j > 3) {
        if (A[j - 1] > A[j]) {
            int temp = A[j];
            A[j] = A[j - 1];
            A[j - 1] = temp;         \
        }                             A[j - 1] loaded immediately after store
        if (A[j - 2] > A[j - 1]) {   /
            int temp = A[j - 1];
            A[j - 1] = A[j - 2];
            A[j - 2] = temp;         \
        }                             A[j - 2] loaded immediately after store
        if (A[j - 3] > A[j - 2]) {   /
            int temp = A[j - 2];
            A[j - 2] = A[j - 3];
            A[j - 3] = temp;         \
        }                             A[j - 3] loaded immediately after store
        if (A[j - 4] > A[j - 3]) {   /
            int temp = A[j - 3];
            A[j - 3] = A[j - 4];
            A[j - 4] = temp;
        }
        j -= 4;
    }

Then JIT eliminates redundant array loads, and the resulting assembly looks like

0x0000000002d53a70: movslq %r11d,%r10
0x0000000002d53a73: lea    0x0(%rbp,%r10,4),%r10
0x0000000002d53a78: mov    0x10(%r10),%ebx    ; ebx = A[j]
0x0000000002d53a7c: mov    0xc(%r10),%r9d     ; r9d = A[j - 1]

0x0000000002d53a80: cmp    %ebx,%r9d          ; if (r9d > ebx) {
0x0000000002d53a83: jle    0x0000000002d539f3 
0x0000000002d53a89: mov    %r9d,0x10(%r10)    ;     A[j] = r9d
0x0000000002d53a8d: mov    %ebx,0xc(%r10)     ;     A[j - 1] = ebx
                                              ; }
0x0000000002d53a91: mov    0x8(%r10),%r9d     ; r9d = A[j - 2]

0x0000000002d53a95: cmp    %ebx,%r9d          ; if (r9d > ebx) {  
0x0000000002d53a98: jle    0x0000000002d539f3                     
0x0000000002d53a9e: mov    %r9d,0xc(%r10)     ;     A[j - 1] = r9d    
0x0000000002d53aa2: mov    %ebx,0x8(%r10)     ;     A[j - 2] = ebx
                                              ; }             
0x0000000002d53aa6: mov    0x4(%r10),%r9d     ; r9d = A[j - 3]    

0x0000000002d53aaa: cmp    %ebx,%r9d          ; if (r9d > ebx) {  
0x0000000002d53aad: jle    0x0000000002d539f3                     
0x0000000002d53ab3: mov    %r9d,0x8(%r10)     ;     A[j - 2] = r9d
0x0000000002d53ab7: mov    %ebx,0x4(%r10)     ;     A[j - 3] = ebx
                                              ; }                 
0x0000000002d53abb: mov    (%r10),%r8d        ; r8d = A[j - 4]

0x0000000002d53abe: cmp    %ebx,%r8d          ; if (r8d > ebx) {
0x0000000002d53ac1: jle    0x0000000002d539f3  
0x0000000002d53ac7: mov    %r8d,0x4(%r10)     ;     A[j - 3] = r8
0x0000000002d53acb: mov    %ebx,(%r10)        ;     A[j - 4] = ebx
                                              ; }
0x0000000002d53ace: add    $0xfffffffc,%r11d  ; j -= 4
0x0000000002d53ad2: cmp    $0x3,%r11d         ; while (j > 3)
0x0000000002d53ad6: jg     0x0000000002d53a70

Your instructor's code will look different after loop unrolling:

    while (j > 3) {
        if (A[j - 1] > A[j]) {
            int temp = A[j - 1];
            A[j - 1] = A[j];
            A[j] = temp;         <-- another store instruction between A[j - 1] access
        }
        if (A[j - 2] > A[j - 1]) {
            int temp = A[j - 2];
            A[j - 2] = A[j - 1];
            A[j - 1] = temp;
        }
        ...

JVM will not eliminate the subsequent load of A[j - 1], because there is another store instruction after the previous load of A[j - 1] (though in this particular case such optimization is theoretically possible).

So, the assembly code will have more load instructions, and the performance will be worse:

0x0000000002b53a00: cmp    %r8d,%r10d          ; if (r10d > r8d) {
0x0000000002b53a03: jle    0x0000000002b53973
0x0000000002b53a09: mov    %r8d,0xc(%rbx)      ;     A[j - 1] = r8d
0x0000000002b53a0d: mov    %r10d,0x10(%rbx)    ;     A[j] = r10d
                                               ; }
0x0000000002b53a11: mov    0xc(%rbx),%r10d     ; r10d = A[j - 1]
0x0000000002b53a15: mov    0x8(%rbx),%r9d      ; r9d = A[j - 2]

0x0000000002b53a19: cmp    %r10d,%r9d          ; if (r9d > r10d) {
0x0000000002b53a1c: jle    0x0000000002b53973
0x0000000002b53a22: mov    %r10d,0x8(%rbx)     ;     A[j - 2] = r10d
0x0000000002b53a26: mov    %r9d,0xc(%rbx)      ;     A[j - 1] = r9d    
                                               ; }
0x0000000002b53a2a: mov    0x8(%rbx),%r8d      ; r8d = A[j - 2]
0x0000000002b53a2e: mov    0x4(%rbx),%r10d     ; r10d = A[j - 3]

Note, that if you run JVM with loop unrolling optimization disabled (-XX:LoopUnrollLimit=0), the performance of both cases will be the same.

P.S. Full disassembly of both methods is available here, obtained using
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

Breakfront answered 8/10, 2016 at 2:23 Comment(8)

Well. Something even stranger has happened. I ran the same testing code today first with no parameters and my results were swapped: My insertion sort took 56634.0 milliseconds. Other insertion sort took 146979.0 milliseconds. As well, when I ran with the -XX:LoopUnrollLimit=0, this only made mine slower: My insertion sort took 75795.0 milliseconds. Other insertion sort took 141911.0 milliseconds. The code didn't change. My code is still the int temp = A[j]. I used the javac compiler for those tests. – Allround 8/10, 2016 at 11:46

@DylanSiegler Never benchmark two algorithms in one method or even in one JVM. This may lead to unpredictable results. Here is a good article describing why this works wrong. – Breakfront 8/10, 2016 at 12:4

When I separated the tests into 2 files and ran them the same results persisted. My code ran a bit faster (49.7 seconds) and the other code also ran a bit faster (111.6 seconds) but that is still a drastic difference between the 2 methods. – Allround 8/10, 2016 at 12:16

@DylanSiegler Yes, that's exactly what I observe, too. The above answer explains the difference. – Breakfront 8/10, 2016 at 13:20

I want to believe your explanation, but I ran it wth the -XX:LoopUnrollLimit=0 option in separate files (testing code is in the question and updated), but still I have found my code takes ~30 seconds, (6 seconds faster than with loop unrolling) the other takes ~90 seconds (4 seconds slower than with loop unrolling). – Allround 8/10, 2016 at 14:2

@DylanSiegler Seems like the option is not applied. May be you run it the wrong way - I don't know. If you wish to investigate this further, try to get the -XX:+PrintAssembly dump of the generated code. – Breakfront 8/10, 2016 at 19:51

How can I see what parameters I'm running the code with? Currently I do a regular compile then do java Tester1 -XX:LoopUnrollLimit=0 – Allround 8/10, 2016 at 20:0

@DylanSiegler Ah, I see. JVM options should go before main class name. Everything afer main class are application arguments. Try java -XX:LoopUnrollLimit=0 Tester1 – Breakfront 8/10, 2016 at 20:6

You will not see anything in the Java bytecode to explain. Things like this can be the effect of either the machine instructions (as created by the just-in-time JIT compiler) or even the microprocessor (particularly the effect of one of the processor caches).

One thing to note here is that in this sort algorithm, for each iteration of the loop, only A[j] needs to be loaded since A[j-1] was loaded in the last iteration. So the most recently loaded value was A[j] (assuming the fact that I just stated has been taken advantage at some level). Therefore, within the loop, the algorithm that stores A[j] first may behave differently because that values was most recently loaded by the loop check, so that value is more likely to remain in a register or in a processor cache, or be subject to just a single memory load due to optimizations in the JIT-generated machine code.

It is true (as the above answer demonstrates) that is very difficult to know exactly what series of machine instructions are being executed (the JIT has several different levels of compilation, each with different optimizations, based on how often a method is executed, and there are many different JIT optimizations which may interact resulting in different machine code). It is also difficult to know which memory accesses must go to RAM and which are avoided by memory cache hits in the processor. No doubt some combination of the above effects are affecting the performance here. The order of an algorithm has the most effect on performance, but beyond that, there are many optimizations in the JIT and the microprocessor that can affect performance, not all of them obvious to predict.

My instinct is that this is the result of the processor's memory cache working better with your sequence of operations, in part due to the fact that only A[j] needs to be loaded with each loop iteration.

Fellow answered 9/10, 2016 at 17:4 Comment(0)

-1

TL;DR

Your experiment isn't valid, there are many variables that may affect the result. It's best that you utilize a microbenchmarking tool such as Caliper or JMH. I used such a tool for checking which method is faster to create indentation

The difference between yours and your professor's is negligible.

The experiment

For my experiment I had 745,038 data points. I created 3 tests, yours, your instructor's version and Arrays.sort() that is part of the JDK.

https://microbenchmarks.appspot.com/runs/8b8c0554-d3f1-4339-af5a-fdffd18dd053

Based on the results your runtime was: 1,419,867.808 ns Your instructor was: 1,429,798.824 ns

So we're talking about 0.01 ms.

The instructor's just had less variance between the runs.

The JDK Arrays.sort() was slower by a larger magnitude at 1,779,042.513 ns which is 0.300ms slower than yours.

Here is the code I used to do the microbenchmark in Caliper below.

package net.trajano.caliper.test;

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.google.caliper.BeforeExperiment;
import com.google.caliper.Benchmark;
import com.google.caliper.api.VmOptions;
import com.google.caliper.runner.CaliperMain;

@VmOptions("-XX:-TieredCompilation")
public class SortBenchmark {

    public static int[] insertionSort(final int[] A) {

        // Check for illegal cases
        if (A == null || A.length == 0) {

            throw new IllegalArgumentException("A is not populated");

        }

        for (int i = 0; i < A.length; i++) {

            int j = i;

            while (j > 0 && A[j - 1] > A[j]) {

                final int temp = A[j - 1];
                A[j - 1] = A[j];
                A[j] = temp;

                j--;

            }

        }

        return A;

    }

    public static int[] insertionSortInstructor(final int[] A) {

        // Check for illegal cases
        if (A == null || A.length == 0) {

            throw new IllegalArgumentException("A is not populated");

        }

        for (int i = 0; i < A.length; i++) {

            int j = i;

            while (j > 0 && A[j - 1] > A[j]) {

                final int temp = A[j];
                A[j] = A[j - 1];
                A[j - 1] = temp;

                j--;

            }

        }

        return A;

    }

    @BeforeExperiment
    void setUp() throws IOException {
        try (final DataInputStream dis = new DataInputStream(
                Files.newInputStream(Paths.get("C:/Program Files/iTunes/iTunes.exe")))) {
            final List<Integer> list = new ArrayList<Integer>();
            while (true) {
                try {
                    list.add(dis.readInt());
                } catch (final EOFException e) {
                    break;
                }
            }

            data = list.stream().mapToInt(i -> i).toArray();
            System.out.println("Data size = " + data.length);
        }
    }

    // data to sort
    private static int[] data;

    @Benchmark
    public void insertionSort(final int reps) {
        for (int i = 0; i < reps; i++) {
            insertionSort(data);
        }
    }

    @Benchmark
    public void insertionSortInstructor(final int reps) {
        for (int i = 0; i < reps; i++) {
            insertionSortInstructor(data);
        }
    }

    @Benchmark
    public void jdkSort(final int reps) {
        for (int i = 0; i < reps; i++) {
            Arrays.sort(data);
        }
    }

    public static void main(final String[] args) {
        CaliperMain.main(SortBenchmark.class, args);
    }
}

Aside

To be honest I was surprised at the result, that the JDK was slower. So I took a look at the source code. It appears there's three sorting algorithms being used by the JDK depending on the thresholds (merge sort, quicksort for less than 286 elements and insertion sort for less than 47 elements).

Since the data set I had was pretty large to begin with, merge sort goes first which has O(n) space complexity in the form of a second copy of the array. As such it is likely extra heap allocations that are causing the extra time.

Contempt answered 8/10, 2016 at 2:19 Comment(1)

I just ran my testing code again on MacOS 10.11 Java 8u77b03 using the javac compiler and these were my results (what's even stranger is now my code is running significantly faster than my instructor's): My insertion sort took 56634.0 milliseconds. Other insertion sort took 146979.0 milliseconds. – Allround 8/10, 2016 at 11:52

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

TL;DR

The experiment

Aside

Recommended topics

Hot tags