Execution of bubble sort is 5 times slower with --indy

I wrote an implementation of bubble sort to play around a bit with Groovy and see if --indy have any noticeable effect on the performance.

Essentially, it sorts a list of one thousand random integers one thousand times and measures the average execution time for sorting the list.

Half of the time the list is an Integer[], the other half it's an ArrayList<Integer>.

The results are really confusing me:

$ groovyc BubbleSort.groovy
$ time java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3.jar:. BubbleSort
Average: 22.926ms
Min: 11.202ms
[...] 26.48s user 0.84s system 109% cpu 25.033 total

$ groovyc --indy BubbleSort.groovy
$ time java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3-indy.jar:. BubbleSort
Average: 119.766ms
Min: 68.251ms
[...] 166.05s user 1.52s system 135% cpu 2:03.82 total

Looking at CPU usage when the benchmarks are running, CPU usage is a lot higher when compiled with --indy than without.

Cpu Usage

This intrigued me so I ran the benchmarks again - but this time with the Yourkit agent and CPU tracing enabled. Here are the recorded call trees:

Without --indy:

With --indy:

And here are the performance charts - note that the time scale is different because the --indy code is so much slower.

Without --indy (1s scale):

With --indy (60s scale):

As can be seen, the CPU usage stabilises at 100% of one core (12.5% in the graph) when compiled without --indy but varies wildly between 12.5% and ~35% when compiled with --indy. What's even more confusing is that Yourkit only reports one live thread (and my code only uses the main thread), but it still manages to keep two and a half cores occupied.

The code compiled with --indy also uses a lot of kernel time at the start, though this drops and stabilises at 0% after a while - at which point the code seems to speed up a bit (heap usage growth-rate increases) and CPU usage increases.

Can anyone explain this behaviour to me?

Versions:

$ groovy -v
Groovy Version: 2.4.3 JVM: 1.8.0_45 Vendor: Oracle Corporation OS: Linux

$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

BubbleSort.groovy:

class BubbleSort {
    final def array

    BubbleSort(final def array) {
        this.array = array
    }

    private void swap(int a, int b) {
        def tmp = array[a];
        array[a] = array[b]
        array[b] = tmp;
    }

    private void rise(int index) {
        for(int i = index; i > 0; i--) {
            if(array[i] < array[i - 1]) {
                swap(i, i-1)
            } else {
                break
            }
        }
    }

    void sort() {
        for(int i = 1; i < array.size(); i++) {
            rise i
        }
    }

    final static Random random = new Random()

    static void main(String[] args) {
        def n = 1000
        def size = 1000
        // Warm up
        doBenchmark 100, size
        def results = doBenchmark n, size
        printf("Average: %.3fms%n", results.total / 1e6 / n)
        printf("Min: %.3fms%n", results.min / 1e6)
    }

    private static def doBenchmark(int n, int size) {
        long total = 0
        long min = Long.MAX_VALUE
        n.times {
            def array = (1..size).collect { random.nextInt() }
            if(it % 2) {
                array = array as Integer[]
            }
            def start = System.nanoTime()
            new BubbleSort<Integer>(array).sort()
            def end = System.nanoTime()
            def time = end - start
            total += time
            min = Math.min min, time
        }
        return [total: total, min: min]
    }
}

I'm not interested in optimisations on my bubble sort implementation unless they are related to invokedynamic behaviour - the aim here isn't to write the best performing bubble sort possible but to come to grips with why --indy has such a large negative performance impact.

Update:

I converted my code to JRuby and tried the same thing and the results are similar, although JRuby isn't nearly as quick without invokedynamic:

$ JAVA_OPTS="-Djruby.compile.invokedynamic=false" jruby bubblesort.rb
Average: 78.714ms
Min: 35.000ms

$ JAVA_OPTS="-Djruby.compile.invokedynamic=true" jruby bubblesort.rb
Average: 136.287ms
Min: 92.000ms

Update 2:

If I remove the code that changes the list to an Integer[] half the time performance increases significantly, though it's still faster without --indy:

$ groovyc BubbleSort.groovy
$ java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3.jar:. BubbleSort
Average: 29.794ms
Min: 26.577ms

$ groovyc --indy BubbleSort.groovy
$ java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3-indy.jar:. BubbleSort
Average: 37.506ms
Min: 33.555ms

If I do the same with JRuby, invokedynamic is faster:

$ JAVA_OPTS="-Djruby.compile.invokedynamic=false" jruby bubblesort.rb
Average: 34.388ms
Min: 31.000ms

$ JAVA_OPTS="-Djruby.compile.invokedynamic=true" jruby bubblesort.rb
Average: 20.785ms
Min: 18.000ms

The answer is easy actually quite easy, Groovy does not have a PIC yet.

... or you can say that we usually have an inline cache of size 1. This means every time you change the list array type, it will invalidate all caches that did exist before and the cached version is thrown away. That is for normal Groovy almost the same as for indy, only that normal Groovy uses runtime generated classes and indy uses invokedynamic/lambda forms. But lambda forms are initially slower, while peak performance is better. Basically what you do is let hotspot start from scratch for most of the method calls, preventing it to apply the optimizations all the time. Of course that is not your fault, but the fault of Groovy for not having a PIC yet. and just to make this very clear... this is no problem of the language, it is simply something I did not yet get to implement.

JRuby on the other hand has a PIC and thus has not to suffer from the overhead of creating new method handles all the time.

Recommended topics

Hot tags