Execution of bubble sort is 5 times slower with --indy

I wrote an implementation of bubble sort to play around a bit with Groovy and to see whether --indy has any noticeable effect on performance.

Essentially, it sorts a list of one thousand random integers one thousand times and measures the average execution time for sorting the list.

Half of the time the list is an Integer[], the other half it's an ArrayList<Integer>.

The results are really confusing me:

$ groovyc BubbleSort.groovy
$ time java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3.jar:. BubbleSort
Average: 22.926ms
Min: 11.202ms
[...] 26.48s user 0.84s system 109% cpu 25.033 total

$ groovyc --indy BubbleSort.groovy
$ time java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3-indy.jar:. BubbleSort
Average: 119.766ms
Min: 68.251ms
[...] 166.05s user 1.52s system 135% cpu 2:03.82 total

CPU usage while the benchmark is running is also a lot higher when compiled with --indy than without.

[CPU usage graph]

This intrigued me, so I ran the benchmarks again, this time with the YourKit agent attached and CPU tracing enabled. Here are the recorded call trees:

Without --indy: [call tree without --indy]

With --indy: [call tree with --indy]

And here are the performance charts - note that the time scale is different because the --indy code is so much slower.

Without --indy (1s scale): [performance chart without --indy]

With --indy (60s scale): [performance chart with --indy]

As can be seen, CPU usage stabilises at 100% of one core (12.5% in the graph) when compiled without --indy, but varies wildly between 12.5% and ~35% when compiled with --indy. What's even more confusing is that YourKit reports only one live thread (and my code only uses the main thread), yet it still manages to keep two and a half cores occupied.

The code compiled with --indy also uses a lot of kernel time at the start, though this drops and stabilises at 0% after a while - at which point the code seems to speed up a bit (heap usage growth-rate increases) and CPU usage increases.

Can anyone explain this behaviour to me?

Versions:

$ groovy -v
Groovy Version: 2.4.3 JVM: 1.8.0_45 Vendor: Oracle Corporation OS: Linux

$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

BubbleSort.groovy:

class BubbleSort {
    final def array

    BubbleSort(final def array) {
        this.array = array
    }

    private void swap(int a, int b) {
        def tmp = array[a]
        array[a] = array[b]
        array[b] = tmp
    }

    private void rise(int index) {
        for(int i = index; i > 0; i--) {
            if(array[i] < array[i - 1]) {
                swap(i, i-1)
            } else {
                break
            }
        }
    }

    void sort() {
        for(int i = 1; i < array.size(); i++) {
            rise i
        }
    }

    final static Random random = new Random()

    static void main(String[] args) {
        def n = 1000
        def size = 1000
        // Warm up
        doBenchmark 100, size
        def results = doBenchmark n, size
        printf("Average: %.3fms%n", results.total / 1e6 / n)
        printf("Min: %.3fms%n", results.min / 1e6)
    }

    private static def doBenchmark(int n, int size) {
        long total = 0
        long min = Long.MAX_VALUE
        n.times {
            def array = (1..size).collect { random.nextInt() }
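            // every other iteration converts the list to an Integer[] (the other half stays an ArrayList<Integer>)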
            if(it % 2) {
                array = array as Integer[]
            }
            def start = System.nanoTime()
            new BubbleSort(array).sort()
            def end = System.nanoTime()
            def time = end - start
            total += time
            min = Math.min min, time
        }
        return [total: total, min: min]
    }
}

I'm not interested in optimisations on my bubble sort implementation unless they are related to invokedynamic behaviour - the aim here isn't to write the best performing bubble sort possible but to come to grips with why --indy has such a large negative performance impact.

Update:

I converted my code to JRuby and tried the same thing and the results are similar, although JRuby isn't nearly as quick without invokedynamic:

$ JAVA_OPTS="-Djruby.compile.invokedynamic=false" jruby bubblesort.rb
Average: 78.714ms
Min: 35.000ms

$ JAVA_OPTS="-Djruby.compile.invokedynamic=true" jruby bubblesort.rb
Average: 136.287ms
Min: 92.000ms

Update 2:

If I remove the code that converts the list to an Integer[] half of the time, performance increases significantly, though it's still faster without --indy:

$ groovyc BubbleSort.groovy
$ java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3.jar:. BubbleSort
Average: 29.794ms
Min: 26.577ms

$ groovyc --indy BubbleSort.groovy
$ java -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3-indy.jar:. BubbleSort
Average: 37.506ms
Min: 33.555ms
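
For reference, "the code that changes the list to an Integer[]" is this conversion inside doBenchmark (quoted from the listing above); removing it means every iteration sorts an ArrayList<Integer>:

if(it % 2) {
    array = array as Integer[]
}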

If I do the same with JRuby, invokedynamic is faster:

$ JAVA_OPTS="-Djruby.compile.invokedynamic=false" jruby bubblesort.rb
Average: 34.388ms
Min: 31.000ms

$ JAVA_OPTS="-Djruby.compile.invokedynamic=true" jruby bubblesort.rb
Average: 20.785ms
Min: 18.000ms
Knowlton answered 23/4, 2015 at 3:10 Comment(5)
Ran your first example with 1.8.0_31 and 1.7.0_76; they both run equally fast (indy slightly faster with 1.8, slightly slower with 1.7). – Marceau
Your warmup doesn't do the same as your measured code, so it's not sufficient. Once you've fixed that, I suggest also doing runs with 10000 and 100000 elements and seeing how it develops… – Actino
Moved the benchmark to a separate method so the warmup and benchmark are executing the same code - no change at all. – Knowlton
Doubling the size of the list has the expected result of quadrupling the time for both benchmarks. I will leave a longer benchmark running overnight to see the effect on --indy with more items (I'll try to figure out a good number that will make it theoretically finish shortly before I arrive at work tomorrow morning). – Knowlton
Performance without --indy on 1.7.0_80 is unchanged. With --indy it gets significantly slower after a few hundred iterations (we're talking orders of magnitude). – Knowlton

The answer is actually quite easy: Groovy does not have a PIC (polymorphic inline cache) yet.

... or you can say that we usually have an inline cache of size 1. This means that every time you change the list/array type, any cache that existed before is invalidated and the cached version is thrown away. For normal Groovy that is almost the same as for indy, only that normal Groovy uses runtime-generated classes while indy uses invokedynamic/lambda forms. But lambda forms are initially slower, even though their peak performance is better. Basically, what you are doing is letting HotSpot start from scratch for most of the method calls, preventing it from ever applying its optimizations. Of course that is not your fault, but the fault of Groovy for not having a PIC yet. And just to make this very clear... this is not a problem of the language, it is simply something I have not yet gotten around to implementing.

JRuby, on the other hand, has a PIC and thus does not have to suffer the overhead of creating new method handles all the time.
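
To illustrate the point (an added sketch, not Groovy's actual call-site machinery): with a cache of size 1, every change of receiver type discards the cached target and forces a relink, so the call site never stays stable long enough for HotSpot to optimise it.

class SizeOneCallSite {
    Class cachedType
    Closure cachedTarget
    int relinks = 0

    def invokeGetAt(receiver, int index) {
        if (receiver.getClass() != cachedType) {
            relinks++                          // stands in for building a new method handle
            cachedType = receiver.getClass()
            cachedTarget = { r, i -> r[i] }    // target for the current receiver type
        }
        cachedTarget(receiver, index)
    }
}

def site = new SizeOneCallSite()
def asList = [3, 1, 2]
def asArray = [3, 1, 2] as Integer[]
1000.times {
    def receiver = (it % 2) ? asArray : asList   // alternate receiver types, as the benchmark does
    site.invokeGetAt(receiver, 0)
}
println "relinks: ${site.relinks}"               // 1000: the size-1 cache never settles

In the real benchmark the type only flips once per sort rather than once per call, but the effect is the same: the size-1 cache keeps being invalidated, and with indy each relink means building fresh method handles and lambda forms.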

Maternity answered 23/4, 2015 at 12:39 Comment(4)
That makes sense, but I'm still confused as to why --indy is still slower when I don't change the list type. Surely HotSpot should be optimized enough after the warm-up of 100 rounds (which calls DefaultGroovyMethods#getAt(List<T>, int) and putAt(List<T>, int, T) about 50 million times) to allow --indy to be faster? – Knowlton
I suggest that you run the program with --disableopt=all and compare. This will disable the primitive optimizations, which surely contribute to better performance in your program, at least for the calls to rise and swap. Primopts can, under certain circumstances, add guarded direct method calls. While the guard costs something, the direct method call can make things much faster than usual. I usually also run without tiered compilation (-XX:-TieredCompilation), since it increases warmup times. After these changes indy should be faster, or at least equally fast (see the command sketch after these comments). – Maternity
In the long run I may add some primopts, or a modified version of them, to indy as well. – Maternity
With -XX:-TieredCompilation and --disableopt=all the execution times are close enough that I'd consider them equal: ~49.8 with --indy and ~49.2 without. Considering I only ran 1000 iterations, I'd say that's close enough. – Knowlton
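
Putting the two suggestions together with the commands from the question would look roughly like this (both flags are quoted from the comments above; passing --disableopt to groovyc rather than to the groovy launcher is an assumption):

$ groovyc --indy --disableopt=all BubbleSort.groovy
$ java -XX:-TieredCompilation -cp ~/.gvm/groovy/current/embeddable/groovy-all-2.4.3-indy.jar:. BubbleSort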
