Comparing performance of Collectors.summingLong and Collectors.counting

Benchmarks are run on an Intel Core i5 under Ubuntu:

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

I'm comparing performance of Collectors.counting and Collectors.summingLong(x -> 1L). Here is the benchmark:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collector;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class MyBenchmark {

    public List<Integer> ints = new ArrayList<>();

    Collector<Integer, ?, Long> counting = Collectors.counting();
    Collector<Integer, ?, Long> summingLong = Collectors.summingLong(x -> 1L);

    @Setup
    public void setup() throws Exception {
        // fill the list with 10 random ints in [0, 1000)
        Random random = new Random();
        for (int i = 0; i < 10; i++) {
            ints.add(random.nextInt(1000));
        }
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public Long counting() {
        return ints.stream().collect(counting);
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public Long summingLong() {
        return ints.stream().collect(summingLong);
    }
}
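
For completeness, the runs below (25 forks with -prof perfnorm, and later -prof gc) can also be launched programmatically via the JMH Runner API. This is only a sketch of one way to do it; the launcher class is hypothetical and the original numbers were presumably produced from the command line:

import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Hypothetical launcher; equivalent to running the benchmark jar with -f 25 -prof perfnorm.
public class MyBenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .forks(25)
                .addProfiler(LinuxPerfNormProfiler.class) // swap in GCProfiler.class for -prof gc
                .build();
        new Runner(opts).run();
    }
}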

I got the result that Collectors.counting is about 3 times slower than Collectors.summingLong.

So I ran it with -prof perfnorm with 25 forks. Here is the result:

Benchmark                                        Mode  Cnt    Score     Error  Units
MyBenchmark.counting                             avgt  125   87.115 ±   3.787  ns/op
MyBenchmark.counting:CPI                         avgt   25    0.310 ±   0.011   #/op
MyBenchmark.counting:L1-dcache-load-misses       avgt   25    1.963 ±   0.171   #/op
MyBenchmark.counting:L1-dcache-loads             avgt   25  258.716 ±  41.458   #/op
MyBenchmark.counting:L1-dcache-store-misses      avgt   25    1.890 ±   0.168   #/op
MyBenchmark.counting:L1-dcache-stores            avgt   25  131.344 ±  16.433   #/op
MyBenchmark.counting:L1-icache-load-misses       avgt   25    0.035 ±   0.007   #/op
MyBenchmark.counting:LLC-loads                   avgt   25    0.411 ±   0.143   #/op
MyBenchmark.counting:LLC-stores                  avgt   25    0.007 ±   0.001   #/op
MyBenchmark.counting:branch-misses               avgt   25    0.029 ±   0.017   #/op
MyBenchmark.counting:branches                    avgt   25  139.669 ±  21.901   #/op
MyBenchmark.counting:cycles                      avgt   25  261.967 ±  29.392   #/op
MyBenchmark.counting:dTLB-load-misses            avgt   25    0.036 ±   0.003   #/op
MyBenchmark.counting:dTLB-loads                  avgt   25  258.111 ±  41.008   #/op
MyBenchmark.counting:dTLB-store-misses           avgt   25    0.001 ±   0.001   #/op
MyBenchmark.counting:dTLB-stores                 avgt   25  131.394 ±  16.166   #/op
MyBenchmark.counting:iTLB-load-misses            avgt   25    0.001 ±   0.001   #/op
MyBenchmark.counting:iTLB-loads                  avgt   25    0.001 ±   0.001   #/op
MyBenchmark.counting:instructions                avgt   25  850.262 ± 113.228   #/op
MyBenchmark.counting:stalled-cycles-frontend     avgt   25   48.493 ±   8.968   #/op
MyBenchmark.summingLong                          avgt  125   37.238 ±   0.194  ns/op
MyBenchmark.summingLong:CPI                      avgt   25    0.311 ±   0.002   #/op
MyBenchmark.summingLong:L1-dcache-load-misses    avgt   25    1.793 ±   0.013   #/op
MyBenchmark.summingLong:L1-dcache-loads          avgt   25   93.785 ±   0.640   #/op
MyBenchmark.summingLong:L1-dcache-store-misses   avgt   25    1.727 ±   0.013   #/op
MyBenchmark.summingLong:L1-dcache-stores         avgt   25   56.249 ±   0.408   #/op
MyBenchmark.summingLong:L1-icache-load-misses    avgt   25    0.020 ±   0.003   #/op
MyBenchmark.summingLong:LLC-loads                avgt   25    0.843 ±   0.117   #/op
MyBenchmark.summingLong:LLC-stores               avgt   25    0.004 ±   0.001   #/op
MyBenchmark.summingLong:branch-misses            avgt   25    0.008 ±   0.002   #/op
MyBenchmark.summingLong:branches                 avgt   25   61.472 ±   0.260   #/op
MyBenchmark.summingLong:cycles                   avgt   25  110.949 ±   0.784   #/op
MyBenchmark.summingLong:dTLB-load-misses         avgt   25    0.031 ±   0.001   #/op
MyBenchmark.summingLong:dTLB-loads               avgt   25   93.662 ±   0.616   #/op
MyBenchmark.summingLong:dTLB-store-misses        avgt   25   ≈ 10⁻³             #/op
MyBenchmark.summingLong:dTLB-stores              avgt   25   56.302 ±   0.351   #/op
MyBenchmark.summingLong:iTLB-load-misses         avgt   25    0.001 ±   0.001   #/op
MyBenchmark.summingLong:iTLB-loads               avgt   25   ≈ 10⁻³             #/op
MyBenchmark.summingLong:instructions             avgt   25  357.029 ±   1.712   #/op
MyBenchmark.summingLong:stalled-cycles-frontend  avgt   25   10.074 ±   1.096   #/op

What I noticed is this:

The branches, instructions, and cycles differ by almost a factor of 3, and so do the cache operations. The branches seem well predicted, and there are not too many cache misses (in my opinion).

So the issue might lie in the compiled code. I ran it with -prof perfasm (the output is too long to include here).

In compiled code I noticed the following:

I. Collectors.summingLong assembly

We have 3 loops iterating over the list and counting. The first counts just one element:

0x00007f9abd226dfd: mov %edi,%ebp ;contains the iteration index
incl %ebp
;...
0x00007f9abd226e27: incl %edi
0x00007f9abd226e29: cmp %ebp,%edi
0x00007f9abd226e2b: jnl 0x7f9abd226e34 ;exit after the first iteration

The second counts 4 elements per iteration (is that loop unrolling?) and also exits after the first one; a rough Java sketch of the unrolling idea follows the listing.

0x00007f9abd226ea6: add $0x1,%rsi 
;...
0x00007f9abd226ed0: add $0x2,%rsi
;...
0x00007f9abd226efa: add $0x3,%rsi
;...
0x00007f9abd226f1c: add $0x4,%rbx
;...
0x00007f9abd226f20: mov %rbx,0x10(%r14)
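
In Java terms, unrolling the accumulator update by 4 corresponds roughly to the sketch below. This is only an illustration of the idea (the countUnrolled method is mine, not something the JIT literally emits):

// Conceptual 4x unrolling of "acc[0] += 1L": the +1/+2/+3 partial sums stay in registers,
// and only the +4 total is written back to the accumulator per unrolled step.
static long countUnrolled(java.util.List<Integer> list) {
    long[] acc = new long[1];
    int i = 0;
    int n = list.size();
    for (; i + 4 <= n; i += 4) {
        acc[0] += 4;   // one write-back instead of four
    }
    for (; i < n; i++) {
        acc[0] += 1;   // tail loop counts the remaining elements
    }
    return acc[0];
}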

And the third counts the rest of the elements.

II. Collectors.counting assembly

We have only one loop, counting all the elements one-by-one (not unrolled). We also have the boxing conversion of the counting result inlined inside the loop, and in addition there seems to be a boxing conversion inside the loop that is not inlined:

0x00007f80dd22dfb5: mov $0x1,%esi
0x00007f80dd22dfba: nop
0x00007f80dd22dfbb: callq 0x7f80dd046420

which seems to perform boxing of the 1L provided in the lambda e -> 1L, though it is not clear why. When performing the actual addition we have this code:

0x00007f80dd22dfec: mov $0x1,%r10d
0x00007f80dd22dff2: add 0x10(%r11),%r10

Also, we store the result of counting on the stack (mov %r10d,0x10(%rsp)) instead of on the heap as in the first case.
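
For context (this is my reading, not something the profile proves): in JDK 8, counting() is built from the lambda e -> 1L used as a Function<T, Long>, so returning the primitive constant forces an autoboxing call on every element, roughly:

import java.util.function.Function;

Function<Integer, Long> mapper = e -> 1L; // the e -> 1L mapper; its declared return type is Long
Long boxedOne = mapper.apply(42);         // autoboxing, i.e. a Long.valueOf(1L) call per element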

If my understanding of what's going on is correct, I have the following

QUESTION: Did the loop unrolling together with the boxing conversions cause the 3x slowdown? If so, why didn't the runtime unroll the loop in the counting case?

Notice that Collectors.summingLong produces 2.5 times more GC pressure than Collectors.counting. This is not quite clear to me (I can only guess that intermediate values are stored on the stack in the Collectors.counting case).

MyBenchmark.counting                                      avgt    5    96.956 ±   4.412   ns/op
MyBenchmark.counting:·gc.alloc.rate                       avgt    5   734.538 ±  33.083  MB/sec
MyBenchmark.counting:·gc.alloc.rate.norm                  avgt    5   112.000 ±   0.001    B/op
MyBenchmark.counting:·gc.churn.PS_Eden_Space              avgt    5   731.423 ± 340.767  MB/sec
MyBenchmark.counting:·gc.churn.PS_Eden_Space.norm         avgt    5   111.451 ±  48.411    B/op
MyBenchmark.counting:·gc.churn.PS_Survivor_Space          avgt    5     0.017 ±   0.067  MB/sec
MyBenchmark.counting:·gc.churn.PS_Survivor_Space.norm     avgt    5     0.003 ±   0.010    B/op
MyBenchmark.counting:·gc.count                            avgt    5    16.000            counts
MyBenchmark.counting:·gc.time                             avgt    5    12.000                ms
MyBenchmark.summingLong                                   avgt    5    38.371 ±   1.733   ns/op
MyBenchmark.summingLong:·gc.alloc.rate                    avgt    5  1856.581 ±  81.706  MB/sec
MyBenchmark.summingLong:·gc.alloc.rate.norm               avgt    5   112.000 ±   0.001    B/op
MyBenchmark.summingLong:·gc.churn.PS_Eden_Space           avgt    5  1876.736 ± 192.503  MB/sec
MyBenchmark.summingLong:·gc.churn.PS_Eden_Space.norm      avgt    5   113.213 ±   9.916    B/op
MyBenchmark.summingLong:·gc.churn.PS_Survivor_Space       avgt    5     0.033 ±   0.072  MB/sec
MyBenchmark.summingLong:·gc.churn.PS_Survivor_Space.norm  avgt    5     0.002 ±   0.004    B/op
MyBenchmark.summingLong:·gc.count                         avgt    5    62.000            counts
MyBenchmark.summingLong:·gc.time                          avgt    5    48.000                ms
Cargile answered 30/1, 2018 at 12:5 Comment(13)
Could you try final Long one = Long.valueOf(1L); Collector<Integer, ?, Long> counting = Collectors.reducing(0L, e -> one, Long::sum); and see if it makes any difference? This is an implementation of Collectors.counting(...) that boxes 1L upfront.Apeldoorn
@dasblinkenlight You mean callq 0x7f80dd046420 is not the boxing as I thought? So what is that?Cargile
I think it's boxing. I took implementation of Collectors.counting from here, and replaced e -> 1L with e -> one, where one is a pre-boxed value of 1L.Apeldoorn
@dasblinkenlight Okay, got it, thank you. The question is why the runtime put it there (not inlined) if we perform the addition with an immediate (mov $0x1,%r10d; add 0x10(%r11),%r10) later on?Cargile
Note that the new edition of Effective Java points out (Item 46, p. 214): "The collectors returned by the counting method are intended only for use as downstream collectors. The same functionality is available directly on Stream, via the count method, so there is never a reason to say collect(counting())." In other words, you're not meant to do this :) probably for the reasons that you're describing here.Grantor
@AndyTurner Agree, but was just curious to investigate what was actually going on. :)Cargile
I never thought you would actually post this as a question. :) good job!Mobility
@AndyTurner great pointer to Effective Java! I did a quick benchmark against an int[1000] rather than a list to remove the cache factor, and using a simple Stream.count still results in slower processing than using Collectors.summingLong() ?!Explode
@Mobility Actually, what I tried first was to attach to the running benchmarks with gdb, but this seems not easy. In the case of a regular Java app I can set the compile threshold, wait till the method is compiled, then sleep, and that's it. But in JMH the assembly is printed after it's finished. Maybe you know a way to use gdb with JMH?Cargile
@Mobility Anyway, I would like to understand the difference in GC pressure. I still think it happens because the stack is heavily used in the counting case... Can that be the reason for such a difference?Cargile
@Cargile well, IIRC JMH will fail if there is a major collection (but I might be wrong...). What I do to get away from that is run each benchmark with jvmArgs giving as much heap as I can allocate. As for gdb, I've never once played with it, so this is out of my league :(Mobility
This is a known issue fixed in JDK-9: bugs.openjdk.java.net/browse/JDK-8136686Hesperian
@Hesperian This is not related to the performance difference... But do you have any idea about why runtime didn't unroll the loop in the case of Collectors.counting?Cargile

I didn't look at the assembly or analyze it but a look at the sources already provides some information.

summingLong() results in this collector:

new CollectorImpl<>(
            () -> new long[1],
            (a, t) -> { a[0] += mapper.applyAsLong(t); },
            (a, b) -> { a[0] += b[0]; return a; },
            a -> a[0], CH_NOID);

counting() results in this:

new CollectorImpl<>(
            boxSupplier(identity),
            (a, t) -> { a[0] = op.apply(a[0], mapper.apply(t)); },
            (a, b) -> { a[0] = op.apply(a[0], b[0]); return a; },
            a -> a[0], CH_NOID);

As you can see, the counting() version is doing some more things:

  • boxing
  • calls to op.apply(...)

Since op is Long::sum, which operates on primitives, there's a lot of boxing and unboxing involved.
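
Stripped of the Collector plumbing, the per-element work of the two collectors looks roughly like this (a sketch derived from the sources above; the helper methods are hypothetical):

import java.util.List;

// counting() in JDK 8: boxed accumulator plus op.apply(...), i.e. unbox both arguments,
// add, and box the sum again on every element.
static Long countBoxed(List<Integer> list) {
    Long[] a = { 0L };
    for (Integer t : list) {
        a[0] = Long.sum(a[0], 1L); // unbox a[0], add, re-box the result
    }
    return a[0];
}

// summingLong(x -> 1L): primitive accumulator, a plain long addition per element.
static Long countPrimitive(List<Integer> list) {
    long[] a = new long[1];
    for (Integer t : list) {
        a[0] += 1L;                // no boxing on the hot path
    }
    return a[0];                   // boxed once at the end, like the collector's finisher
}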

Vegetarian answered 30/1, 2018 at 12:17 Comment(5)
Thank you for the answer. Yes, I looked at this code and noticed the boxing conversion. But... if we run the benchmark with -prof gc, we can notice that the allocation rate in the summingLong case is 2.5 times higher: 1800 vs 700 MB/sec.Cargile
@Vegetarian it's interesting that in java-9, they changed Collectors.counting to summingLong(x -> 1L)Mobility
@Mobility interesting indeed, maybe they realized the difference ;)Vegetarian
@Vegetarian Actually, agree. I just tried the benchmarks with -XX:LoopUnrollLimit=1 and verified that the loop was no longer unrolled in the summingLong case. I did not get much of a performance difference. It seems that boxing has the greatest performance impact.Cargile
The problem was that Collectors.counting() did unnecessary boxing in JDK 8. This issue was discussed on the mailing list, and then fixed in JDK 9, see JDK-8136686Hesperian
