Java 8 stream unpredictable performance drop with no obvious reason
Asked Answered
F

1

59

I am using Java 8 streams to iterate over a list with sublists. The outer list size varies between 100 to 1000 (different test runs) and the inner list size is always 5.

There are 2 benchmark runs which show unexpected performance deviations.

package benchmark;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

@Threads(32)
@Warmup(iterations = 25)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
@Fork(1)
@BenchmarkMode(Mode.Throughput)
public class StreamBenchmark {
    @Param({"700", "600", "500", "400", "300", "200", "100"})
    int outerListSizeParam;
    final static int INNER_LIST_SIZE = 5;
    List<List<Integer>> list;

    Random rand() {
        return ThreadLocalRandom.current();
    }

    final BinaryOperator<Integer> reducer = (val1, val2) -> val1 + val2;

    final Supplier<List<Integer>> supplier = () -> IntStream
            .range(0, INNER_LIST_SIZE)
            .mapToObj(ptr -> rand().nextInt(100))
            .collect(Collectors.toList());

    @Setup
    public void init() throws IOException {
        list = IntStream
                .range(0, outerListSizeParam)
                .mapToObj(i -> supplier.get())
                .collect(Collectors.toList());
    }

    @Benchmark
    public void loop(Blackhole bh) throws Exception {
        List<List<Integer>> res = new ArrayList<>();
        for (List<Integer> innerList : list) {
            if (innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) {
                res.add(innerList);
            }
        }

        bh.consume(res);
    }

    @Benchmark
    public void stream(Blackhole bh) throws Exception {
        List<List<Integer>> res = list
                .stream()
                .filter(innerList -> innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000))
                .collect(Collectors.toList());

        bh.consume(res);
    }
}

Run 1

    Benchmark               (outerListSizeParam)   Mode  Cnt        Score        Error  Units
    StreamBenchmark.loop                     700  thrpt    5    22488.601 ?   1128.543  ops/s
    StreamBenchmark.loop                     600  thrpt    5    26010.430 ?   1161.854  ops/s
    StreamBenchmark.loop                     500  thrpt    5   361837.395 ?  12777.016  ops/s
    StreamBenchmark.loop                     400  thrpt    5   451774.457 ?  22517.801  ops/s
    StreamBenchmark.loop                     300  thrpt    5   744677.723 ?  23456.913  ops/s
    StreamBenchmark.loop                     200  thrpt    5  1102075.707 ?  38678.994  ops/s
    StreamBenchmark.loop                     100  thrpt    5  2334981.090 ? 100973.551  ops/s
    StreamBenchmark.stream                   700  thrpt    5    22320.346 ?    496.432  ops/s
    StreamBenchmark.stream                   600  thrpt    5    26091.609 ?   1044.868  ops/s
    StreamBenchmark.stream                   500  thrpt    5    31961.096 ?    497.854  ops/s
    StreamBenchmark.stream                   400  thrpt    5   377701.859 ?  11115.990  ops/s
    StreamBenchmark.stream                   300  thrpt    5    53887.652 ?   1228.245  ops/s
    StreamBenchmark.stream                   200  thrpt    5    78754.294 ?   2173.316  ops/s
    StreamBenchmark.stream                   100  thrpt    5  1564899.788 ?  47369.698  ops/s

Run 2

    Benchmark               (outerListSizeParam)   Mode  Cnt        Score       Error  Units
    StreamBenchmark.loop                    1000  thrpt   10    16179.702 ?   260.134  ops/s
    StreamBenchmark.loop                     700  thrpt   10    22924.319 ?   329.134  ops/s
    StreamBenchmark.loop                     600  thrpt   10    26871.267 ?   416.464  ops/s
    StreamBenchmark.loop                     500  thrpt   10   353043.221 ?  6628.980  ops/s
    StreamBenchmark.loop                     300  thrpt   10   772234.261 ? 10075.536  ops/s
    StreamBenchmark.loop                     100  thrpt   10  2357125.442 ? 30824.834  ops/s
    StreamBenchmark.stream                  1000  thrpt   10    15526.423 ?   147.454  ops/s
    StreamBenchmark.stream                   700  thrpt   10    22347.898 ?   117.360  ops/s
    StreamBenchmark.stream                   600  thrpt   10    26172.790 ?   229.745  ops/s
    StreamBenchmark.stream                   500  thrpt   10    31643.518 ?   428.680  ops/s
    StreamBenchmark.stream                   300  thrpt   10   536037.041 ?  6176.192  ops/s
    StreamBenchmark.stream                   100  thrpt   10   153619.054 ?  1450.839  ops/s

I have two questions:

  1. Why is there a consistent, significant performance difference between loop+500 and loop+600 for both test runs?
  2. Why in Run1 stream+400 and Run2 stream+300 is there a significant but inconsistent performance deviation?

It looks like the JIT sometimes makes suboptimal optimization decisions, causing a huge performance drop.

The test machine has 128GB RAM and 32 CPU cores:

    java version "1.8.0_45"
    Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                32
    On-line CPU(s) list:   0-31
    Thread(s) per core:    2
    Core(s) per socket:    8
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 62
    Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    Stepping:              4
    CPU MHz:               1201.078
    CPU max MHz:           3400.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              5201.67
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              20480K
    NUMA node0 CPU(s):     0-7,16-23
    NUMA node1 CPU(s):     8-15,24-31

P.S. Added benchmark with no stream. These tests (loop + stream + pureLoop) make me think that using streams and lambdas would require a lot of micro optimisation efforts and does not guarantee consistent performance anyway.

@Benchmark
public void pureLoop(Blackhole bh) throws Exception {
    List<List<Integer>> res = new ArrayList<>();
    for (List<Integer> innerList : list) {
        int sum = 0;
        for (Integer i : innerList) {
            sum += i;
        }
        if (sum == rand().nextInt(2000))
            res.add(innerList);
    }

    bh.consume(res);
}

Run 3 (pure loops)

Benchmark                 (outerListSizeParam)   Mode  Cnt         Score        Error  Units
StreamBenchmark.loop                      1000  thrpt    5     15848.277 ?    445.624  ops/s
StreamBenchmark.loop                       700  thrpt    5     22330.289 ?    484.554  ops/s
StreamBenchmark.loop                       600  thrpt    5     26353.565 ?    631.421  ops/s
StreamBenchmark.loop                       500  thrpt    5    358144.956 ?   8273.981  ops/s
StreamBenchmark.loop                       400  thrpt    5    591471.382 ?  17725.212  ops/s
StreamBenchmark.loop                       300  thrpt    5    785458.022 ?  23775.650  ops/s
StreamBenchmark.loop                       200  thrpt    5   1192328.880 ?  40006.056  ops/s
StreamBenchmark.loop                       100  thrpt    5   2330555.766 ?  73143.081  ops/s
StreamBenchmark.pureLoop                  1000  thrpt    5   1024629.128 ?   4387.106  ops/s
StreamBenchmark.pureLoop                   700  thrpt    5   1495365.029 ?  31659.941  ops/s
StreamBenchmark.pureLoop                   600  thrpt    5   1787432.825 ?  16611.868  ops/s
StreamBenchmark.pureLoop                   500  thrpt    5   2087093.023 ?  20143.165  ops/s
StreamBenchmark.pureLoop                   400  thrpt    5   2662946.999 ?  33326.079  ops/s
StreamBenchmark.pureLoop                   300  thrpt    5   3657830.227 ?  55020.775  ops/s
StreamBenchmark.pureLoop                   200  thrpt    5   5365706.786 ?  64404.783  ops/s
StreamBenchmark.pureLoop                   100  thrpt    5  10477430.730 ? 187641.413  ops/s
StreamBenchmark.stream                    1000  thrpt    5     15576.304 ?    250.620  ops/s
StreamBenchmark.stream                     700  thrpt    5     22286.965 ?   1153.734  ops/s
StreamBenchmark.stream                     600  thrpt    5     26109.258 ?    296.382  ops/s
StreamBenchmark.stream                     500  thrpt    5     31343.986 ?   1270.210  ops/s
StreamBenchmark.stream                     400  thrpt    5     39696.775 ?   1812.355  ops/s
StreamBenchmark.stream                     300  thrpt    5    536932.353 ?  41249.909  ops/s
StreamBenchmark.stream                     200  thrpt    5     77797.301 ?    976.641  ops/s
StreamBenchmark.stream                     100  thrpt    5    155387.348 ?   3182.841  ops/s

Solution: as recommended by apangin disabling tiered compilation made JIT results stable.

java -XX:-TieredCompilation -jar test-jmh.jar

Benchmark                 (outerListSizeParam)   Mode  Cnt         Score        Error  Units
StreamBenchmark.loop                      1000  thrpt    5    160410.288 ?   4426.320  ops/s
StreamBenchmark.loop                       700  thrpt    5    230524.018 ?   4426.740  ops/s
StreamBenchmark.loop                       600  thrpt    5    266266.663 ?   9078.827  ops/s
StreamBenchmark.loop                       500  thrpt    5    324182.307 ?   8452.368  ops/s
StreamBenchmark.loop                       400  thrpt    5    400793.677 ?  12526.475  ops/s
StreamBenchmark.loop                       300  thrpt    5    534618.231 ?  25616.352  ops/s
StreamBenchmark.loop                       200  thrpt    5    803314.614 ?  33108.005  ops/s
StreamBenchmark.loop                       100  thrpt    5   1827400.764 ?  13868.253  ops/s
StreamBenchmark.pureLoop                  1000  thrpt    5   1126873.129 ?  33307.600  ops/s
StreamBenchmark.pureLoop                   700  thrpt    5   1560200.150 ? 150146.319  ops/s
StreamBenchmark.pureLoop                   600  thrpt    5   1848113.823 ?  16195.103  ops/s
StreamBenchmark.pureLoop                   500  thrpt    5   2250201.116 ? 130995.240  ops/s
StreamBenchmark.pureLoop                   400  thrpt    5   2839212.063 ? 142008.523  ops/s
StreamBenchmark.pureLoop                   300  thrpt    5   3807436.825 ? 140612.798  ops/s
StreamBenchmark.pureLoop                   200  thrpt    5   5724311.256 ?  77031.417  ops/s
StreamBenchmark.pureLoop                   100  thrpt    5  11718427.224 ? 101424.952  ops/s
StreamBenchmark.stream                    1000  thrpt    5     16186.121 ?    249.806  ops/s
StreamBenchmark.stream                     700  thrpt    5     22071.884 ?    703.729  ops/s
StreamBenchmark.stream                     600  thrpt    5     25546.378 ?    472.804  ops/s
StreamBenchmark.stream                     500  thrpt    5     32271.659 ?    437.048  ops/s
StreamBenchmark.stream                     400  thrpt    5     39755.841 ?    506.207  ops/s
StreamBenchmark.stream                     300  thrpt    5     52309.706 ?   1271.206  ops/s
StreamBenchmark.stream                     200  thrpt    5     79277.532 ?   2040.740  ops/s
StreamBenchmark.stream                     100  thrpt    5    161244.347 ?   3882.619  ops/s
Fifine answered 6/6, 2015 at 21:24 Comment(8)
When I replaced Collectors.toList with forEach(res::add) at stream they start working with the same performance.Woollen
If you want to investigate codegen, look at -XX:+PrintAssembly. As you're running on a NUMA machine, you could try running only on one socket and see if the trends are the same.Alrick
Make sure to disable turbo boost for benchmarking. It does make a difference if your CPU runs as 2.6 GHz or at 3.4 GHz. Also consider adding a real loop based verison, that also does not use a reducer on the inner lists. ;-)Chiropractor
Not sure if turbo boost could cause 10 times performance gaps, also before each estimate there was jvm warmup for 25 iterations.Fifine
@Woollen - What memory options are you using for your JVM? Try using different options for -Xmx and -Xms and see what happens to your benchmark wnen using Collectors.toList.Caddis
Added test result with -Xms4G, Xmx4G. I have tested with different JIT settings, some of them gave higher probability of performance boost, but it was random success, which could not be predictably reproduced.Fifine
Have you tried disabling JIT (-Xint)?Androgen
Added run with disabled JIT. It equally slows down all tests and avoids random performance gaps.Fifine
P
72

This effect is caused by Type Profile Pollution. Let me explain on a simplified benchmark:

@State(Scope.Benchmark)
public class Streams {
    @Param({"500", "520"})
    int iterations;

    @Setup
    public void init() {
        for (int i = 0; i < iterations; i++) {
            Stream.empty().reduce((x, y) -> x);
        }
    }

    @Benchmark
    public long loop() {
        return Stream.empty().count();
    }
}

Though iteration parameter here changes very slightly and it does not affect the main benchmark loop, the results expose very surprising 2.5x performance degradation:

Benchmark     (iterations)   Mode  Cnt      Score     Error   Units
Streams.loop           500  thrpt    5  29491,039 ± 240,953  ops/ms
Streams.loop           520  thrpt    5  11867,860 ± 344,779  ops/ms

Now let's run JMH with -prof perfasm option to see the hottest code regions:

Fast case (iterations = 500):

....[Hottest Methods (after inlining)]..................................
 48,66%  bench.generated.Streams_loop::loop_thrpt_jmhStub
 23,14%  <unknown>
  2,99%  java.util.stream.Sink$ChainedReference::<init>
  1,98%  org.openjdk.jmh.infra.Blackhole::consume
  1,68%  java.util.Objects::requireNonNull
  0,65%  java.util.stream.AbstractPipeline::evaluate

Slow case (iterations = 520):

....[Hottest Methods (after inlining)]..................................
 40,09%  java.util.stream.ReduceOps$ReduceOp::evaluateSequential
 22,02%  <unknown>
 17,61%  bench.generated.Streams_loop::loop_thrpt_jmhStub
  1,25%  org.openjdk.jmh.infra.Blackhole::consume
  0,74%  java.util.stream.AbstractPipeline::evaluate

Looks like the slow case spends the most time in ReduceOp.evaluateSequential method that is not inlined. Furthermore, if we study the assembly code for this method we'll find that the longest operation is checkcast.

You know how HotSpot compiler works: before the JIT starts, a method is executed in interpreter for some time to collect the profile data, e.g. what methods are called, what classes are seen, what branches are taken etc. With Tiered compilation the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimizied code. However if the application changes execution pattern in the middle, the generated code may be not optimal for the modified behavior.

Let's use -XX:+PrintMethodData (available in debug JVM) to compare the execution profiles:

----- Fast case -----
java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
  interpreter_invocation_count:    13382 
  invocation_counter:              13382 
  backedge_counter:                    0 
  mdo size: 552 bytes

0 aload_1
1 fast_aload_0
2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 
  0   bci: 2    VirtualCallData     count(0) entries(1)
                                    'java/util/stream/ReduceOps$8'(12870 1.00)
5 aload_2
6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 
  48  bci: 6    VirtualCallData     count(0) entries(1)
                                    'java/util/stream/ReferencePipeline$5'(12870 1.00)
9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
  96  bci: 9    ReceiverTypeData    count(0) entries(1)
                                    'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 
  144 bci: 12   VirtualCallData     count(0) entries(1)
                                    'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
17 areturn

----- Slow case -----
java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
  interpreter_invocation_count:    54751 
  invocation_counter:              54751 
  backedge_counter:                    0 
  mdo size: 552 bytes

0 aload_1
1 fast_aload_0
2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 
  0   bci: 2    VirtualCallData     count(0) entries(2)
                                    'java/util/stream/ReduceOps$2'(16 0.00)
                                    'java/util/stream/ReduceOps$8'(54223 1.00)
5 aload_2
6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 
  48  bci: 6    VirtualCallData     count(0) entries(2)
                                    'java/util/stream/ReferencePipeline$Head'(16 0.00)
                                    'java/util/stream/ReferencePipeline$5'(54223 1.00)
9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
  96  bci: 9    ReceiverTypeData    count(0) entries(2)
                                    'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                    'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 
  144 bci: 12   VirtualCallData     count(0) entries(2)
                                    'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                    'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
17 areturn

You see, the initialization loop ran too long that its statistics appeared in the execution profile: all virtual methods have two implementations and checkcast has also two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and JIT can easily inline and optimize them.

The same is true for your original benchmark: longer stream operations in init() method polluted the profile. If you play with profile and tiered compilation options, the results can be quite different. For example, try

  1. -XX:-ProfileInterpreter
  2. -XX:Tier3InvocationThreshold=1000
  3. -XX:-TieredCompilation

Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hope this will be improved in Java 9.

Platinous answered 9/6, 2015 at 1:50 Comment(3)
Damn, that's detailed. You know how HotSpot compiler works Not really, so thanks for the explanation.Cirrus
Could also be interesting to play with -XX:ReservedCodeCacheSizeAndrogen
Thanks for the answer @apangin. Has this been solved with Java 9 though ?Rhabdomancy

© 2022 - 2024 — McMap. All rights reserved.