Why is the sum of reciprocals using a for-loop ~400x faster than streams?
This code benchmarks three different ways to compute the sum of the reciprocals of the elements of a double[]:

  1. a for-loop
  2. Java 8 streams
  3. the Colt math library

What is the reason that the computation using a simple for-loop is ~400 times faster than the one using streams? (Or is there anything that needs to be improved in the benchmarking code? Or a faster way of computing this using streams?)

Code:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import cern.colt.list.DoubleArrayList;
import cern.jet.stat.Descriptive;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MyBenchmark {

    public static double[] array;

    static {
        int num_of_elements = 100;
        array = new double[num_of_elements];
        for (int i = 0; i < num_of_elements; i++) {
            array[i] = i+1;
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void testInversionSumForLoop(){
        double result = 0;
        for (int i = 0; i < array.length; i++) {
            result += 1.0/array[i];
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void testInversionSumUsingStreams(){
        double result = 0;
        result = Arrays.stream(array).map(d -> 1/d).sum();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void testInversionSumUsingCernColt(){
        double result = Descriptive.sumOfInversions(new DoubleArrayList(array), 0, array.length-1);
    }
}

Results:

/**
 * Results
 * Benchmark                                  Mode  Cnt    Score    Error  Units
 * MyBenchmark.testInversionSumForLoop        avgt  200    1.647 ±  0.155  ns/op
 * MyBenchmark.testInversionSumUsingCernColt  avgt  200  603.254 ± 22.199  ns/op
 * MyBenchmark.testInversionSumUsingStreams   avgt  200  645.895 ± 20.833  ns/op
 */

Update: these results show that Blackhole.consume (or returning the result) is necessary to prevent the JVM from eliminating the computation; a sketch of the corrected loop benchmark follows the updated results.

/**
 * Updated results after adding Blackhole.consume
 * Benchmark                                  Mode  Cnt    Score    Error  Units
 * MyBenchmark.testInversionSumForLoop        avgt  200  525.498 ± 10.458  ns/op
 * MyBenchmark.testInversionSumUsingCernColt  avgt  200  517.930 ±  2.080  ns/op
 * MyBenchmark.testInversionSumUsingStreams   avgt  200  582.103 ±  3.261  ns/op
 */
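
For reference, a minimal sketch of the corrected loop benchmark (per the comment below, returning the value is an alternative to Blackhole.consume; it assumes the same array field as above):

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public double testInversionSumForLoop() {
    double result = 0;
    for (int i = 0; i < array.length; i++) {
        result += 1.0 / array[i];
    }
    // returning the result makes it observable, so the JIT cannot eliminate the loop
    return result;
}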

Oracle JDK version "1.8.0_181", Darwin Kernel Version 17.7.0

Sathrum asked 12/1, 2019 at 21:13. Comments (3):
JMH does not automatically make the benchmark correct. The result of the computation is never used here, so the JIT removes the computation altogether in the loop case. You need to "consume" the result, either by calling Blackhole.consume or by simply adding return result; at the end of the benchmark method. – Cruciate
1.647 ns is too little time to sum 100 numbers; that's over 6e+10 operations per second. I think testInversionSumForLoop is being optimized away by the JIT compiler. – Elyssa
@Cruciate you are right, I missed Blackhole.consume; the results are now similar. – Sathrum
In your example the JVM most likely optimizes the loop away completely because the result value is never read after the computation. You should use Blackhole to consume the result, like below:

import static java.util.concurrent.TimeUnit.MILLISECONDS;

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@Warmup(iterations = 10, time = 200, timeUnit = MILLISECONDS)
@Measurement(iterations = 20, time = 500, timeUnit = MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

  static double[] array;

  static {
    int num_of_elements = 100;
    array = new double[num_of_elements];
    for (int i = 0; i < num_of_elements; i++) {
      array[i] = i + 1;
    }
  }

  double result = 0;

  @Benchmark
  public void baseline(Blackhole blackhole) {
    result = 1;
    result = result / 1.0;
    blackhole.consume(result);
  }

  @Benchmark
  public void testInversionSumForLoop(Blackhole blackhole) {
    for (int i = 0; i < array.length; i++) {
      result += 1.0 / array[i];
    }
    blackhole.consume(result);
  }

  @Benchmark
  public void testInversionSumUsingStreams(Blackhole blackhole) {
    result = Arrays.stream(array).map(d -> 1 / d).sum();
    blackhole.consume(result);
  }

}

This new benchmark shows a difference of about 4x, which is expected. Loops benefit from a number of JVM optimizations and don't involve object creation the way streams do.

Benchmark                                 Mode  Cnt    Score   Error  Units
MyBenchmark.baseline                      avgt  100    2.437 ±  0.139  ns/op
MyBenchmark.testInversionSumForLoop       avgt  100  135.512 ± 13.080  ns/op
MyBenchmark.testInversionSumUsingStreams  avgt  100  506.479 ±  4.209  ns/op

I added a baseline to show the cost of a single operation on my machine. The baseline ns/op is similar to your loop's ns/op, which IMO confirms your loop was optimized away.

I'd love for someone to tell me what a good baseline for this benchmark scenario would be.

My environment:

openjdk version "11.0.1" 2018-10-16
OpenJDK Runtime Environment 18.9 (build 11.0.1+13)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)

Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Odalisque answered 12/1, 2019 at 21:56. Comments (7):
My new results are very close (see the update in the question); I see you have a 4x difference here, probably because of the JVM versions. But as others said in the comments, you are also right about the root cause of this problem. – Sathrum
@Sathrum I reran with a different baseline and I still get 4x. Added my Java version and CPU. – Odalisque
It is probably an improvement after JDK 8; if that is true, it's good to know. I will try with JDK 11 as well to verify this. – Sathrum
@KarolDowbecki it does not have to be Blackhole; you could simply return the result, which looks a bit cleaner to me. Also, when you ask what the baseline result for a single operation is, do you mean the time taken for that single division? – Sec
This question is a follow-up of an earlier question and, as explained in the comments there, DoubleStream.sum() uses the Kahan summation algorithm, which provides higher precision but also takes a bit more time than a plain for loop. So it's not just loop vs. stream. You may compare .sum() with reduce(0, Double::sum)... (see the sketch after these comments) – Follow
@Follow If I understand correctly, the Kahan summation algorithm (KSA) was part of DoubleStream.sum() in JDK 8 as well, but on JDK 8 the scores are close (see the update section in my question); DoubleStream.sum() takes a little more time, which is probably due to the KSA. Whereas on JDK 11, the numbers for the simple loop are 4x better than on JDK 8. I should probably write a new question for this. – Sathrum
@KarolDowbecki you can be sure that the division has been elided in your baseline operation. For your original version, result += 1.0 / 1.0;, the term 1.0 / 1.0 was a compile-time constant. For your changed variant, result = 1; result = result / 1.0;, the runtime optimizer has to kick in, but then it does not get confused by the two subsequent (redundant) assignments. So you changed the increment operation to a plain assignment of 1, which is why it became even faster. A true division would be significantly slower. You should try something like result = (result + 1) / 1.0; – Follow
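
Putting the last two comments into code: a sketch of a reduce-based stream benchmark (to compare against the Kahan-compensated sum()) and of Follow's suggested baseline. The method names are illustrative, and the result field is the one declared in the answer's benchmark class:

@Benchmark
public double baselineDivision() {
    // the previous result feeds into the next iteration, so the JIT
    // cannot constant-fold the division away
    result = (result + 1) / 1.0;
    return result;
}

@Benchmark
public double testInversionSumUsingStreamsReduce() {
    // plain left-to-right reduction; unlike DoubleStream.sum() it skips
    // the Kahan error compensation, trading precision for a bit of speed
    return Arrays.stream(array).map(d -> 1 / d).reduce(0, Double::sum);
}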
