When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?
I just read about branch prediction and wanted to see how it works with Java 8 streams.

However, the performance of streams always turns out to be worse than that of traditional loops.

int totalSize = 32768;
int filterValue = 1280;
int[] array = new int[totalSize];
Random rnd = new Random(0);
int loopCount = 10000;

for (int i = 0; i < totalSize; i++) {
    // array[i] = rnd.nextInt() % 2560; // Unsorted Data
    array[i] = i; // Sorted Data
}

long start = System.nanoTime();
long sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
    }
}
long total = System.nanoTime() - start;
System.out.printf("Conditional Operator Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
            sum += array[c];
        }
    }
}
total = System.nanoTime() - start;
System.out.printf("Branch Statement Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).parallel().filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Parallel Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

Output :

  1. For Sorted-Array :

    Conditional Operator Time : 294062652 ns, (0.294063 sec) 
    Branch Statement Time : 272992442 ns, (0.272992 sec) 
    Streams Time : 806579913 ns, (0.806580 sec) 
    Parallel Streams Time : 2316150852 ns, (2.316151 sec) 
    
  2. For Un-Sorted Array:

    Conditional Operator Time : 367304250 ns, (0.367304 sec) 
    Branch Statement Time : 906073542 ns, (0.906074 sec) 
    Streams Time : 1268648265 ns, (1.268648 sec) 
    Parallel Streams Time : 2420482313 ns, (2.420482 sec) 
    

I tried the same code using List:
list.stream() instead of Arrays.stream(array)
list.get(c) instead of array[c]

Output :

  1. For Sorted-List :

    Conditional Operator Time : 860514446 ns, (0.860514 sec) 
    Branch Statement Time : 663458668 ns, (0.663459 sec) 
    Streams Time : 2085657481 ns, (2.085657 sec) 
    Parallel Streams Time : 5026680680 ns, (5.026681 sec) 
    
  2. For Un-Sorted List

    Conditional Operator Time : 704120976 ns, (0.704121 sec) 
    Branch Statement Time : 1327838248 ns, (1.327838 sec) 
    Streams Time : 1857880764 ns, (1.857881 sec) 
    Parallel Streams Time : 2504468688 ns, (2.504469 sec) 
    

I referred to a few blogs (this and this) that suggest the same performance issue with streams.

  1. I agree that programming with streams is nice and easier in some scenarios, but when we're losing out on performance, why do we need to use them? Is there something I'm missing?
  2. In which scenario do streams perform on par with loops? Is it only when the function you define takes a long time, making the loop overhead negligible?
  3. In none of the scenarios could I see streams taking advantage of branch prediction (I tried with sorted and unordered streams, but to no avail; they were more than twice as slow as normal streams).
Heteronym answered 22/12, 2016 at 8:26 Comment(8)
Most performance problems in applications are caused by premature optimization like this.Uncinus
@TimothyTruckle: I am curious. Could you give an example?Sturges
@Sturges OK, maybe not the most performance problems, but problems in the programs maintainability and evolvability: ubiquity.acm.org/article.cfm?id=1513451 - wiki.c2.com/?PrematureOptimization - flounder.com/optimization.htmUncinus
Maybe streams could be implemented better, who knows...Lullaby
@TimothyTruckle That's why I asked. I know about the other problems. Never heard of performance problems due to "premature" optimization.Sturges
Your assumption that performance should be the primary consideration is deeply misguided. Write the code that most clearly expresses your intent. Streams are plenty fast for the vast majority of cases.Hardspun
@Sturges It's not unheard of for people to completely misunderstand where the performance bottleneck is.Tauro
Why are you trying to discuss performance without even providing a proper benchmark? Any measurements you do with a half-assed measuring system like this are worthless.Sprawl

I agree that programming with streams is nice and easier in some scenarios, but when we're losing out on performance, why do we need to use them?

Performance is rarely an issue. It would be unusual for more than 10% of your streams to need to be rewritten as loops to get the performance you need.

Is there something I'm missing out on?

Using parallelStream() is much easier with streams, and possibly more efficient than hand-rolled concurrency, as it's hard to write efficient concurrent code yourself.
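As a minimal sketch of that point, using the same sizes as in the question (the class name here is made up for illustration): the parallel variant of the pipeline is one extra call, while the fork/join splitting and the merging of partial sums is handled entirely by the library.

```java
import java.util.Arrays;

public class ParallelSumSketch {
    public static void main(String[] args) {
        int totalSize = 32_768;   // same sizes as in the question
        int filterValue = 1_280;
        int[] array = new int[totalSize];
        for (int i = 0; i < totalSize; i++) {
            array[i] = i; // sorted data
        }

        // Sequential pipeline, as in the question.
        long sequential = Arrays.stream(array).filter(v -> v >= filterValue).sum();

        // Parallel pipeline: one extra call; splitting the work across
        // threads and merging partial sums is handled by the library.
        long parallel = Arrays.stream(array).parallel().filter(v -> v >= filterValue).sum();

        System.out.println(sequential + " " + parallel);
    }
}
```

Whether the parallel version is actually faster depends on the data size and the cost per element; for an array this small, the coordination overhead usually dominates, as the question's own numbers show.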

In which scenario do streams perform on par with loops? Is it only when the function you define takes a long time, making the loop overhead negligible?

Your benchmark is flawed in the sense that the code hasn't been JIT-compiled when timing starts. I would run the whole test in a loop to warm it up, as JMH does, or just use JMH.
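A minimal hand-rolled illustration of that advice (class and method names are made up, and this is still no substitute for JMH): run untimed warm-up iterations so the JIT can compile the hot code first, and consume the result so the computation cannot be dead-code-eliminated.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

public class WarmupSketch {
    static long time(String label, LongSupplier task) {
        // Warm-up: give the JIT a chance to compile the hot paths first.
        long sink = 0;
        for (int i = 0; i < 1_000; i++) sink += task.getAsLong();

        long start = System.nanoTime();
        for (int i = 0; i < 1_000; i++) sink += task.getAsLong();
        long total = System.nanoTime() - start;

        // Print the sink so the JIT cannot discard the computation.
        System.out.printf("%s: %d ns (sink=%d)%n", label, total, sink);
        return total;
    }

    public static void main(String[] args) {
        int[] array = new int[32_768];
        for (int i = 0; i < array.length; i++) array[i] = i;

        time("loop", () -> {
            long sum = 0;
            for (int v : array) if (v >= 1_280) sum += v;
            return sum;
        });
        time("stream", () ->
            Arrays.stream(array).filter(v -> v >= 1_280).asLongStream().sum());
    }
}
```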

In none of the scenarios could I see streams taking advantage of branch prediction

Branch prediction is a CPU feature, not a JVM or streams feature.

Coyote answered 22/12, 2016 at 8:32 Comment(8)
True, branch prediction is a CPU feature. But when I know a traditional loop is going to perform better, and given that it can (sometimes) make use of my CPU's features, why would I prefer streams? Also, even other blogs that did similar testing found the performance impact. So I wasn't sure if using streams would give me any advantage. You're right, it's up to the programmer to decide what he wants to choose, and in case performance is a major concern he can always switch to loops.Heteronym
@Bandi Kishore: when you see the parallel processing slowing the operation down by factor two, you may consider that the array is way too tiny to allow useful statements about the performance. Also, you should learn that while a conditional expression looks different, i.e. more compact than an if statement, there is no technical difference in the code. Both contain branches, so if the conditional expression appears to be significantly faster, it’s a sign of a flawed benchmark setup, as other side effects seem to dominate the performance.Fibrin
@Fibrin I don't think that's true. Conditional statements are actually handled in a different way by the system (at least as per what I've read; there is a separate instruction called cmovl which performs this), so it's relatively faster. Source: https://mcmap.net/q/14898/-why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array Even if the benchmark is flawed, the difference in output shouldn't be this high.Heteronym
@Bandi Kishore: you have tagged your question with [java] and posted Java source code only. In Java, there is no such thing like cmovl. Your source code is compiled to Java bytecode first and if two different constructs produce identical bytecode, they may or may not get optimized to whatever native code you may think of, but they can’t exhibit a fundamental difference in any way. The JVM simply doesn’t know, whether you used an if statement or a conditional expression in the source code. All it sees, are branches in the bytecode.Fibrin
@Fibrin So you mean that in Java conditional operators and if statements have no difference? It's strange to see a difference in the output every single time I use these operators. For unsorted data with the same loops (array or list) there is a clear difference: conditional operators always perform better (at least 2x better). So there has to be an explanation for this difference, right?Heteronym
@Bandi Kishore: the difference is that in one case, you’re adding zero if the condition isn’t fulfilled, whereas in the other, you’re not adding a value at all in that case. So there’s a slight difference which may guide the JVM’s optimization decisions in a different direction, but the result is not as predictable as you seem to think. But in either case, the byte code is not branch-free. By the way, you may similarly replace .filter(value -> value>=filterValue) with .map(value -> value>=filterValue? value: 0) to see whether there’s a benefit in your particular runtime environment.Fibrin
@Bandi Kishore: by the way, your sorted array has 1280 values below the threshold and 32768 - 1280 above. That produces an entirely different branch likelihood than the random data, which is distributed almost evenly to both sides (almost; you should use rnd.nextInt(bound) instead of rnd.nextInt() % bound). If you want to compare processing sorted and unsorted arrays, you should simply sort or shuffle the array between the runs, without changing the numbers.Fibrin
@Fibrin You're right. I need to make some changes in my code and re-evaluate this. I guess Flown's answer covers some of it. My assumptions were quite wrong. Thanks for the clarification :)Heteronym

Java is a high-level language, sparing the programmer from low-level performance optimization.

Never choose a certain approach for performance reasons unless you have proven that this is a problem in your real application.

Your measurements show some negative effect for streams, but the difference is below observability. Therefore, it's not a problem. Also, this test is a "synthetic" situation, and the code may behave completely differently in a heavy-duty production environment. Furthermore, the machine code created from your Java (byte) code by the JIT may change in future Java (maintenance) releases and make your measurements obsolete.

In conclusion: choose the syntax or approach that best expresses your (the programmer's) intention. Keep to that same approach or syntax throughout the program unless you have a good reason to change it.

Uncinus answered 22/12, 2016 at 8:44 Comment(3)
More succinctly: Premature optimization kills projects.Fadeout
@Fadeout I love how people hide behind this ;)Overburden
@TimothyTruckle Agreed. I was looking at some low-level details which weren't much of a concern, and if they were, I could always switch back to loops. Good explanation :)Heteronym

Everything has been said already, but I want to show you what your code could look like using JMH.

@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

  private final int totalSize = 32_768;
  private final int filterValue = 1_280;
  private final int loopCount = 10_000;
  // private Random rnd;

  private int[] array;

  @Setup
  public void setup() {
    array = IntStream.range(0, totalSize).toArray();

    // rnd = new Random(0);
    // array = rnd.ints(totalSize).map(i -> i % 2560).toArray();
  }

  @Benchmark
  public long conditionalOperatorTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
      }
    }
    return sum;
  }

  @Benchmark
  public long branchStatementTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
          sum += array[c];
        }
      }
    }
    return sum;
  }

  @Benchmark
  public long streamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).filter(value -> value >= filterValue).sum();
    }
    return sum;
  }

  @Benchmark
  public long parallelStreamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).parallel().filter(value -> value >= filterValue).sum();
    }
    return sum;
  }
}

The results for a sorted array:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   119833793,881 ±   1345228,723  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   118146194,368 ±   1748693,962  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   499436897,422 ±   7344346,333  ns/op
MyBenchmark.streamsTime              avgt   30  1126768177,407 ± 198712604,716  ns/op

Results for unsorted data:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   534932594,083 ±   3622551,550  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   530641033,317 ±   8849037,036  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   489184423,406 ±   5716369,132  ns/op
MyBenchmark.streamsTime              avgt   30  1232020250,900 ± 185772971,366  ns/op

I can only say that there are many possible JVM optimizations at play, and maybe branch prediction is involved as well. Now it's up to you to interpret the benchmark results.

Belong answered 22/12, 2016 at 12:40 Comment(3)
Your tests are a bit flawed: 4 test methods, 3 forks; warm-up measured in nanoseconds (make it at least milliseconds); results in nanoseconds as well. Also the errors are pretty big; you could try to execute them with -Xmx -Xms 4G, for example, to make sure GC calls are not messing up your results.Overburden
that array generation should really be a set-up method.Overburden
@Overburden You're right, this benchmark is a bit flawed in terms of GC, min and max heap size and the setup step, but not in forking, time units and warm-up. Since I haven't specified any time, there is no limit, so warm-up is limited to 1 second. Also, I think you should read about @Fork, since every method is forked 3 times, not all methods together. I really don't care about a 5-10% error, since the whole benchmark should show a trend, not be a perfect benchmark.Belong

I will add my $0.02 here.

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams

Branch prediction is a CPU feature; it has nothing to do with the JVM. It is needed to keep the CPU pipeline full and ready to do something. Measuring or predicting branch prediction is extremely hard (unless you actually know the EXACT things the CPU will do). It will depend on, at least, the load the CPU is under right now (which might be a lot more than your program alone).

However the performance with Streams is always turning out to be worse than traditional loops

This statement and the previous one are unrelated. Yes, streams will be slower for simple examples like yours, up to 30% slower, which is OK. You could measure, for a particular case, how much slower or faster they are via JMH, as others have suggested, but that proves only that case and only that load.

At the same time you might be working with Spring, Hibernate, services, etc., that do things in milliseconds while your streams work in nanoseconds, and you worry about the performance? You are questioning the speed of the fastest part of your code? That is, of course, a theoretical concern.

And about your last point, that you tried with sorted and unsorted arrays and it gave bad results: this is no indication of branch prediction happening or not. You have no idea at which point the prediction happened, if it did at all, unless you can look inside the actual CPU pipelines, which you did not.

Overburden answered 22/12, 2016 at 22:3 Comment(1)
Yes. I was comparing 2 different things here. And you're right: compared to the value streams are adding, it's okay not to look at such minor details. +1 for comparing it with the frameworks we use in spite of them working in ms, because they just make life easy.Heteronym

How can my Java program run fast?

Long story short, Java programs can be accelerated by:

  1. Multithreading
  2. JIT

Do streams relate to Java program speedup?

Yes!

  1. Note the Collection.parallelStream() and Stream.parallel() methods for multithreading
  2. One can write a for loop long enough for the JIT to skip compiling it. Lambdas are typically small and can be compiled by the JIT, so there's a possibility to gain performance.
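As a small sketch of point 1 (the class name is illustrative): a collection offers two equivalent entry points into a parallel pipeline.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelEntryPoints {
    public static void main(String[] args) {
        List<Integer> list = IntStream.range(0, 1_000)
                .boxed()
                .collect(Collectors.toList());

        // Entry point 1: a parallel stream directly from the collection.
        long a = list.parallelStream().filter(v -> v % 2 == 0).count();

        // Entry point 2: a sequential stream switched to parallel.
        long b = list.stream().parallel().filter(v -> v % 2 == 0).count();

        System.out.println(a + " " + b); // both count the even values
    }
}
```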

In what scenario can a stream be faster than a for loop?

Let's take a look at jdk/src/share/vm/runtime/globals.hpp

develop(intx, HugeMethodLimit,  8000,
        "Don't compile methods larger than this if "
        "+DontCompileHugeMethods")

If you have a long enough loop body, it won't be compiled by the JIT and will run slowly. If you rewrite such a loop using streams, you'll probably use the map, filter, and flatMap methods, which split the code into pieces, and every piece can be small enough to fit under the limit. For sure, writing huge methods has other downsides apart from missed JIT compilation. This scenario is worth considering if, for example, you have a lot of generated code.
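A sketch of that idea (the helper names are made up, and the effect only matters for genuinely huge method bodies): a stream pipeline naturally splits the work into small lambdas and helper methods, each far below the HugeMethodLimit threshold, so each stage can be JIT-compiled independently.

```java
import java.util.stream.IntStream;

public class SmallStagesSketch {
    // Each stage lives in its own tiny method, far below any JIT size limit.
    static boolean keep(int v) { return v % 3 != 0; }
    static int transform(int v) { return v * v + 1; }

    public static void main(String[] args) {
        // Instead of one enormous loop body (which, past the 8000-byte
        // HugeMethodLimit, HotSpot refuses to compile by default), the
        // work is split into stages the JIT can compile independently.
        int sum = IntStream.range(0, 100)
                .filter(SmallStagesSketch::keep)
                .map(SmallStagesSketch::transform)
                .sum();
        System.out.println(sum);
    }
}
```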

What about branch prediction?

Of course streams take advantage of branch prediction, as all other code does. However, branch prediction isn't a technique explicitly used to make streams faster, AFAIK.

So, when do I rewrite my loops to streams to achieve the best performance?

Never.

"Premature optimization is the root of all evil." (Donald Knuth)

Try to optimize the algorithm instead. Streams are an interface for functional-style programming, not a tool to speed up loops.

Daph answered 27/12, 2016 at 23:13 Comment(2)
Whenever someone mentions this quote, I feel the urge to repeat it in its original context: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified." (emphasis by me). But apart from this (and the "Never"), +1, also for the last sentence.Titanism
Personally, I feel that streams and lambdas are generally not as clear in intent and logic as traditional iteration idioms. Since everyone invokes Knuth all the time: he was also one of the original proponents of programming for clarity first and, as you say, optimizing in those cases where it is shown to be needed. Thus, I avoid lambdas unless they really make things clearer or solve a specific problem. Don't misunderstand, I am happy to use them in many cases. Often, I wrap complex lambda expressions in a method with an explanatory name and javadoc.Tynes
