Is branch prediction not working?

In reference to this question, the answer specifies that the unsorted array takes more time because it fails branch prediction. But if we make a minor change to the program:

import java.util.Arrays;
import java.util.Random;


public class Main{

    public static void main(String[] args) {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c) {
            data[c] = rnd.nextInt() % 256;
        }

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            // Primary loop
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128) {
                    sum = data[c];
                }
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

Here I have replaced (from the original question)

if (data[c] >= 128) 
    sum += data[c];

with

if (data[c] >= 128) 
    sum = data[c];

the unsorted array gives approximately the same result as the sorted one. Why is branch prediction not working in this case?
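For reference, here is a self-contained harness (my own sketch, not from the original question; absolute timings are rough and affected by JVM warmup and JIT compilation) that runs both variants on sorted and unsorted copies of the same data:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: time both the `sum += data[c]` and `sum = data[c]` variants
// on sorted and unsorted copies of the same data.
public class VariantTiming {
    static double timePlusEquals(int[] data, int reps) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < reps; ++i)
            for (int c = 0; c < data.length; ++c)
                if (data[c] >= 128) sum += data[c];
        double secs = (System.nanoTime() - start) / 1e9;
        if (sum == 42) System.out.println();  // keep sum live so the loop isn't dead-code-eliminated
        return secs;
    }

    static double timeAssign(int[] data, int reps) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < reps; ++i)
            for (int c = 0; c < data.length; ++c)
                if (data[c] >= 128) sum = data[c];
        double secs = (System.nanoTime() - start) / 1e9;
        if (sum == 42) System.out.println();
        return secs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int[] unsorted = new int[32768];
        for (int c = 0; c < unsorted.length; ++c) unsorted[c] = rnd.nextInt() % 256;
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        int reps = 1000;
        System.out.println("unsorted += : " + timePlusEquals(unsorted, reps));
        System.out.println("sorted   += : " + timePlusEquals(sorted, reps));
        System.out.println("unsorted  = : " + timeAssign(unsorted, reps));
        System.out.println("sorted    = : " + timeAssign(sorted, reps));
    }
}
```

As the comments below note, a plain `nanoTime` harness like this is only suggestive; a proper measurement needs a benchmarking framework.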

Suffice answered 29/1, 2014 at 13:23 Comment(21)
On my PC sorted array runs about 3 times faster. Which results do you get? I don't see anything strange.Rescue
which results are you comparing?Suffice
Sorted and unsorted arrays. And you? ;)Rescue
I've used sorted and unsorted arrays too; it takes about ~5 seconds on a Core i5-3230M. Usually the unsorted one is even faster, but that can be ignored because of randomness...Suffice
Please... that is not how you measure anything on the JVM. Java is not C, it has a very complex runtime which JIT-compiles the code at some point, which is not the initial point---and a myriad of other concerns which impact the measurement.Mousetrap
I'm not measuring anything. In the context of the referenced question, the answer provides a very definite explanation that the delay is due to branch prediction failure. If that's the cause, then why is branch prediction not working in this scenario?Suffice
Your answer works for C, it does not work for Java. Java does not run on a processor that does branch prediction. Java runs on a virtual computing machine defined in behavior by the Java specifications.Cephalochordate
@Cephalochordate "Java does not run on a processor that does branch prediction" => of course it does (or to be pedantic: the JVM that executes the java code does)!Semiotics
@MarkoTopolnik although I agree with your general comment I don't understand the behaviour...Semiotics
@Semiotics Yes that's the pedantic view but it is IMO the way you should think of Java. Especially if you try to apply low level hardware optimizations to a language that deliberately abstracts low level differences.Cephalochordate
@Semiotics It is not clear whether OP has successfully reproduced the result from the original question. He only says what happened after the change.Mousetrap
@Cephalochordate branch prediction can have an impact on a Java application. I'm not saying it should be the primary concern but you can't say that it is not applicable to Java either.Semiotics
@MarkoTopolnik I can reproduce what he describes: using sum += data[c] with or without sorting shows a perf difference (sorting 3x faster) but using sum = data[c] with or without sorting shows no perf difference.Semiotics
@Semiotics I reproduce as well. In addition, I find a significant systematic difference between trials (JVM runs), apparently depending on the exact data in the array. The timing within a trial is stable. For example: 17 ms, then 20 ms, then 24 ms.Mousetrap
@Semiotics Forget sorting: use data[c] = (c/8) % 256 and compare with (c*31) % 256;Mousetrap
@Semiotics I didn't try to say it can't. But how / where / when is mostly up to the JVM because it can re-arrange your code in ways you did not write (e.g. add implicit array index checking branches which might interrupt the branch prediction happening in pure C or even skip executing code because it has no semantic (side)effect.)Cephalochordate
@Rescue HotSpot 7 64-bit, testing with jmh, not with OP's actual code.Mousetrap
I'm getting even stranger results. Repeatably. Unsorted with +=: 9.8 seconds; unsorted with =: 3 seconds; sorted with +=: 2 seconds; sorted with =: 3 seconds. And warming up by running the loop a few times first doesn't impact this result significantly. (jdk1.7.0_17 on MacOS)Cabbageworm
On x32 1.6 and 1.7 it's not reproducible.Rescue
@ElliottFrisch true but I don't think that this is what happens here.Semiotics
@ElliottFrisch Since the effect is fully reproducible after entirely eliminating the outer loop, the conclusion is that this loop rearrangement either doesn't happen or doesn't account for any speed advantage.Mousetrap

I have used jmh to analyze this. Here is my code:

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Fork(2)
public class Comparison
{
  static final int SIZE = 1<<15;
  final int[] data = new int[SIZE];

  @Setup
  public void setup() {
    int i = 1;
    for (int c = 0; c < SIZE; ++c) data[c] = (i*=611953);
    for (int c = 0; c < SIZE; ++c) data[c] = data[c] >= 128? 128 : 127;
  }

  @GenerateMicroBenchmark
  public long sum() {
    long sum = 0;
    for (int c = 0; c < SIZE; ++c) if (data[c] >= 128) sum += data[c];
    return sum;
  }
}

Notice I don't use either sorting or random number generation; they are an unnecessary complication. With the formula used in the above code:

data[c] = (i*=611953);

I get 132 µs runtime. If I comment out the line involving

data[c] = data[c] >= 128? 128 : 127;

the time doesn't change at all. This eliminates all arithmetic considerations and focuses on branch prediction. If I use

data[c] = 127;

I get 13 µs, and if I use

data[c] = 128;

I get 16 µs. That's the "base case", stressing the difference between constant branching decisions.
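These fill patterns can be checked directly for how predictable the branch is (my own sketch, separate from the jmh benchmark): the multiplicative formula yields a mixed, hard-to-predict sequence of branch outcomes, while the constant fills make the outcome fixed.

```java
// Sketch: count how often the `data[c] >= 128` branch is taken for
// each of the three fill patterns discussed above.
public class BranchPatternCheck {
    static final int SIZE = 1 << 15;

    // Fraction of iterations in which the branch is taken.
    static double takenFraction(int[] data) {
        int taken = 0;
        for (int c = 0; c < data.length; ++c) if (data[c] >= 128) ++taken;
        return (double) taken / data.length;
    }

    public static void main(String[] args) {
        int[] mixed = new int[SIZE];
        int i = 1;
        for (int c = 0; c < SIZE; ++c) mixed[c] = (i *= 611953);

        int[] allLow = new int[SIZE];
        java.util.Arrays.fill(allLow, 127);   // branch never taken
        int[] allHigh = new int[SIZE];
        java.util.Arrays.fill(allHigh, 128);  // branch always taken

        System.out.println("mixed:   " + takenFraction(mixed));
        System.out.println("all 127: " + takenFraction(allLow));
        System.out.println("all 128: " + takenFraction(allHigh));
    }
}
```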

My conclusion: this is definitely the effect of low-level branch prediction.

Could the JIT reverse the loop?

Let's analyze your intervention now. If I use the formula as presented in my code above, but change

if (data[c] >= 128) sum += data[c];

to

if (data[c] >= 128) sum = data[c];

then the timing indeed drops from 132 µs to 27 µs.

This is my guess at explaining the drop: an optimizing trick the JIT compiler can do is to reverse the direction of the loop. Now your code becomes

for (int c = SIZE-1; c >= 0; --c) if (data[c] >= 128) { sum = data[c]; break; }

The loop has been short-circuited to the minimum number of iterations needed to reach the same outcome as the original loop.
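For what it's worth, the conjectured rewrite really is semantically equivalent for the `sum = data[c]` variant; here is a small sketch of my own (not the JIT's actual output) confirming it:

```java
// Sketch: the forward loop with `sum = data[c]` and the conjectured
// reversed, short-circuited loop produce the same final value of sum.
public class LoopReversal {
    static long forward(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum = data[c];   // keeps the LAST qualifying element
        return sum;
    }

    static long reversed(int[] data) {
        long sum = 0;
        for (int c = data.length - 1; c >= 0; --c)
            if (data[c] >= 128) { sum = data[c]; break; }  // first hit from the end
        return sum;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(0);
        int[] data = new int[32768];
        for (int c = 0; c < data.length; ++c) data[c] = rnd.nextInt() % 256;
        System.out.println(forward(data) == reversed(data));  // prints true
    }
}
```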

I added this

data[SIZE-1] = 128;

to the end of the setup() method, but it didn't change the timing. That would seem to invalidate the naïve version of the "loop reversal" conjecture.

No, it's probably cmovl

In analyzing assembly I find this:

cmp edx, 0x80
cmovl eax, ebx

cmovl is a conditional move instruction: it performs the effect of the assignment in the then branch without any jump, and therefore without any penalty for a mispredicted branch. This is a good explanation of the actual effect.
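The same branch-elimination idea can be expressed manually in Java source. This is a sketch of the general technique, not what the JIT actually emits: a sign-bit mask replaces the conditional with straight-line arithmetic, analogous in spirit to cmovl.

```java
// Sketch: branchless equivalent of `if (data[c] >= 128) sum += data[c];`.
// Assumes data values are small enough (e.g. in [-255, 255], as in the
// question) that data[c] - 128 cannot overflow.
public class Branchless {
    static long sumBranchy(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum += data[c];
        return sum;
    }

    static long sumBranchless(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c) {
            // (data[c] - 128) >> 31 is all ones when data[c] < 128, else 0;
            // inverting it yields a mask that keeps data[c] only when the
            // condition holds.
            int mask = ~((data[c] - 128) >> 31);
            sum += data[c] & mask;
        }
        return sum;
    }
}
```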

Mousetrap answered 29/1, 2014 at 14:38 Comment(7)
@Semiotics I have substantially changed my answer.Mousetrap
Well spotted detective. CMOVL definitely removes the branch. Smart JIT!Semiotics
@assylias: Smart JIT, but only sometimes. CMOVL could be used for += as well, and I'd say that the JIT does it sometimes, but the decision isn't optimal.Catawba
@Catawba Indeed - I remember that post.Semiotics
@Catawba There is a hint of that in the post linked to by OP in this question. In that question, a respondent in the comment said he is not observing the difference at all, with an appropriate -O setting on gcc.Mousetrap
That corresponds with my question, assuming the JIT chooses between a branch and a conditional move based on profiling (which is very good) but grossly underestimates the cost of branching (which is pretty bad). In the comment you kindly pointed me to, -O3 leads to a conditional move, while -O2 leads to branching. Not sure if such a behavior makes sense.Catawba
Good catch. I believe a decent c/c++ compiler would have done something similar.Outcrop
