Is branch prediction not working?

In reference to this question, the answer specifies that the unsorted array takes more time because it fails branch prediction. But if we make a minor change to the program:

import java.util.Arrays;
import java.util.Random;


public class Main{

    public static void main(String[] args) {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c) {
            data[c] = rnd.nextInt() % 256;
        }

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            // Primary loop
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128) {
                    sum = data[c];
                }
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

Here I have replaced (from the original question)

if (data[c] >= 128) 
    sum += data[c];

with

if (data[c] >= 128) 
    sum = data[c];

the unsorted array gives approximately the same result as the sorted one. Why is branch prediction not working in this case?
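For reference, here is a self-contained harness (my own sketch, not from the original question; absolute timings are rough and affected by JVM warmup and JIT compilation) that runs both variants on sorted and unsorted copies of the same data:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: time both the `sum += data[c]` and `sum = data[c]` variants
// on sorted and unsorted copies of the same data.
public class VariantTiming {
    static double timePlusEquals(int[] data, int reps) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < reps; ++i)
            for (int c = 0; c < data.length; ++c)
                if (data[c] >= 128) sum += data[c];
        double secs = (System.nanoTime() - start) / 1e9;
        if (sum == 42) System.out.println();  // keep sum live so the loop isn't dead-code-eliminated
        return secs;
    }

    static double timeAssign(int[] data, int reps) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < reps; ++i)
            for (int c = 0; c < data.length; ++c)
                if (data[c] >= 128) sum = data[c];
        double secs = (System.nanoTime() - start) / 1e9;
        if (sum == 42) System.out.println();
        return secs;
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int[] unsorted = new int[32768];
        for (int c = 0; c < unsorted.length; ++c) unsorted[c] = rnd.nextInt() % 256;
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        int reps = 1000;
        System.out.println("unsorted += : " + timePlusEquals(unsorted, reps));
        System.out.println("sorted   += : " + timePlusEquals(sorted, reps));
        System.out.println("unsorted  = : " + timeAssign(unsorted, reps));
        System.out.println("sorted    = : " + timeAssign(sorted, reps));
    }
}
```

As the comments below note, a plain `nanoTime` harness like this is only suggestive; a proper measurement needs a benchmarking framework.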

Suffice answered 29/1, 2014 at 13:23 Comment(21)
On my PC sorted array runs about 3 times faster. Which results do you get? I don't see anything strange.Rescue
which results are you comparing?Suffice
Sorted and unsorted arrays. And you? ;)Rescue
I've used sorted and unsorted arrays too; it takes about ~5 seconds on a Core i5-3230M. Usually the unsorted one is even faster, but that can be ignored because of randomness...Suffice
Please... that is not how you measure anything on the JVM. Java is not C, it has a very complex runtime which JIT-compiles the code at some point, which is not the initial point---and a myriad of other concerns which impact the measurement.Mousetrap
I'm not measuring anything. In the context of the referenced question, the answer provides a very definite explanation that the delay is due to branch prediction failure. If that's the cause, then why is branch prediction not working in this scenario?Suffice
Your answer works for C, it does not work for Java. Java does not run on a processor that does branch prediction. Java runs on a virtual computing machine defined in behavior by the Java specifications.Cephalochordate
@Cephalochordate "Java does not run on a processor that does branch prediction" => of course it does (or to be pedantic: the JVM that executes the java code does)!Semiotics
@MarkoTopolnik although I agree with your general comment I don't understand the behaviour...Semiotics
@Semiotics Yes that's the pedantic view but it is IMO the way you should think of Java. Especially if you try to apply low level hardware optimizations to a language that deliberately abstracts low level differences.Cephalochordate
@Semiotics It is not clear whether OP has successfully reproduced the result from the original question. He only says what happened after the change.Mousetrap
@Cephalochordate branch prediction can have an impact on a Java application. I'm not saying it should be the primary concern but you can't say that it is not applicable to Java either.Semiotics
@MarkoTopolnik I can reproduce what he describes: using sum += data[c] with or without sorting shows a perf difference (sorting 3x faster) but using sum = data[c] with or without sorting shows no perf difference.Semiotics
@Semiotics I reproduce as well. In addition, I find a significant systematic difference between trials (JVM runs), apparently depending on the exact data in the array. The timing within a trial is stable. For example: 17 ms, then 20 ms, then 24 ms.Mousetrap
@Semiotics Forget sorting: use data[c] = (c/8) % 256 and compare with (c*31) % 256;Mousetrap
@Semiotics I didn't try to say it can't. But how / where / when is mostly up to the JVM because it can re-arrange your code in ways you did not write (e.g. add implicit array index checking branches which might interrupt the branch prediction happening in pure C or even skip executing code because it has no semantic (side)effect.)Cephalochordate
@Rescue HotSpot 7 64-bit, testing with jmh, not with OP's actual code.Mousetrap
I'm getting even stranger results. Repeatably. Unsorted with +=: 9.8 seconds; unsorted with =: 3 seconds; sorted with +=: 2 seconds; sorted with =: 3 seconds. And warming up by running the loop a few times first doesn't impact this result significantly. (jdk1.7.0_17 on MacOS)Cabbageworm
On x32 1.6 and 1.7 it's not reproducible.Rescue
@ElliottFrisch true but I don't think that this is what happens here.Semiotics
@ElliottFrisch Since the effect is fully reproducible after entirely eliminating the outer loop, the conclusion is that this loop rearrangement either doesn't happen or doesn't account for any speed advantage.Mousetrap

I have used jmh to analyze this. Here is my code:

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Fork(2)
public class Comparison
{
  static final int SIZE = 1<<15;
  final int[] data = new int[SIZE];

  @Setup
  public void setup() {
    int i = 1;
    for (int c = 0; c < SIZE; ++c) data[c] = (i*=611953);
    for (int c = 0; c < SIZE; ++c) data[c] = data[c] >= 128? 128 : 127;
  }

  @GenerateMicroBenchmark
  public long sum() {
    long sum = 0;
    for (int c = 0; c < SIZE; ++c) if (data[c] >= 128) sum += data[c];
    return sum;
  }
}

Notice I don't use either sorting or random number generation; they are an unnecessary complication. With the formula used in the above code:

data[c] = (i*=611953);

I get 132 µs runtime. If I comment out the line involving

data[c] = data[c] >= 128? 128 : 127;

the time doesn't change at all. This eliminates all arithmetic considerations and focuses on branch prediction. If I use

data[c] = 127;

I get 13 µs, and if I use

data[c] = 128;

I get 16 µs. That's the "base case", stressing the difference between constant branching decisions.
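These fill patterns can be checked directly for how predictable the branch is (my own sketch, separate from the jmh benchmark): the multiplicative formula yields a mixed, hard-to-predict sequence of branch outcomes, while the constant fills make the outcome fixed.

```java
// Sketch: count how often the `data[c] >= 128` branch is taken for
// each of the three fill patterns discussed above.
public class BranchPatternCheck {
    static final int SIZE = 1 << 15;

    // Fraction of iterations in which the branch is taken.
    static double takenFraction(int[] data) {
        int taken = 0;
        for (int c = 0; c < data.length; ++c) if (data[c] >= 128) ++taken;
        return (double) taken / data.length;
    }

    public static void main(String[] args) {
        int[] mixed = new int[SIZE];
        int i = 1;
        for (int c = 0; c < SIZE; ++c) mixed[c] = (i *= 611953);

        int[] allLow = new int[SIZE];
        java.util.Arrays.fill(allLow, 127);   // branch never taken
        int[] allHigh = new int[SIZE];
        java.util.Arrays.fill(allHigh, 128);  // branch always taken

        System.out.println("mixed:   " + takenFraction(mixed));
        System.out.println("all 127: " + takenFraction(allLow));
        System.out.println("all 128: " + takenFraction(allHigh));
    }
}
```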

My conclusion: this is definitely the effect of low-level branch prediction.

Could the JIT reverse the loop?

Let's analyze your intervention now. If I use the formula as presented in my code above, but change

if (data[c] >= 128) sum += data[c];

to

if (data[c] >= 128) sum = data[c];

then the timing indeed drops from 132 µs to 27 µs.

This is my guess at explaining the drop: an optimizing trick the JIT compiler can do is to reverse the direction of the loop. Now your code becomes

for (int c = SIZE-1; c >= 0; --c) if (data[c] >= 128) { sum = data[c]; break; }

The loop has been short-circuited to the minimum number of iterations needed to reach the same outcome as the original loop.
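For what it's worth, the conjectured rewrite really is semantically equivalent for the `sum = data[c]` variant; here is a small sketch of my own (not the JIT's actual output) confirming it:

```java
// Sketch: the forward loop with `sum = data[c]` and the conjectured
// reversed, short-circuited loop produce the same final value of sum.
public class LoopReversal {
    static long forward(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum = data[c];   // keeps the LAST qualifying element
        return sum;
    }

    static long reversed(int[] data) {
        long sum = 0;
        for (int c = data.length - 1; c >= 0; --c)
            if (data[c] >= 128) { sum = data[c]; break; }  // first hit from the end
        return sum;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(0);
        int[] data = new int[32768];
        for (int c = 0; c < data.length; ++c) data[c] = rnd.nextInt() % 256;
        System.out.println(forward(data) == reversed(data));  // prints true
    }
}
```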

I added this

data[SIZE-1] = 128;

to the end of the setup() method, but it didn't change the timing. That would seem to invalidate the naïve version of the "loop reversal" conjecture.

No, it's probably cmovl

In analyzing assembly I find this:

cmp edx, 0x80
cmovl eax, ebx

cmovl is a conditional move instruction: it performs the effect of the assignment in the then branch without any jump, and therefore without any penalty for a mispredicted branch. This is a good explanation of the actual effect.
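The same branch-elimination idea can be expressed manually in Java source. This is a sketch of the general technique, not what the JIT actually emits: a sign-bit mask replaces the conditional with straight-line arithmetic, analogous in spirit to cmovl.

```java
// Sketch: branchless equivalent of `if (data[c] >= 128) sum += data[c];`.
// Assumes data values are small enough (e.g. in [-255, 255], as in the
// question) that data[c] - 128 cannot overflow.
public class Branchless {
    static long sumBranchy(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c)
            if (data[c] >= 128) sum += data[c];
        return sum;
    }

    static long sumBranchless(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c) {
            // (data[c] - 128) >> 31 is all ones when data[c] < 128, else 0;
            // inverting it yields a mask that keeps data[c] only when the
            // condition holds.
            int mask = ~((data[c] - 128) >> 31);
            sum += data[c] & mask;
        }
        return sum;
    }
}
```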

Mousetrap answered 29/1, 2014 at 14:38 Comment(7)
@Semiotics I have substantially changed my answer.Mousetrap
Well spotted detective. CMOVL definitely removes the branch. Smart JIT!Semiotics
@assylias: Smart JIT, but only sometimes. CMOVL could be used for += as well, and I'd say that the JIT does it sometimes, but the decision isn't optimal.Catawba
@Catawba Indeed - I remember that post.Semiotics
@Catawba There is a hint of that in the post linked to by OP in this question. In that question, a respondent in the comment said he is not observing the difference at all, with an appropriate -O setting on gcc.Mousetrap
That corresponds with my question, assuming the JIT chooses between a branch and a conditional move based on profiling (which is very good) but grossly underestimates the cost of branching (which is pretty bad). In the comment you kindly pointed me to, -O3 leads to a conditional move, while -O2 leads to branching. Not sure if such a behavior makes sense.Catawba
Good catch. I believe a decent c/c++ compiler would have done something similar.Outcrop
