Why doesn't the bounds check get eliminated?

I wrote a simple benchmark in order to find out whether the bounds check can be eliminated when the array index gets computed via a bitwise AND. This is basically what nearly all hash tables do: they compute

h & (table.length - 1)

as an index into the table, where h is the hashCode or a derived value. The results show that the bounds check doesn't get eliminated.
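For illustration, here's a minimal, self-contained sketch of this indexing scheme (the table size and hash value are made up):

```java
public class MaskIndexDemo {
    public static void main(String[] args) {
        int[] table = new int[16];         // power-of-two size
        int h = -123456789;                // hash codes may be negative
        int j = h & (table.length - 1);    // keeps the low 4 bits of h
        System.out.println(j);             // 11 -- always in [0, 15]
        System.out.println(h % 16);        // -5 -- plain % can go negative
    }
}
```

Because the mask clears the sign bit (for any power-of-two length), the result is always a valid index into a non-empty table, with no branch needed.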

The idea of my benchmark is pretty simple: Compute two values i and j, where both are guaranteed to be valid array indexes.

  • i is the loop counter. When it gets used as array index, the bounds check gets eliminated.
  • j gets computed as x & (table.length - 1), where x is some value changing on each iteration. When it gets used as array index, the bounds check does not get eliminated.

The relevant part is as follows:

for (int i=0; i<=table.length-1; ++i) {
    x += result;
    final int j = x & (table.length-1);
    result ^= i + table[j];
}

The other experiment uses

    result ^= table[i] + j;

instead. The difference in timing is maybe 15% (pretty consistently across different variants I've tried). My questions:

  • Are there other possible reasons for this besides bounds check elimination?
  • Is there some complicated reason I can't see why there's no bounds check elimination for j?

A summary of the answers

MarkoTopolnik's answer shows that it's all more complicated and that eliminating the bounds check is not guaranteed to be a win; in particular, on his computer the "normal" code is slower than the "masked" one. I guess this is because the elimination allows some additional optimization which turns out to be detrimental in this case (given the complexity of current CPUs, the compiler hardly ever knows for sure).

leventov's answer shows clearly that the array bounds check does get done in the "masked" case and that its elimination makes the code as fast as the "normal" one.

Donal Fellows points out that the masking doesn't work for a zero-length table, as x & (0-1) equals x. So the best the compiler can do is replace the bounds check with a zero-length check. But this is IMHO still worth it, as the zero-length check can easily be moved out of the loop.
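The zero-length corner case is easy to demonstrate (the value 42 is made up):

```java
public class ZeroLengthMask {
    public static void main(String[] args) {
        int[] table = new int[0];
        int x = 42;
        // table.length - 1 == -1 has all bits set, so the mask is a no-op
        int j = x & (table.length - 1);
        System.out.println(j);             // prints 42, not a valid index
        // table[j] would throw ArrayIndexOutOfBoundsException here
    }
}
```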

Proposed optimization

Because a[x & (a.length - 1)] throws if and only if a.length == 0, the compiler can do the following:

  • For each array access, check if the index has been computed via a bitwise and.
  • If so, check if either of the operands was computed as the array's length minus one.
  • If so, replace the bounds check by a zero-length check.
  • Let the existing optimizations take care of it.

Such an optimization should be pretty simple and cheap, as it only looks at the parent nodes in the SSA graph. Unlike many complex optimizations, it can never be detrimental: it only replaces one check with a slightly simpler one, so there's no problem even if the check can't be moved out of the loop.
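A source-level sketch of what this would amount to (my own illustration; the real rewrite would happen on the compiler's intermediate representation, and the method names are made up):

```java
public class HoistedCheckSketch {
    // Today: the JIT keeps a bounds check on table[j] in every iteration.
    static int masked(int[] table) {
        int result = 0, x = 0;
        for (int i = 0; i < table.length; ++i) {
            x += result;
            int j = x & (table.length - 1);
            result ^= i + table[j];        // per-access bounds check
        }
        return result;
    }

    // Proposed: one hoisted zero-length check; inside the loop, j is
    // provably in [0, table.length - 1], so no per-access check is needed.
    static int maskedHoisted(int[] table) {
        if (table.length == 0) return 0;   // the loop never runs anyway
        int result = 0, x = 0;
        for (int i = 0; i < table.length; ++i) {
            x += result;
            int j = x & (table.length - 1);
            result ^= i + table[j];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] t = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(masked(t) == maskedHoisted(t));  // true
    }
}
```

The two methods compute the same result for every input, including the empty array (where the loop body never runs), which is exactly why the replacement is safe.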

I'll post this to the hotspot-dev mailing list.

News

John Rose filed an RFE and there's already a "quick-and-dirty" patch.

Willtrude answered 11/2, 2014 at 13:17 Comment(33)
Was this the famous serial performance-question downvoter, or someone willing to drop a note?Willtrude
I see a possible reason: table[i] results in a sequential access pattern, whereas with table[j] it's more irregular. Just one or two cache misses could be enough to account for 15% difference.One
@MarkoTopolnik: I forgot to mention that the whole table surely fits in L1 (1024 entries only).Willtrude
There may be other low-level optimizations for sequential access, like fetching the next value before the pipeline reaches the stage where the actual index is known.One
You have confirmed the elimination of the bounds check?One
How can you be sure the bounds check doesn't get eliminated when you don't compute the index with & table.length - 1?Endue
@MarkoTopolnik: "fetching the next value before the pipeline reaches" - sounds interesting. "confirmed the elimination" - sort of... the output of PrintAssembly is pretty unreadable, but I can see cmp %r13d,%edx, jae ... present in timeMasked only.Willtrude
Apart from the eliminated bounds check, I see a more radical optimization in the "normal index" case. The loop is unrolled in both cases, but in the normal case it is also staged: first the values are bulk-loaded into various registers, then the registers are xor-ed into the accumulator.One
BTW The option -XX:CompileCommand=print,*Benchmark.time*, apart from filtering out everything you're not interested in, gives a nicer printout (doesn't show placeholders for actual register names, for one).One
To my surprise, I actually get better speed for the masked case! I see an 8% advantage on it (measuring for 10 seconds after 5 seconds warmup).One
This link tends to suggest the check is eliminated by HotSpot only when "the array is indexed by linear functions of the index variable".Gitt
@SpaceTrucker: I'm not sure, it's just my assumption. But the link posted by ochedru suggests it's true.Willtrude
@MarkoTopolnik: That's pretty strange, could you post your code somewhere? Concerning "fetching the next value" mentioned above: I replaced x += i by x += 1, so that the access is sequential except for a single wrap around, but not much changes. I also tried to eliminate x and set j = i & (table.length-1), which is equivalent to j = i, but seems to prevent the bound check elimination.Willtrude
Here it is: pastebin.com/a2qpuV1iOne
@MarkoTopolnik: Benchmark Mode Samples Mean Mean error Units o.o.j.s.JmhBoundsCheckBenchmark.maskedIndex avgt 20 1148.463 2.447 ns/op o.o.j.s.JmhBoundsCheckBenchmark.normalIndex avgt 20 1008.069 9.905 ns/op - I just renamed and ran your benchmark, and it confirms my results.Willtrude
So we can conclude that the difference is down to a) JVM version/platform, b) difference in CPUs. Anyway, I wouldn't go much further chasing this, given that the difference is small to begin with, and unstable across configurations.One
Have you tried x % (table.length-1) instead of x & (table.length-1)? Maybe the compiler is not smart enough to figure out bounds of bitwise and at compile time.Flyfish
@MarkoTopolnik: Yes, once 10% faster and once 10% slower, this sounds rather boring. I guess I should update my Java first.Willtrude
@SnakE: No, but with % you have no guarantee to get a valid index as it stays negative for negative x.Willtrude
You would have to dump the JITC-generated code to see. And even a given JITC may give different results depending on subtle differences.Madeup
@MarkoTopolnik: Didn't you confuse ns/op with ops/ns as I usually do? This is about the only explanation I can think of for bounds-checking code being faster. If not, would you mind trying my third experiment pair (a simplified version with i==j, just don't tell the JVM :D).Willtrude
Definitely didn't confuse them. pastebin.com/NcC0JbwK and the results: time3Masked 805.872 ns/op; time3Normal 945.730 ns/op.One
Totally agree with the % suggestion - did you guarantee that the array length is a power of 2? Otherwise your & doesn't make much sense, and definitely isn't enough to make your accesses sequential.Pentyl
@Leeor: I totally disagree with %. It's such a slow operation that hardly anybody uses it for hash tables. In order to ensure a valid index, you'd need to add some branching and this would make the pattern hard for the compiler to recognize. Concerning & I need a power of two for it to make sense, but I don't need it for the index to be valid (the only thing I need is a non-empty array).Willtrude
@Willtrude " hardly anybody uses % it for hash tables" -- Trove does, I do.Lussi
@maaartinus, I didn't mean he should use it in the optimized version, just for debugging, to make sure the reason is not doing & with some length like 100000Pentyl
@leventov: I hope you know what you're doing. And so do the Trove guys, but this may have historical reasons. Just note that I don't know about any use of it in Guava, and AFAIK the only use in the JDK is in String.intern. Division is dead slow, taking tens of cycles, while you can do up to four simple instructions in a single cycle.Willtrude
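The extra branching needed to turn % into a valid index for negative hashes can be sketched as follows (the helper name is mine):

```java
public class ModIndex {
    // A %-based slot computation must fix up negative results, because
    // Java's % takes the sign of the dividend.
    static int slot(int h, int length) {
        int j = h % length;
        if (j < 0) j += length;
        return j;                          // same as Math.floorMod(h, length)
    }

    public static void main(String[] args) {
        System.out.println(slot(-7, 10));  // prints 3
        System.out.println(slot(7, 10));   // prints 7
    }
}
```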
@leventov: See e.g. the WhitespaceMatcherBenchmark and its results for how costly division is.Willtrude
@Leeor: In my benchmark linked above the length is fixed, so nothing can go wrong. And as I said, with length being no power of two, the results must stay the same (for a length like 100000, they'd change due to cache misses).Willtrude
@MarkoTopolnik: I guess, I wasn't clear: This wasn't about relating cache misses to power-of-two sizes but to array length. In the benchmark above, arrays of size 1024 or 1111 must perform the same (and for both there are no cache misses as they nicely fit). For a big array there'd be cache misses and prefetching would work for i but (probably) not for j (see also your first comment here).Willtrude
They won't perform the same (except maybe if they fit in the L1 cache), because the HW prefetchers can handle sequential streams. If your array size is 1111, your & operation would get a complicated mask with some zeros masking off parts of your dataset, so you'll get both an irregular stream, and a subset of your entire datasetPentyl
@Leeor: But in this benchmark this doesn't matter, as everything fits in L1 (and the benchmark starts with a warm cache). AFAIK, when accessing L1, the pattern doesn't matter, as each L1 access takes the same time. The pattern is very important when accessing other levels because of the prefetchers.Willtrude
Agreed, I didn't know if that's a drilled down examplePentyl
  1. No, this is evidently an effect of insufficiently smart bounds check elimination.

I've extended a benchmark by Marko Topolnik:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(BCElimination.N)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class BCElimination {
    public static final int N = 1024;
    private static final Unsafe U;
    private static final long INT_BASE;
    private static final long INT_SCALE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }

        INT_BASE = U.arrayBaseOffset(int[].class);
        INT_SCALE = U.arrayIndexScale(int[].class);
    }

    private final int[] table = new int[BCElimination.N];

    @Setup public void setUp() {
        final Random random = new Random();
        for (int i=0; i<table.length; ++i) table[i] = random.nextInt();
    }

    @GenerateMicroBenchmark public int normalIndex() {
        int result = 0;
        final int[] table = this.table;
        int x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i;
            final int j = x & (table.length-1);
            result ^= table[i] + j;
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndex() {
        int result = 0;
        final int[] table = this.table;
        int x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i;
            final int j = x & (table.length-1);
            result ^= i + table[j];
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndexUnsafe() {
        int result = 0;
        final int[] table = this.table;
        long x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i * INT_SCALE;
            final long j = x & ((table.length-1) * INT_SCALE);
            result ^= i + U.getInt(table, INT_BASE + j);
        }
        return result;
    }
}

Results:

Benchmark                                Mean   Mean error    Units
BCElimination.maskedIndex               1,235        0,004    ns/op
BCElimination.maskedIndexUnsafe         1,092        0,007    ns/op
BCElimination.normalIndex               1,071        0,008    ns/op


2. The second question is for the hotspot-dev mailing list rather than StackOverflow, IMHO.

Lussi answered 11/2, 2014 at 17:28 Comment(11)
Except that I checked the machine code and 1) maskedIndex had the bounds check where normalIndex didn't; 2) normalIndex used apparently streamlined code where the loop was both unrolled and reordered into two stages; 3) maskedIndex was nevertheless faster on my machine (by 8%).One
@MarkoTopolnik that is really strange, but it says more about the peculiarities of your CPU's behaviour than about the subject of this questionLussi
You mean, this question can be separated from the peculiarities of CPU behavior? That's a strange thought... anyway, these are my results: maskedIndex 1.152 ns/op; maskedUnsafeIndex 1.116 ns/op; normalIndex 1.220 ns/op. Normal indexing is still the slowest on my machine.One
Strange that I didn't think about using Unsafe to check my conjecture!Willtrude
@Willtrude Note that introducing Unsafe doesn't really pinpoint the issue because different instructions are used to calculate the array offset (among other differences). Each just produces different machine code without a clear-cut way of comparing them.One
@MarkoTopolnik: That's true, but you can make it much more similar using U.getInt(table, INT_BASE + j * INT_SCALE) and this can be easily translated to exactly the same code. I haven't checked if it actually does.Willtrude
@Willtrude U.getInt(table, INT_BASE + j * INT_SCALE) works slower probably because the JIT compiles array indexing into a single compute-address-and-memory-op instruction, where the scale must be a very small immediate (max 3 bits in the instruction encoding), but when you write * INT_SCALE, it doesn't know that INT_SCALE is smaller than 8 and compiles this construction into a couple of instructions: mul, then offset-and-memory-op. Be careful. I didn't try * 4L, nor did I look into the assembly (I'm lazy), though.Lussi
@Lussi The way you wrote it also ends up as a mul.One
My benchmark's results showed no difference after this change. As INT_SCALE is private static final, I hope the JIT can do whatever it wants. I'm too lazy too. The safe way would be to set it to 4 and throw in the initializer if the JVM is too exotic. @MarkoTopolnik: Yes, but it's independent of x and can be moved out of the loop.Willtrude
In x += i * INT_SCALE, for each i there is a mul instruction. It can't be hoisted. Note however that mul by 4 is executed as shift left by 2, therefore it is very fast.One
@Willtrude Marko is right, my previous comment was removed due to obscene language. My assumption was that multiplying before masking leaves more flexibility for the CPU to arrange instructions on the arithmetic pipeline.Lussi

To start off, the main difference between your two tests is definitely in bounds check elimination; however, the way this influences the machine code is far from what the naïve expectation would suggest.

My conjecture:

The bounds check figures more strongly as a loop exit point than as additional code which introduces overhead.

The loop exit point prevents the following optimization which I have culled from the emitted machine code:

  • the loop is unrolled (this is true in all cases);
  • additionally, the fetch-from-array stage is done first for all unrolled steps, then the XOR-into-accumulator stage is done for all the steps.

If the loop can break out at any step, this staging would result in work performed for loop steps which were never actually taken.

Consider this slight modification of your code:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.N)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
 public class Measure {
  public static final int N = 1024;

  private final int[] table = new int[N];
  @Setup public void setUp() {
    final Random random = new Random();
    for (int i = 0; i < table.length; ++i) {
      final int x = random.nextInt();
      table[i] = x == 0? 1 : x;
    }
  }
  @GenerateMicroBenchmark public int normalIndex() {
    int result = 0;
    final int[] table = this.table;
    int x = 0;
    for (int i = 0; i <= table.length - 1; ++i) {
      x += i;
      final int j = x & (table.length - 1);
      final int entry = table[i];
      result ^= entry + j;
      if (entry == 0) break;
    }
    return result;
  }
  @GenerateMicroBenchmark public int maskedIndex() {
    int result = 0;
    final int[] table = this.table;
    int x = 0;
    for (int i = 0; i <= table.length - 1; ++i) {
      x += i;
      final int j = x & (table.length - 1);
      final int entry = table[j];
      result ^= i + entry;
      if (entry == 0) break;
    }
    return result;
  }
}

There is just one difference: I have added the check

if (entry == 0) break;

to give the loop a way to exit prematurely on any step. (I also introduced a guard to ensure no array entries are actually 0.)

On my machine, this is the result:

Benchmark                   Mode   Samples         Mean   Mean error    Units
o.s.Measure.maskedIndex     avgt         5        1.378        0.229    ns/op
o.s.Measure.normalIndex     avgt         5        0.924        0.092    ns/op

the "normal index" variant is substantially faster, as generally expected.

However, let us remove the additional check:

// if (entry == 0) break;

Now my results are these:

Benchmark                   Mode   Samples         Mean   Mean error    Units
o.s.Measure.maskedIndex     avgt         5        1.130        0.065    ns/op
o.s.Measure.normalIndex     avgt         5        1.229        0.053    ns/op

"Masked index" responded predictably (reduced overhead), but "normal index" is suddenly much worse. This is apparently due to a bad fit between the additional optimization step and my specific CPU model.

My point:

The performance model at such a detailed level is very unstable and, as witnessed on my CPU, even erratic.

One answered 12/2, 2014 at 9:19 Comment(4)
You think that this "the fetching from the array stage is done first for all unrolled steps" is the culprit, right? Interesting!Willtrude
Ideally, we should compare notes. The above technique isolates the effect of BCE from the effect of the additional staging optimization, so it would be interesting to see what it does on your side.One
Yes, we should. The comments here are rather unsuitable for this. I think it could be an interesting question, would you mind posting it? Otherwise, please send me an email to <my_name_here>@gmail.com.Willtrude
I have posted it as a question: #21739190One

In order to safely eliminate that bounds check, it is necessary to prove that

h & (table.length - 1)

is guaranteed to produce a valid index into table. It won't if table.length is zero (as you'll end up with & -1, an effective no-op). It also won't usefully do so if table.length is not a power of 2 (you'll lose information; consider the case where table.length is 17).
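The information loss for a non-power-of-two length is easy to see with length 17 (a made-up demonstration):

```java
public class BadMaskDemo {
    public static void main(String[] args) {
        int mask = 17 - 1;                 // 16 = 0b10000: a single bit set
        // only bit 4 of the hash survives, so every index is 0 or 16
        System.out.println(5 & mask);      // prints 0
        System.out.println(21 & mask);     // prints 16
    }
}
```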

How can the HotSpot compiler know that these bad conditions are not true? It has to be more conservative than a programmer can be, as the programmer can know more about the high-level constraints on the system (e.g., that the array is never empty and always has a number of elements that is a power of two).

Fortuneteller answered 12/2, 2014 at 8:15 Comment(4)
I don't understand your comment about powers of 2. If h and k are nonnegative integers, then h & k is a nonnegative integer that is at most h and at most k.Aboveground
@Aboveground That's technically not a safety condition, but it can produce terrible distributions. Consider the case with 17 buckets; you'll end up with everything going into either bucket 0 or (rarely) 16. The only case where h&(ary.length-1) works well is when the array's size is a power of two (>=1), and the compiler isn't provided an easy proof of this.Fortuneteller
I don't follow. If it's "technically not a safety condition", and the compiler's goal is merely "to safely eliminate that bounds check", then how is it relevant to the compiler? Why does the compiler need to be able to prove it?Aboveground
@ruakh: Agreed, the compiler shouldn't care. It can replace the bounds check with the table.length > 0 check and leave the worries about the distribution to the programmer.Willtrude
