I wrote a simple benchmark in order to find out whether the bounds check can be eliminated when the array index gets computed via a bitwise and. This is basically what nearly all hash tables do: they compute h & (table.length - 1) as an index into the table, where h is the hashCode or a derived value. The results show that the bounds check doesn't get eliminated.
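For context, here's a minimal sketch of that indexing scheme (class and method names are my own; the table length is assumed to be a power of two, as in typical hash tables):

```java
public class MaskIndexDemo {
    // Map an arbitrary hash code to an index of a power-of-two-sized table.
    static int index(int h, int tableLength) {
        // Always lands in [0, tableLength) because tableLength - 1
        // is a mask of low one-bits when tableLength is a power of two.
        return h & (tableLength - 1);
    }

    public static void main(String[] args) {
        int length = 16; // power of two
        for (int h : new int[] { 0, 1, 12345, -1, Integer.MIN_VALUE }) {
            // e.g. -1 -> 15, Integer.MIN_VALUE -> 0
            System.out.println(h + " -> " + index(h, length));
        }
    }
}
```

Note that this works even for negative hash codes, which is one reason the mask is preferred over the remainder operator.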
The idea of my benchmark is pretty simple: compute two values i and j, where both are guaranteed to be valid array indexes. i is the loop counter; when it gets used as an array index, the bounds check gets eliminated. j gets computed as x & (table.length - 1), where x is some value changing on each iteration; when it gets used as an array index, the bounds check does not get eliminated.
The relevant part is as follows:
for (int i = 0; i <= table.length - 1; ++i) {
    x += result;
    final int j = x & (table.length - 1);
    result ^= i + table[j];
}
The other experiment uses result ^= table[i] + j; instead. The difference in timing is maybe 15% (pretty consistently across the different variants I've tried). My questions:
- Are there other possible reasons for this besides bounds check elimination?
- Is there some complicated reason I can't see why there's no bounds check elimination for j?
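For reference, here's a self-contained reduction of the two experiments. The class and constant names are my own, and the crude timing loop is only a sketch; the original figures were measured with a proper benchmark harness:

```java
import java.util.Random;

public class BoundsCheckBench {
    static final int SIZE = 1024;          // power of two, so the mask yields valid indexes
    static final int[] table = new int[SIZE];

    // "masked" variant: the index is x & (table.length - 1);
    // per the experiment, the bounds check is NOT eliminated here.
    static int timeMasked() {
        int result = 0, x = 0;
        for (int i = 0; i <= table.length - 1; ++i) {
            x += result;
            final int j = x & (table.length - 1);
            result ^= i + table[j];
        }
        return result;
    }

    // "normal" variant: the array is accessed via the loop counter i,
    // for which the bounds check IS eliminated.
    static int timeNormal() {
        int result = 0, x = 0;
        for (int i = 0; i <= table.length - 1; ++i) {
            x += result;
            final int j = x & (table.length - 1);
            result ^= table[i] + j;
        }
        return result;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int k = 0; k < SIZE; ++k) table[k] = rnd.nextInt();

        int sink = 0;
        // crude warmup so the JIT compiles both methods
        for (int rep = 0; rep < 20_000; ++rep) sink ^= timeMasked() ^ timeNormal();

        long t0 = System.nanoTime();
        for (int rep = 0; rep < 20_000; ++rep) sink ^= timeMasked();
        long t1 = System.nanoTime();
        for (int rep = 0; rep < 20_000; ++rep) sink ^= timeNormal();
        long t2 = System.nanoTime();

        System.out.println("masked: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("normal: " + (t2 - t1) / 1e6 + " ms");
        System.out.println("sink=" + sink); // keep the results alive
    }
}
```

Treat the printed numbers with the usual microbenchmarking skepticism; the point of the sketch is only the shape of the two loops.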
A summary of the answers
MarkoTopolnik's answer shows that it's all more complicated: the elimination of the bounds check is not guaranteed to be a win; in particular, on his computer the "normal" code is slower than "masked". I guess this is because the elimination allows some additional optimization which turns out to be actually detrimental in this case (given the complexity of current CPUs, the compiler hardly ever knows for sure).
leventov's answer shows clearly that the array bounds check gets done in "masked" and that its elimination makes the code as fast as "normal".
Donal Fellows points out that the masking doesn't work for a zero-length table, as x & (0 - 1) equals x. So the best the compiler can do is to replace the bounds check by a zero-length check. But this is IMHO still worth it, as the zero-length check can easily be moved out of the loop.
Proposed optimization
Because a[x & (a.length - 1)] throws if and only if a.length == 0, the compiler can do the following:
- For each array access, check if the index has been computed via a bitwise and.
- If so, check if either of the operands was computed as length minus one.
- If so, replace the bounds check by a zero-length check.
- Let the existing optimizations take care of it.
Such an optimization should be pretty simple and cheap, as it only looks at the parent nodes in the SSA graph. Unlike many complex optimizations, it can never be detrimental, as it only replaces one check by a slightly simpler one; so there's no problem even if it can't be moved out of the loop.
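In source-level terms, the proposed transformation would amount to something like the following (a hand-written illustration of the idea, not actual compiler output; note that the hoisted version throws before the loop rather than at the first access):

```java
public class HoistedCheckSketch {
    // What conceptually happens today: an implicit bounds check per access.
    static int sumMaskedChecked(int[] a, int iterations) {
        int result = 0, x = 0;
        for (int i = 0; i < iterations; ++i) {
            x += result;
            result ^= a[x & (a.length - 1)]; // bounds check on every access
        }
        return result;
    }

    // What the proposed optimization would allow: a single zero-length
    // check hoisted out of the loop, since for a.length > 0 the mask
    // x & (a.length - 1) is always in [0, a.length).
    static int sumMaskedHoisted(int[] a, int iterations) {
        if (a.length == 0) {
            throw new ArrayIndexOutOfBoundsException(0); // only failure mode
        }
        int result = 0, x = 0;
        for (int i = 0; i < iterations; ++i) {
            x += result;
            result ^= a[x & (a.length - 1)]; // provably in bounds
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = { 3, 1, 4, 1, 5, 9, 2, 6 };
        // Both variants compute the same value for a non-empty array.
        System.out.println(sumMaskedChecked(a, 100) == sumMaskedHoisted(a, 100)); // prints true
    }
}
```

Note the invariant the optimization relies on: for any a.length > 0, the mask a.length - 1 is a non-negative int, so x & (a.length - 1) is between 0 and a.length - 1 regardless of x; this holds even for non-power-of-two lengths.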
I'll post this to the hotspot-dev mailing list.
News
John Rose filed an RFE and there's already a "quick-and-dirty" patch.
Comments
- table[i] results in a sequential access pattern, whereas with table[j] it's more irregular. Just one or two cache misses could be enough to account for a 15% difference. – One
- & table.length - 1? – Endue
- PrintAssembly is pretty unreadable, but I can see cmp %r13d,%edx, jae ... present in timeMasked only. – Willtrude
- -XX:CompileCommand=print,*Benchmark.time*, apart from filtering out everything you're not interested in, gives a nicer printout (it doesn't show placeholders for actual register names, for one). – One
- I replaced x += i by x += 1, so that the access is sequential except for a single wrap-around, but not much changes. I also tried to eliminate x and set j = i & (table.length-1), which is equivalent to j = i, but that seems to prevent the bounds check elimination. – Willtrude
- x % (table.length-1) instead of x & (table.length-1)? Maybe the compiler is not smart enough to figure out the bounds of a bitwise and at compile time. – Flyfish
- With % you have no guarantee to get a valid index, as it stays negative for negative x. – Willtrude
- Did you swap ns/op with ops/ns, as I usually do? This is about the only explanation I can think of for the bounds-checking code being faster. If not, would you mind trying my third experiment pair (a simplified version with i==j, just don't tell the JVM :D)? – Willtrude
- time3Masked 805.872 ns/op; time3Normal 945.730 ns/op. – One
- Regarding the % suggestion: did you guarantee that the array length is a power of 2? Otherwise your & doesn't make much sense, and definitely isn't enough to make your accesses sequential. – Pentyl
- Concerning %: it's such a slow operation that hardly anybody uses it for hash tables. In order to ensure a valid index, you'd need to add some branching, and this would make the pattern hard for the compiler to recognize. Concerning &: I need a power of two for it to make sense, but I don't need it for the index to be valid (the only thing I need is a non-empty array). – Willtrude
- "hardly anybody uses it for hash tables" -- Trove does, I do. – Lussi
- & with some length like 100000 – Pentyl
- String.intern. Division is dead slow, with tens of cycles, while you can do up to four simple instructions in a single cycle. – Willtrude
- See WhitespaceMatcherBenchmark and its results for how costly division is. – Willtrude
- i but (probably) not for j (see also your first comment here). – Willtrude
- Your & operation would get a complicated mask with some zeros masking off parts of your dataset, so you'll get both an irregular stream and a subset of your entire dataset. – Pentyl
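The point about % is easy to reproduce: Java's remainder operator keeps the sign of the dividend, so a negative hash yields a negative "index", while the mask always stays in bounds (a small demo, names my own):

```java
public class RemainderSignDemo {
    public static void main(String[] args) {
        int length = 16;
        int h = -5;
        System.out.println(h % (length - 1)); // -5: remainder keeps the dividend's sign
        System.out.println(h & (length - 1)); // 11: mask always yields a valid index
        // A valid-index remainder needs an extra branch or Math.floorMod,
        // which is exactly the extra work that makes % unattractive here.
        System.out.println(Math.floorMod(h, length)); // 11
    }
}
```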