Performance of Guava's ImmutableSet.contains
Asked Answered
K

1

16

Guava's ImmutableSet seems to perform quite poorly in my benchmark concerning contains. For some sizes it gets even much slower than List:

  size            benchmark         ns linear runtime
100000         ListContains  110279.54 ==
100000          SetContains       7.15 =
100000 ImmutableSetContains   76716.47 =
200000         ListContains  275367.66 =====
200000          SetContains       7.34 =
200000 ImmutableSetContains  322185.50 ======
500000         ListContains  935210.10 ====================
500000          SetContains       7.79 =
500000 ImmutableSetContains 1382765.76 ==============================

Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.

I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.


The solution was to change the smearing function by replacing

hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);

by

return C2 * Integer.rotateLeft(hashCode * C1, 15);

This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.

Kelwin answered 24/9, 2012 at 21:53 Comment(4)
Yep, you hit a strongly degenerate case. But this probably shouldn't be on SO; it should be an issue report.Norine
I see it now... I had to write the question first, but it's actually obvious.Kelwin
@Louis Wasserman: I was nearly sure I was messing something up (other strange things happened); that's why I wanted to ask here first. I think I've found a nice solution.Kelwin
Fixed in Guava 14: code.google.com/p/guava-libraries/issues/…Tana
K
9

The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1} for some N, which is a power of two. And so do the hashes, because of how Integer.hashCode is defined. The hashes get processed via a function smear identical to this one in order to minimize collision in the lowest bits in some common cases.

A table of size 2*N gets allocated and the members of M get put into it. As smear involves only right shifts, it maps M onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.

When contains(o) gets invoked, the search starts in a slot which position is determined by o.hashCode(). If o gets found, the result is true, if an empty slot gets hit, the result is false. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.

When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N steps. Starting at a random position means N/4 steps on the average.

What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.

Kelwin answered 25/9, 2012 at 18:14 Comment(3)
Yep, this basically covers it. And even if we use e.g. quadratic probing, it's still almost as bad. I guess the real issue is that smear should do more smearing?Norine
@Louis Wasserman: Quadratic probing would come to the unused half in time proportional to sqrt(N) rather then N, but it's still too far from 1 and quite inefficient because of cache misses. However, I'd suggest to switch to some combination of linear and some stronger probing, as linear probing can fail too easily. And yes, smearing can be extended so the above problem goes away and the property promised still holds. Should I file a bug and explain it there?Kelwin
That's what I thought before I benchmarked and got hard numbers. Quadratic probing might help theoretically, but not in reality, from what I can tell. But yes, file a bug.Norine

© 2022 - 2024 — McMap. All rights reserved.