Why isn't randomized probing more popular in hash table implementations?

According to various sources, such as Wikipedia and various .edu websites found by Google, the most common ways for a hash table to resolve collisions are linear or quadratic probing and chaining. Randomized probing is briefly mentioned but not given much attention. I've implemented a hash table that uses randomized probing to resolve collisions. Assuming there is a collision, resolution works as follows:

  1. The full (32-bit) hash of an object is used to seed a linear congruential random number generator.
  2. The generator generates 32-bit numbers and the modulus is taken to determine where in the hash table to probe next.
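The two steps above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the LCG constants are the common Numerical Recipes parameters, and the prime table size is an arbitrary choice.

```python
# Sketch of randomized probing: an LCG seeded by the full 32-bit hash
# generates the probe sequence. Assumes the table is never completely
# full, so every probe chain eventually reaches an empty slot.

def probe_sequence(full_hash, table_size):
    """Yield a deterministic, pseudo-random sequence of slot indices."""
    state = full_hash & 0xFFFFFFFF              # step 1: seed with the full hash
    while True:
        yield state % table_size                # step 2: reduce into table space
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF  # 32-bit LCG step

def insert(table, full_hash, value):
    for slot in probe_sequence(full_hash, len(table)):
        if table[slot] is None:
            table[slot] = (full_hash, value)
            return

def lookup(table, full_hash):
    for slot in probe_sequence(full_hash, len(table)):
        entry = table[slot]
        if entry is None:
            return None                         # empty slot ends the probe chain
        if entry[0] == full_hash:
            return entry[1]
```

Note that two keys colliding in modulus space (same slot after the reduction) still follow different probe sequences as long as their full 32-bit hashes differ, which is the property the question relies on.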

This has the very nice property that, regardless of how many hash collisions there are in modulus space, lookup and insertion times are expected to be O(1) as long as there are few collisions in full 32-bit hash space. Because the probe sequence is pseudo-random, no clustering behavior results from modulus space collisions, unlike with linear probing. Because the entire system is open-addressed and doesn't use linked lists anywhere, you don't need to perform a memory allocation on each insertion, unlike chaining.

Furthermore, because the size of the hash is usually the size of the address space (32 bits on 32-bit machines), it is simply impossible to fit enough items in address space to cause large numbers of hash collisions in full 32-bit hash space under a good hashing scheme.

Why, then, is randomized probing such an unpopular collision resolution strategy?

Teferi answered 10/11, 2009 at 18:14 Comment(0)

Python's dictionary implementation does this. A very nice comment in dictobject.c says:

...
The first half of collision resolution is to visit table indices via this
recurrence:

    j = ((5*j) + 1) mod 2**i

For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).
...

Sure looks like a linear congruential RNG to me!
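The full-period claim is also easy to check empirically. (It follows from the Hull–Dobell theorem: the increment 1 is coprime to the modulus, and a − 1 = 4 is divisible by 2 and by 4, the requirements for a power-of-two modulus.)

```python
# Empirical check that j = (5*j + 1) mod 2**i visits every index
# exactly once before repeating, shown here for i = 5 (32 slots).
i = 5
size = 2 ** i
j, visited = 0, []
for _ in range(size):
    visited.append(j)
    j = (5 * j + 1) % size
```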

Note that the full state of such an RNG is only i bits--has to be, to avoid revisiting entries--so you can't meaningfully use "[t]he full (32-bit) hash of an object" to seed the RNG. Python initially seeds j with i bits from the hash. If there is another collision, it grabs another 5 bits from the hash and throws those into the mix. (Read the rest of that comment, particularly where it talks about PERTURB_SHIFT.) It continues that way, adding more bits with each collision, until it has used up the whole hash code. This way Python uses a decent amount of whatever randomness the hash code offers, and the code is simple and fast.
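Based on the recurrence that comment gives, the probe loop can be sketched like this (PERTURB_SHIFT is 5 in CPython; wrapping it in a generator is my own framing):

```python
PERTURB_SHIFT = 5   # the value CPython uses

def probe_indices(hash_value, mask):
    """Yield the probe sequence described in the dictobject.c comment:
    start at the hash's low bits, then fold PERTURB_SHIFT more hash
    bits into the mix at each collision. Once the hash is used up
    (perturb reaches 0), the pure (5*j + 1) recurrence takes over and
    sweeps every slot."""
    perturb = hash_value
    j = hash_value & mask
    while True:
        yield j
        j = (5 * j + perturb + 1) & mask
        perturb >>= PERTURB_SHIFT
```

Here `mask` is `table_size - 1` for a power-of-two table, so `& mask` plays the role of `mod 2**i`.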

This is some of the finest code I've ever read. It's featured in chapter 18 of Beautiful Code. So I'd say you're on to something!

Walling answered 19/11, 2009 at 22:16 Comment(1)
Interesting. My implementation simply doesn't guarantee that it will never visit the same slot more than once. I ran some Monte Carlo simulations and concluded that, in practice, this happens too infrequently to worry about. Even if you allow visiting the same slot more than once, you still get expected O(1) lookup time. – Teferi

One of the reasons for using linear probing (or double hashing with a small second hash) is cache locality. If the step added by the second (rehash) function is a small integer, there is a good chance the next probe lands in the same cache line. This is very significant for large hash tables.
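A rough illustration of the locality point, assuming 8-byte slots and 64-byte cache lines (typical but assumed numbers, so 8 slots fit per line):

```python
SLOTS_PER_LINE = 64 // 8   # 64-byte cache lines, 8-byte slots (assumed)

def cache_lines_touched(slots):
    """Count the distinct cache lines a probe sequence touches."""
    return len({s // SLOTS_PER_LINE for s in slots})

linear_probes = [100, 101, 102, 103]   # linear probing: step of +1
random_probes = [100, 733, 41, 512]    # scattered pseudo-random probes
```

Four linear probes typically touch one or two cache lines; four random probes almost always touch four.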

Chained hashing is probably popular mainly because of its simplicity.

Frants answered 10/11, 2009 at 18:32 Comment(0)

Possible reasons are that linear or quadratic probing

  • have the same worst-case time complexity (O(size of the table))
  • have the same best-case time complexity (O(1))
  • are easier to implement
  • are faster than a good RNG (since speed is a major selling point for hashtables)

But I'm not sure. Did you implement your own hash table with a different collision resolution strategy and compare the two under various circumstances? That would be very enlightening.

Gentianaceous answered 10/11, 2009 at 18:30 Comment(0)

Wouldn't you have the problem that, for insertions into a densely populated table, there's no guarantee the probe sequence will visit every slot before it starts revisiting slots it has already probed?

As a result, insertion time wouldn't be well defined.

Agora answered 10/11, 2009 at 18:23 Comment(2)
Given a good RNG, it should visit all buckets equally often. Insertion time isn't well defined for linear and quadratic probing either. – Gentianaceous
You can bound the probability of that happening :) – Frants

I think the reason random hashing isn't used much is that collisions between full 32-bit hashes are apt to be rare unless there's something "wrong" with the hash function, and in that case there's a fair likelihood that all 32 bits will match anyway (e.g. because only part of the key was used in computing the hash). If hash functions are decent and load factors are reasonably low, linear and quadratic probing offer good cache locality; remember that the majority of collisions are resolved by looking at only one extra item, which with both linear and quadratic probing is the slot immediately after the first guess. Linear probing also degrades somewhat more gracefully in the case where all keys map to the same value, and sometimes even when they map to a small number of values. Chained bucket hashing allows easy item removal.

Fenton answered 7/1, 2011 at 23:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.