What property of the bit pattern causes collisions?
I was reading about Java's approach to randomizing hash keys here.
Apparently the idea is to make sure that the lower bits are "random" to help the distribution, but I am trying to understand this better.
If we have a table of size 10, then the numbers 0, 10, 20, 30, 40, etc. all fall in bucket 0, the numbers 1, 11, 21, 31, etc. fall in bucket 1, and so on (using modulo 10).
By changing the bit patterns you could make these go to different buckets instead of all landing in bucket 0.
But what I am not clear about is what property of the low-order bits causes this, so that we need to randomize them. We have:

0000 0000 (0)  
0000 1010 (10)  
0001 0100 (20) 
0001 1110 (30)  
0010 1000 (40) 

What is the regularity in the low-order bits that makes these numbers land in the same slot?
Or perhaps I am confused about the following: my understanding is that some regularity in the low-order bits causes collisions, and that we randomize the bits to compensate.
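
To make the arithmetic concrete, here is a tiny sketch of the modulo-10 bucketing described above (my own illustration, not from the article):

// Illustrative only: multiples of 10 all collide in bucket 0 when the
// compression function is "key % 10".
public class ModuloDemo {
    public static void main(String[] args) {
        for (int key : new int[] {0, 10, 20, 30, 40, 1, 11, 21, 31}) {
            System.out.println(key + " -> bucket " + (key % 10));
        }
    }
}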

Alb asked 4/6, 2015 at 18:34 Comment(0)
Some hash functions do a really bad job of randomizing low order bits.

One classic case is the use of hardware addresses as a hash for object references ("pointers" in C), which would otherwise be a reasonable way of cheaply getting a unique number for an object ID. This would work fine if the hash table's bucket count were a prime number, but for hash implementations where the number of buckets is always a power of 2, the fact that all the hashes are divisible by 8 would mean that most buckets were empty.

That's an extreme case, but any time that the data to be hashed is not uniformly distributed and the hash function tends to preserve low-order bits, you'll find some bias in the bucket assignments.
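
A rough sketch of that extreme case (hypothetical 8-byte-aligned addresses and a 16-bucket power-of-two table; the numbers are mine, not from the answer):

// Hypothetical object "addresses", all divisible by 8 due to alignment.
public class AlignmentDemo {
    public static void main(String[] args) {
        int buckets = 16;                      // power of two
        boolean[] used = new boolean[buckets];
        for (int addr = 0; addr < 1024; addr += 8) {
            used[addr % buckets] = true;       // bucket = address mod 16
        }
        int count = 0;
        for (boolean u : used) if (u) count++;
        // Only buckets 0 and 8 ever receive an entry: 2 of 16 used.
        System.out.println(count + " of " + buckets + " buckets used");
    }
}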

Kymberlykymograph answered 4/6, 2015 at 19:04 Comment(2)
I am not clear on this: "...but for hash implementations where the number of buckets is always a power of 2, the fact that all the hashes are divisible by 8 would mean that most buckets were empty." What is 8? The size of the address? And why does it happen for sizes that are powers of 2? Could you please elaborate a bit on this? - Alb
@Alb: 8 is an example of typical hardware alignment: almost all objects have addresses divisible by 8, because the CPU can read eight aligned bytes in a single access (whereas if the object were split over the boundary, it would take two memory accesses). And if you reduce a number divisible by eight modulo some power of 2, you end up with a value that is still divisible by eight, so seven out of every eight buckets would be unused. - Kymberlykymograph

Java's HashMap uses hash table sizes that are powers of two. If you use the remainder/modulo operation as the compression function, as is usual, you end up taking the lowest bits of the hash code as the bucket index. If the hash codes happen to be multiples of a power of two, some of the lowest bits will always be zero, and you end up using only a fraction of the available buckets.

Concrete example: Suppose you have 32 buckets and the hash codes are multiples of 8. The table uses only the 5 least significant bits of the code, and 3 of them are always 0. Therefore only 2 bits determine the bucket, and you use only 4 of the 32 buckets:

XXXXXX00000 -> bucket 0
XXXXXX01000 -> bucket 8
XXXXXX10000 -> bucket 16
XXXXXX11000 -> bucket 24
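
A quick way to check this (a sketch of my own, not from the answer) is to reduce multiples of 8 modulo 32 and collect the distinct bucket indices:

// Multiples of 8 reduced modulo 32 hit only four distinct buckets.
public class LowBitsDemo {
    public static void main(String[] args) {
        java.util.Set<Integer> seen = new java.util.TreeSet<>();
        for (int h = 0; h < 256; h += 8) {
            seen.add(h & 31);  // same as h % 32 for non-negative h
        }
        System.out.println("buckets hit: " + seen);  // [0, 8, 16, 24]
    }
}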

Fortunately things are not this bad in Java because HashMap doesn't use just the lowest bits of the hash code: it scrambles the bits so that it's not quite as easy to produce bad scenarios accidentally. Here's an excerpt from OpenJDK's HashMap implementation:

/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
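
To see what the scrambling buys, here is a small test harness (my own, not part of the JDK) that feeds hash codes divisible by 8 through the function above and counts how many of 32 buckets get used, compared to the unscrambled codes:

public class ScrambleDemo {
    // Copy of the supplemental hash function quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        java.util.Set<Integer> raw = new java.util.TreeSet<>();
        java.util.Set<Integer> scrambled = new java.util.TreeSet<>();
        for (int h = 8; h <= 4096; h += 8) {   // hash codes divisible by 8
            raw.add(h & 31);                   // plain low-bit bucketing
            scrambled.add(hash(h) & 31);       // after bit scrambling
        }
        // The raw codes land in only 4 buckets; the scrambled codes in far more.
        System.out.println("raw: " + raw.size()
                + " buckets, scrambled: " + scrambled.size() + " buckets");
    }
}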
Fluoride answered 4/6, 2015 at 19:28 Comment(4)
I don't clearly understand this statement: "Java's HashMap uses hash table sizes that are powers of two", which means that it basically takes the lowest bits of the hash code as the bucket index. Could you please elaborate a bit on this? - Alb
I've expanded that a bit. Usually you have more hash codes than buckets, so you use a "compression function" to map hash codes to buckets. The common choice of compression function is to compute the remainder of the code when divided by the number of buckets. If the number of buckets is 2^N, the result is the lowest N bits of the hash code. - Fluoride
Thank you for your update. This is a problem when using a power of 2, right? So a prime size gives a better distribution, but growing to the next prime size is a problem because it is slower? - Alb
Prime sizes do have that problem, and they also require using the remainder operation to compute the bucket index, which is slower than the bitwise AND you would use with powers of two. The hash table in the C# standard library uses prime sizes, though. - Fluoride
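
For reference, a minimal sketch of the remainder/AND equivalence mentioned above (it holds for non-negative hashes and power-of-two sizes):

public class MaskDemo {
    public static void main(String[] args) {
        int size = 16;                       // a power of two
        for (int h = 0; h < 1000; h++) {
            // For non-negative h these agree, but the AND avoids a
            // (slower) division instruction.
            if ((h % size) != (h & (size - 1)))
                throw new AssertionError("mismatch at " + h);
        }
        System.out.println("h % size == (h & (size - 1)) for all tested h");
    }
}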
