Why use a prime number in hashCode?

O

9

207

I was just wondering why primes are used in a class's hashCode() method. For example, when using Eclipse to generate my hashCode() method, the prime number 31 is always used:

public int hashCode() {
     final int prime = 31;
     //...
}
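For context, here is a sketch of what the elided body typically looks like in a generated method of this kind. The class `Person` and its fields `name` and `age` are hypothetical and only illustrate the multiply-and-add pattern; the exact output of the generator may differ between Eclipse versions.

```java
import java.util.Objects;

// Hypothetical example class; only the hashCode() pattern matters here.
final class Person {
    private final String name;
    private final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        // Each field is folded in as: result = prime * result + fieldHash
        result = prime * result + ((name == null) ? 0 : name.hashCode());
        result = prime * result + age;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Person)) return false;
        Person other = (Person) obj;
        // Must agree with hashCode(): equal objects produce equal hashes.
        return age == other.age && Objects.equals(name, other.name);
    }
}
```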

References:

Here is a good primer on hash codes and an article on how hashing works that I found (C#, but the concepts are transferable): Eric Lippert's Guidelines and rules for GetHashCode()

Om answered 31/8, 2010 at 20:46 Comment(3)

related: Why does Java's hashCode() in String use 31 as a multiplier? – Lasonyalasorella 2/9, 2010 at 1:50

This is more or less a duplicate of question #1145717 . – Pravit 16/8, 2012 at 7:35

Please check my answer at #1145717 It is related to the properties of polynomials over a field (not a ring!), hence prime numbers. – Clemons 26/11, 2013 at 1:8

J

113

Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations, that is, to share no prime factor.

Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
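A small sketch of the 8-bucket scenario, assuming a two-field hash of the form `multiplier * first + last` and a bucket index computed as the hash modulo the bucket count. The class and method names are made up for illustration.

```java
// Hypothetical demo: how the multiplier interacts with the bucket count.
final class MultiplierBucketDemo {
    // Combines two int fields the way a multiply-and-add hashCode() does.
    static int combine(int multiplier, int first, int last) {
        return multiplier * first + last;
    }

    // Maps a hash to one of `buckets` slots; floorMod keeps the index non-negative.
    static int bucket(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        int buckets = 8;
        int last = 5;
        for (int first = 0; first < 4; first++) {
            // With multiplier 8 the `first` field never influences the bucket.
            System.out.printf("first=%d last=%d  x8 -> bucket %d,  x31 -> bucket %d%n",
                    first, last,
                    bucket(combine(8, first, last), buckets),
                    bucket(combine(31, first, last), buckets));
        }
    }
}
```

With a multiplier of 8, every entry lands in bucket 5 regardless of `first`; with 31, the four entries spread over buckets 5, 4, 3 and 2.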

31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets at a power of 2).

Jagir answered 31/8, 2010 at 21:30 Comment(9)

What if there's 31 or 62 buckets (or some multiple of 31) then? – Loma 31/8, 2010 at 21:37

Then a hash function that multiplies by 31 will perform non-optimally. However, I would consider such a hash table implementation poorly designed, given how common 31 as a multiplier is. – Jagir 31/8, 2010 at 21:42

So 31 is chosen based on the assumption that hash table implementors know that 31 is commonly used in hash codes? – Loma 31/8, 2010 at 21:50

31 is chosen based on the idea that most implementations have factorizations of relatively small primes. 2s, 3s and 5s usually. It may start at 10 and grow 3X when it gets too full. The size is rarely entirely random. And even if it were, 30/31 are not bad odds for having well synced hash algorithms. It may also be easy to calculate as others have stated. – Jagir 31/8, 2010 at 21:55

In other words... we need to know something about the set of input values and the set's regularities in order to write a function designed to strip them of those regularities, so the values in the set don't collide in the same hash buckets. Multiplying/dividing/moduloing by a prime number achieves that effect, because if you have a loop with X items and you jump Y spaces at a time, you only visit every spot before returning to the start when X and Y share no common factor. Since X is often an even number or a power of 2, you want Y to be a prime that does not divide X, so 31 yay! :/ – Offwhite 16/12, 2010 at 14:58

31 is IMHO quite a bad choice, since it allows hash code collisions between strings as short as "Ca" and "DB". See here for a discussion of better choices: https://mcmap.net/q/30342/-what-is-a-sensible-prime-for-hashcode-calculation – Pravit 16/8, 2012 at 7:33

"Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all)." Can you give an example? – Nonsuch 4/10, 2017 at 3:13

@FrankQ. It is the nature of modular arithmetic. (x*8 + y) % 8 = (x*8) % 8 + y % 8 = 0 + y % 8 = y % 8 – Jagir 4/10, 2017 at 16:36

@SteveKuo "What if there's 31 or 62 buckets (or some multiple of 31) then?" You always use a prime bigger than the number of buckets. It is not like we do not know of 10-digit prime numbers. Cryptography needs big prime numbers anyway, so they always look for "the next big thing". – Xylotomy 6/2, 2018 at 11:30
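The "Ca" / "DB" collision mentioned in the comments can be checked with a few lines. This sketch re-implements the multiply-by-31 string hash in a hypothetical helper `hash31` and compares it with the built-in `String.hashCode()`, which is specified to use the same polynomial.

```java
// Hypothetical demo of a two-character collision under the 31 multiplier.
final class ShortCollisionDemo {
    // Polynomial hash with multiplier 31: s[0]*31^(n-1) + ... + s[n-1]
    static int hash31(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'D' - 'C' == 1 and 'a' - 'B' == 31, so the two differences cancel out.
        System.out.println(hash31("Ca"));                      // 2174
        System.out.println(hash31("DB"));                      // 2174
        System.out.println("Ca".hashCode() == "DB".hashCode()); // true
    }
}
```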

G

158

Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.

This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:

Input       Modulo 8    Modulo 7
0           0           0
4           4           4
8           0           1
12          4           5
16          0           2
20          4           6
24          0           3
28          4           0

Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.

Although the above example is largely contrived, the general principle is that when dealing with a pattern of inputs, using a prime number modulus will yield the best distribution.
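The table can be reproduced with a short sketch that counts how many of the inputs 0, 4, ..., 28 fall into each bucket for a given modulus. The class and method names are illustrative only.

```java
import java.util.Arrays;

// Hypothetical demo: bucket occupancy for inputs that are all multiples of 4.
final class ModulusDistributionDemo {
    // Returns counts[b] = number of inputs that land in bucket b.
    static int[] bucketCounts(int[] inputs, int modulus) {
        int[] counts = new int[modulus];
        for (int x : inputs) {
            counts[Math.floorMod(x, modulus)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] inputs = {0, 4, 8, 12, 16, 20, 24, 28};
        // Modulo 8: only buckets 0 and 4 are ever used.
        System.out.println(Arrays.toString(bucketCounts(inputs, 8))); // [4, 0, 0, 0, 4, 0, 0, 0]
        // Modulo 7: every bucket is used, and only bucket 0 holds two inputs.
        System.out.println(Arrays.toString(bucketCounts(inputs, 7))); // [2, 1, 1, 1, 1, 1, 1]
    }
}
```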

Gherkin answered 31/8, 2010 at 21:38 Comment(4)

Aren't we talking about the multiplier used to generate the hash code, not the modulo used to sort those hash codes into buckets? – Jagir 31/8, 2010 at 21:50

Same principle. In terms of I/O, the hash feeds into the hash table's modulo operation. I think the point was that if you multiply by primes, you'll get more randomly distributed inputs to the point where the modulo won't even matter. Since the hash function picks up the slack of distributing the inputs better, making them less regular, they are less likely to collide, regardless of the modulo used to place them into a bucket. – Offwhite 16/12, 2010 at 14:43

This kind of answer is very useful because it's like teaching someone how to fish, rather than catching one for them. It helps people see and understand the underlying principle behind using primes for hashes... which is to distribute inputs irregularly so they fall uniformly into buckets once moduloed :). – Offwhite 16/12, 2010 at 14:44

This should be the answer. And the follow-up questions in the above comments are excellent too (on why it doesn't make much of a difference whether the prime is the multiplier or the modulus). – Indomitable 8/9, 2020 at 5:26

E

35

For what it's worth, Effective Java 2nd Edition hand-waves around the mathematics issue and just says that the reason to choose 31 is:

Because it's an odd prime, and it's "traditional" to use primes
It's also one less than a power of two, which permits bitwise optimization

Here's the full quote, from Item 9: Always override hashCode when you override equals:

The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.

A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
 31 * i == (i << 5) - i
Modern VMs do this sort of optimization automatically.

While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.

Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.

Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.
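Two of the claims in the quote above can be checked directly: the shift-and-subtract identity for 31, and the loss of information with an even multiplier when multiplication overflows. The following is a minimal sketch with made-up class and method names, relying only on Java's wrap-around `int` arithmetic.

```java
// Hypothetical demo of the properties of 31 described in the quote.
final class Multiplier31Demo {
    // Shift-and-subtract form of 31 * i; int overflow wraps the same way in both forms.
    static int timesThirtyOne(int i) {
        return (i << 5) - i;
    }

    // True if two different inputs produce the same product for this multiplier.
    static boolean collide(int multiplier, int a, int b) {
        return a != b && multiplier * a == multiplier * b;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + ": " + (31 * i == timesThirtyOne(i))); // true for each sample
        }

        // a and b differ only in the top bit, which an even multiplier shifts out.
        int a = 1;
        int b = 1 | Integer.MIN_VALUE;
        System.out.println(collide(2, a, b));  // true: the top bit is lost
        System.out.println(collide(31, a, b)); // false: an odd multiplier keeps it
    }
}
```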
