What is a sensible prime for hashcode calculation?

Eclipse 3.5 has a very nice feature to generate Java hashCode() functions. It would generate, for example, the following (slightly shortened):

class HashTest {
    int i;
    int j;

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = prime + i;      // fold in the first field
        result = prime * result + j; // repeated for each further field
        return result;
    }
}

(If the class has more attributes, result = prime * result + attribute.hashCode(); is repeated for each additional attribute; for int fields the .hashCode() call is omitted.)
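For example, for an additional String field name (a hypothetical field name), the generated line would look roughly like this, including the null check that Eclipse emits for object attributes:

result = prime * result + ((name == null) ? 0 : name.hashCode());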

This seems fine except for the choice of 31 as the prime. It is probably taken from the hashCode implementation of Java's String, where it was chosen for performance reasons that are long gone since the introduction of hardware multipliers. With 31 you get many hashcode collisions for small values of i and j: for example, (0,0) and (-1,31) have the same value. I think that is a Bad Thing(TM), since small values occur often. For String.hashCode you'll also find many short strings with the same hashcode, for instance "Ca" and "DB". If you take a large prime, this problem disappears, provided you choose the prime right.
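Both collisions are easy to verify; here is a minimal check (the class and helper names are made up; the hash is the one computed by HashTest above):

public class CollisionCheck {
    // Same computation as HashTest.hashCode(): 31 * (31 + i) + j
    static int hash(int i, int j) {
        return 31 * (31 + i) + j;
    }

    public static void main(String[] args) {
        System.out.println(hash(0, 0) == hash(-1, 31));          // true: both are 961
        System.out.println("Ca".hashCode() == "DB".hashCode());  // true: both are 2174
    }
}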

So my question: what is a good prime to choose? What criteria do you apply to find it?

This is meant as a general question - so I do not want to give a range for i and j. But I suppose in most applications relatively small values occur more often than large ones. (If you have large values, the choice of the prime is probably unimportant.) It might not make much of a difference, but a better choice is an easy and obvious way to improve this - so why not do it? Commons Lang's HashCodeBuilder also suggests curiously small values.

(Clarification: this is not a duplicate of Why does Java's hashCode() in String use 31 as a multiplier? since my question is not concerned with the history of the 31 in the JDK, but with what would be a better value in new code using the same basic template. None of the answers there tries to answer that.)

Verger answered 2/12, 2009 at 21:35 Comment(7)
31 is still good as it doesn't necessarily involve loading a constant. On an ARM processor (at least one used by around 99.9997% of mobile phones) *31 can be done in a single instruction. In reality, any odd number, whether prime or not, is good enough.Unthrone
I was thinking of desktop programs, where it does not matter whether you choose 31 or 1327144003. Curiously enough, on my machine multiplying with 31 is actually a little slower - probably an optimization gone wrong. 8-)Whittle
Primes of the form p = 2^n - 1 lend themselves to the optimization x * p = (x << n) - x, which the compiler typically does. From Joshua Bloch, Effective Java, Chapter 3, Item 9. SO question #299804Hallsy
Also, multiplying by an integer < 128 gets an extra boost in the JVM. 2^n - 1, prime, smallish: that gives 31.Cascara
@Hallsy As I said, for current desktop machines this doesn't seem to be an optimization anymore - the time is the same. Even worse: on my machine multiplying with 31 was a little slower - maybe the JVM tried to "optimize" it by calculating (x << 5) - x, and this is actually slower than using the hardware multiplier.Whittle
@Dr.Hans-PeterStörr On x86, there's a difference, as there's an instruction form with a single-byte immediate operand. You get a shorter instruction, and in a benchmark I wrote years ago it was slightly faster.Vaginitis
@MarkRotteveel Please notice that this is quite different from Why does Java's hashCode() in String use 31 as a multiplier? (SO question #299804), since this is not about the history of 31, but about what would be a better choice instead of 31, without using additional libraries or entirely different methods of calculating hashes. None of the answers there addresses that.Whittle

I recommend using 92821. Here's why.

To give a meaningful answer to this you have to know something about the possible values of i and j. The only thing I can think of in general is that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than those of, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this is rather bad - already for i=-1 and j=31 you have the same hash value as for i=0 and j=0.

Since this is interesting, I've written a little program that searched the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i) + Math.abs(j) over all values of i, j that have the same hashcode as (0,0), and then took the prime for which this minimum value is as large as possible.
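The original program is not shown here, but a minimal sketch of such a search might look as follows. It relies on the observation that under the template hash = prime * (prime + i) + j, a pair (i, j) collides with (0, 0) exactly when prime * i + j wraps to 0 mod 2^32, so for each i the colliding j is simply -prime * i in int arithmetic (the class and method names are made up):

public class PrimeSearch {
    // Smallest Math.abs(i) + Math.abs(j) over all (i, j) != (0, 0)
    // that collide with (0, 0) under hash = prime * (prime + i) + j.
    static long smallestCollision(int prime) {
        long best = Long.MAX_VALUE;
        for (int i = 1; i < best; i++) {   // the sum is >= i, so larger i cannot win
            int j = -prime * i;            // the unique colliding j, wrapping mod 2^32
            long cost = i + Math.abs((long) j);
            if (cost < best) best = cost;
        }
        return best;                       // by symmetry, scanning positive i suffices
    }

    public static void main(String[] args) {
        System.out.println(smallestCollision(31));     // 32, from (-1, 31)
        System.out.println(smallestCollision(92821));  // 92288, from (-46272, 46016)
    }
}

Replacing the cost by Math.hypot(i, j) gives the Euclidean variant discussed below.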

Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good and much easier to remember is 92821 with the smallest collision being i=-46272 and j=46016.

If you give "small" another meaning and want the minimum of Math.sqrt(i*i + j*j) over all collisions to be as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272, 46016) is again almost as good as the best value.

I do acknowledge that it is quite debatable whether these calculations make much sense in practice. But I do think that taking 92821 as the prime makes much more sense than 31, unless you have good reasons not to.

Verger answered 12/5, 2010 at 7:26 Comment(10)
You're looking for a magic number for a perfect hash, or a nearly perfect one at any rate. I'd be more interested in seeing a solution for arbitrary inputs up to the hash size (e.g., four 2-byte values in an 8-byte hashcode) than this particular case of simple transposition.Temperamental
8 byte hashcode? At least in Java it is 4 bytes. Anyway: you could just continue the scheme used in the Eclipse hashCode generation: result = prime * result + i; result = prime * result + j; and so forth. For this, 92821 is probably a good choice of prime - at least much better than the Eclipse default 31.Whittle
Not only is using a small constant wrong, re-using it is also wrong, as you get collisions like newArrayList("a", "bc").hashCode() == newArrayList("ab", "c").hashCode() (my example may not work, but something similar does).Vaginitis
@Vaginitis You are right that there are many much better hash algorithms out there. I was just trying to point out an easy but worthwhile improvement to an often used simple algorithm. If you want really good properties, there are libraries for that which are much better, but that's often overkill.Whittle
@Dr.Hans-PeterStörr What I mean: Even if we call n a good constant, using it everywhere can't be right, as it produces needless collisions in a systematic way. Using one constant for strings and a different one for lists is IMHO better.Vaginitis
@Vaginitis - Hmm, interesting hypothesis. However, my instinct is that the nature of the distribution for different types is so different, that the issue you mention isn't what dominates. I'm more concerned by what I thought you were bringing up: is it dubious to repeat use of the same prime, for each additional input value? I'm thinking one should use different primes at each multiply-add step.Aleksandr
@Aleksandr Primes aren't actually any better than other odd numbers (unless you use both small keys and a prime-sized hash table, which you most probably don't). If you want to use more constants, then you may also replace the multiply-add step by sum of multiples and get a factor of nearly three speedup as a bonus. If I was to write a generator, I'd derive the multipliers from both the class name and field name (or index). The problem is how to detect a good multiplier. It must be odd, big and have funny bit patterns. ;)Vaginitis
@Vaginitis - it would surprise me if using different constants for different classes had a significant (> 10%) benefit, except in exceedingly rare situations. But if you were going to do that, don't try to "detect" a good multiplier: just make a table of 256 of them, and XOR the bytes of the hash of the class name to get an index.Aleksandr
@Aleksandr I also doubt 10% is doable. For an application, it's probably not worth the effort. If we could redesign the whole Java hashing, then 10% might be achievable (avoiding stupid collisions like the hashCode being zero for any new Map.Entry with equal key and value, etc.), while even 0.1% would probably be a worthwhile improvement.Vaginitis
@Aleksandr The idea of using multiple primes is surely a good one that costs nothing, though I'm a bit at a loss how to tell which ones are good. But perhaps any with more than 16 bits will do just fine. By the way, since you care about this topic: could you perhaps vote to reopen the question? Someone wrongly marked it as a duplicate, needlessly preventing any new interesting answers, if there are some.Whittle

Actually, if you take a prime so large that it comes close to INT_MAX, you have the same problem because of modulo arithmetic. If you expect to hash mostly strings of length 2, a prime near the square root of INT_MAX might be best, since then prime * i + j does not wrap for small values. If the strings you hash are longer, it doesn't matter so much, and collisions are unavoidable anyway...
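To illustrate the wrap-around: even the largest int prime, 2147483647 = 2^31 - 1, produces a tiny collision under the HashTest-style hash from the question, because multiplying it by 2 wraps to -2 mod 2^32 (the class name is made up):

public class WrapDemo {
    static int hash(int prime, int i, int j) {
        return prime * (prime + i) + j;
    }

    public static void main(String[] args) {
        int p = 2147483647;  // largest int prime, 2^31 - 1
        // 2 * p wraps to -2, so prime * i + j == 0 for (i, j) = (2, 2):
        System.out.println(hash(p, 0, 0) == hash(p, 2, 2));  // true
    }
}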

Nubilous answered 2/12, 2009 at 21:54 Comment(1)
Right, the modulo arithmetic makes the problem difficult and interesting. I think I'll write a little program to search for a good solution. :-)Whittle

Collisions may not be such a big issue... The primary goal of the hash is to avoid calling equals for 1:1 comparisons. If you have an implementation where equals is "generally" extremely cheap for objects that have colliding hashes, then this is not an issue (at all).

In the end, what is the best way of hashing depends on what you are comparing. In the case of an int pair (as in your example), basic bitwise operators such as & or ^ could be sufficient, as in the sketch below.
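A sketch of the XOR variant for the two-int case (note the weakness pointed out in the comments below: it returns 0 whenever i == j):

@Override
public int hashCode() {
    return i ^ j;   // cheap, but collides for every pair with i == j
}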

Blurt answered 2/12, 2009 at 23:20 Comment(3)
Of course it does not matter much, but changing the prime is an obvious and easy way to improve things. So why not do it?Whittle
Agreed. I primarily meant to put a bit of emphasis on the fact that using primes is not the only way of doing things, as the question ultimately has a very "generic" scope.Blurt
BTW: Using & would be very bad, since that tends to decrease the number of set bits after each step. Using ^ is better but, as someone pointed out, i ^ j would be 0 if they are equal, which is intuitively also a rather common case.Whittle

You need to define your range for i and j. You could use a prime number for both.

public int hashCode() {
    // primes found via http://primes.utm.edu/curios/ ;)
    return 97654321 * i ^ 12356789 * j;
}
Wryneck answered 2/12, 2009 at 21:52 Comment(0)

I'd choose 7243. Large enough to avoid collisions with small numbers. Doesn't overflow to small numbers quickly.

Hathorn answered 2/12, 2009 at 22:11 Comment(2)
I use the first 1000 primes as a handy source of small prime numbers: primes.utm.edu/lists/small/1000.txtSphery
I don't think overflowing matters - if the prime is large enough, the result will be large even after the overflow. I was thinking of something like 1327144003.Whittle

I just want to point out that hashcode has nothing to do with primes. In the JDK implementation:

int h = 0;
char[] val = value;  // the String's backing char array
for (int i = 0; i < value.length; i++) {
    h = 31 * h + val[i];
}

I found that if you replace 31 with 27, the results are very similar.
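One way to check such a claim is a quick experiment; here is a hypothetical sketch that counts the distinct hash values each multiplier produces over all two-character printable-ASCII strings (class and method names made up):

import java.util.HashSet;
import java.util.Set;

public class MultiplierCheck {
    // Distinct values of m * first + second, i.e. the String.hashCode
    // scheme with multiplier m applied to strings of length 2.
    static int distinctHashes(int m) {
        Set<Integer> seen = new HashSet<>();
        for (char a = ' '; a <= '~'; a++)
            for (char b = ' '; b <= '~'; b++)
                seen.add(m * a + b);
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("31: " + distinctHashes(31) + " of 9025 pairs");
        System.out.println("27: " + distinctHashes(27) + " of 9025 pairs");
    }
}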

Fulcrum answered 15/10, 2016 at 5:25 Comment(5)
Primes are an easy way to ensure that every hashcode actually occurs, so that you don't waste any bit of the integer space to distribute them widely. I'm not quite sure whether there are other advantages. But you're right that 27 probably does that as well. So it's about equally bad as the original choice of 31 - you'll get very small hashcode collisions, too. ;-)Whittle
@Dr.Hans-PeterStörr For hash tables of sizes which are powers of two, all you need is an odd multiplier, prime or not. Prime multipliers are important for prime-sized tables as they don't have any factor in common (unless you're unlucky enough to use the same prime :D). AFAIK the only use of a prime-sized table in JDK is in String#intern.Vaginitis
@Vaginitis An odd multiplier is needed / sufficient for what exactly? As I discussed, hashcode collisions are bad for performance, and small multipliers generate more hashcode collisions, since small values for the attributes are more likely than large values.Whittle
@Dr.Hans-PeterStörr An odd multiplier is necessary in order not to lose information (the worst multipliers are those ending with many zeros in binary). Losing information is obviously bad and trivial to avoid. +++ We agree that small multipliers are also bad. +++ My point was primality. A multiplier like m = 101*103*107*109 is a disaster for a hash table of size 103 (but nobody uses such sizes). For a power-of-two-sized table it's most probably much better than 31. So it probably is for a table of a size co-prime to m.Vaginitis
@Vaginitis Yes, that's the obvious property the multiplier should satisfy. I was trying to point out that you can easily make it better if you look a little further, and reduce the hashcode collisions by putting just a little more thought into it. And those hurt performance no matter what the table size is.Whittle
