Why does Java's hashCode() in String use 31 as a multiplier?
Asked Answered
M

13

586

Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.

Why is 31 used as a multiplier?

I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?

Maculation answered 18/11, 2008 at 16:39 Comment(5)
Compare also #1836476 - I think 31 is a bad choice if you write your own hashCode functions.Belt
If it was 29, or 37, or even 97, you would be asking 'why not 31?'Lynnlynna
@EJP it is important to know the reason behind the choice of a no. unless the number is outcome of a black magic trick.Cavein
There is a blog post by @peter-lawrey about it here: vanilla-java.github.io/2018/08/12/… and here: vanilla-java.github.io/2018/08/15/…Horned
@DushyantSabharwal My point is that it could have been 29 or 37 or 97, or 41, or many other values, without making much practical difference. We were using 37 in 1976.Lynnlynna
R
483

According to Joshua Bloch's Effective Java, Second Edition (a book that can't be recommended enough, and which I bought thanks to continual mentions on Stack Overflow):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashCode when you override equals, page 48)

Rooney answered 18/11, 2008 at 18:53 Comment(16)
Well all primes are odd, except 2. Just sayin.Erythrism
I don't think Bloch is saying it was chosen because it was an odd prime, but because it was odd AND because it was prime (AND because it can easily be optimized into a shift/subtract).Rooney
31 was chosen coz it is an odd prime??? That doesnt make any sense - I say 31 was chosen because it gave the best distribution - check computinglife.wordpress.com/2008/11/20/…Macneil
I think the choice of 31 is rather unfortunate. Sure, it might save a few CPU cycles on old machines, but you have hash collisions already on short ascii strings like "@ and #! , or Ca and DB . This does not happen if you choose, for instance, 1327144003, or at least 524287 which also allows bitshift: 524287 * i == i << 19 - i.Belt
But why multiply it just pushes the low order bits from the item to the left, after enough items all your bits are pushed left and overflow. perhaps if they underflowed then things might be better.Electron
@hstoerr: I'd like to see your math on that. Even if you're right (which you most likely are for the 2 digit example), I think if you look at how hashes are used in Java, it doesn't really hurt things very much for there to be collisions in very short strings. They're not used for keys all that often.Reinforce
@Reinforce See my answer #1836476 . My point is: you get much less collisions if you use a larger prime, and lose nothing these days. The problem is worse if you use non-english languages with common non-ascii chars. And 31 served as a bad example for many programmers when writing their own hashCode functions.Belt
Does the optimisation really help now, given arithmetic units in modern processors?Unprofessional
@hstoerr I fully agree with your. Using 31 was a terrible idea. There should be no collisions for strings of length two and no collisions for ASCII strings of length four, but there are plenty. Reusing 31 everywhere is even worse, see e.g. this tiny rant of mine.Commercialism
@RichardCorfield I'm unsure if it makes sense, but the compiler still use it (see the comments to this answer).Commercialism
Every math student should know why it is prime --- It forms a group if and only if it is coprime with the group size.Storey
I am not sure why was this accepted as an answer. I mean it is simply copy paste from a reputed book!! A non math and a non computer science graduate like me was looking for an answer from a layman perspective. I am sorry but the answer is too intellectual for me. I have to do more google search now.Prynne
How is "multiplication by even number overflowed" ? Overflow can happen even with odd numbers correct ?Carolacarolan
@FrankQ. The issue isn't overflowing: that is inevitable, as you say. The issue is that multiplying by an even number guarantees that fewer bits contain "varying" information - the low bit becomes zero. Always zero. You've lost one bit of "variability". The result is a poorer distribution over possible hash values.Exonerate
Note that Google's AutoValue chose a much larger prime (1000003) for value object hash code calculation, since they found it to perform better than 31: https://mcmap.net/q/30371/-why-autovalue-annotation-uses-the-specific-integer-1000003-for-calculating-hash-codeKeeshakeeshond
isn't 31 * i == (i << 5) - i shifting involved here as well? How will this not overflow compared to i << 2 ?Dorweiler
S
103

Goodrich and Tamassia computed from over 50,000 English words (formed as the union of the word lists provided in two variants of Unix) that using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. This may be the reason that so many Java implementations choose such constants.

See section 9.2 Hash Tables (page 522) of Data Structures and Algorithms in Java.

Strongbox answered 18/11, 2008 at 20:56 Comment(1)
Note however that you might get WAY more collisions if you use any kind of international charset with common characters outside the ASCII range. At least, I checked this for 31 and German. So I think the choice of 31 is broken.Belt
E
61

On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:

RSB       r1, r0, r0, ASL #5    ; r1 := - r0 + (r0<<5)

Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers so it doesn't make much difference, so long as 32 goes on the correct side.

It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).

Edie answered 18/11, 2008 at 17:1 Comment(12)
Funny enough, the multiplication with 31 is on my desktop machine actually a little bit slower than multiplication with, say, 92821. I guess the compiler tries to "optimize" it into shift and add as well. :-)Belt
I don't think I've ever used an ARM which was not equally fast with all values in the range +/-255. Use of a power of 2 minus one has the unfortunate effect that a matching change to two values changes the hash code by a power of two. A value of -31 would have been better, and I would think something like -83 (64+16+2+1) might have been better yet (blenderize bits somewhat better).Skive
@Skive Not convinced by the minus. Seems you'd be heading back towards zeros. / String.hashCode predates the StrongARM which, IIRC, introduced an 8-bit multiplier and possibly increased to two cycles for the combined arithmetic/logical with shift operations.Edie
@TomHawtin-tackline: Using 31, the hash of four values would be 29791*a + 961*b + 31*c + d; using -31, it would be -29791*a + 961*b - 31*c + d. I don't think the difference would be significant if the four items are independent, but if pairs of adjacent items match, the resulting hash code will be the contribution of all unpaired items, plus some multiple of 32 (from the paired ones). For strings it may not matter too much, but if one is writing a general-purpose method for hashing aggregations, the situation where adjacent items match will be disproportionately common.Skive
The situation isn't quite so bad as the one resulting by taking the xor of items which are supposed to be unsequenced (as opposed to using an arithmetic sum). If one has an UnorderedPair where A and B things are "equal" if (a.first.equals(b.first) && a.second.equals(b.second)) || ((a.first.equals(b.second) && a.second.equals(b.first)), using a hash of first.hashCode()+second.hashCode() will lose one bit of information if first and second match; using first.hashCode() ^ second.hashCode() (which I've seen done) would lose all 32 bits of information.Skive
@Skive Yeah, it seems to me + gives a bit of "blurring" across adjacent bits which can be further shuffling. ^ doesn't have that, so seems a poorer choice. I believe early protocols used xor in check codes as it was mistakenly believed that hardware implementations would be easier by a worthwhile factor.Edie
@TomHawtin-tackline: There are a number of advantages to using xor in hardware implementations. Further, behavior of xor is orthogonal to that of addition and multiplication, using it in combination with those other methods may improve "blending". The big thing to watch out for in any event is cases where the disproportionate appearance of certain patterns of input (e.g. pairs of things which match) will cause some hash codes to appear much more often than they should.Skive
@Skive fun fact, the hash code of Map.Entry has been fixed by specification to be key.hashCode() ^ value.hashCode() despite it is not even an unordered pair, as key and value have entirely different meaning. Yes, that implies that Map.of(42, 42).hashCode() or Map.of("foo", "foo", "bar", "bar").hashCode(), etc, are predictably zero. So don’t use maps as keys for other maps…Megara
@Holger: What's sad about that is that using + rather than ^ would have been just as fast, but only lose one bit of hashcode (and sometimes not even that, since some types' hashcodes are never negative).Skive
@Skive and the performance losses due to hash collisions outclass any savings of a few CPU cycles in the hash calculation anyway…Megara
@Holger: Using a complicated hash to prevent the loss of a hash bit from x+y would probably not pay off, but a hash that returns zero for all keys is just plain tragic.Skive
@Skive exactly, you have make the right trade off. Even on a hypothetical CPU that could save a cycle when using ^ instead of +, the significantly higher number of collisions would eat that advantage. It’s only when the key and value have the same hash code, but it’s not too far-fetched to have such mappings.Megara
U
44

By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.

By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.

The expression n * 31 is equivalent to (n << 5) - n.

Uniliteral answered 19/5, 2009 at 18:10 Comment(0)
O
40

You can read Bloch's original reasoning under "Comments" in https://bugs.openjdk.org/browse/JDK-4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:

Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

So the reasoning was not as rational as many of the answers here seem to imply. But we're all good in coming up with rational reasons after gut decisions (and even Bloch might be prone to that).

Ohaus answered 10/2, 2016 at 0:46 Comment(0)
S
25

Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Both steps correspond to one LEA x86 instructions, so this is extremely fast.

In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.

Using 73 or 37 (instead of 31) might be better, because it leads to denser code: The two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy bridge architecture, with an increased latency of 3 cycles.

Moreover, 73 is Sheldon Cooper's favorite number.

Swine answered 27/7, 2011 at 19:37 Comment(4)
@Mainguy It's actually ALGOL syntax and is used fairly often in pseudo-code.Gripping
but in ARM assembly multiplication by 31 can be done in a single instructionAlexandria
@Mainguy In pseudo code what does := mean?Alexandria
In TPOP (1999) one can read about early Java (p.57): "... The problem was resolved by replacing the hash with one equivalent to the one we have shown (with a multiplier of 37) ..."Bowl
B
19

Neil Coffey explains why 31 is used under Ironing out the bias.

Basically using 31 gives you a more even set-bit probability distribution for the hash function.

Bingo answered 7/12, 2011 at 15:27 Comment(0)
B
18

From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen

The table below summarizes the performance of the various hash functions described above, for three data sets:

  1. All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).

  2. All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).

  3. A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters).

The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element).

                          Webster's   Code Strings    URLs
                          ---------   ------------    ----
Current Java Fn.          1.2509      1.2738          13.2560
P(37)    [Java]           1.2508      1.2481          1.2454
P(65599) [Aho et al]      1.2490      1.2510          1.2450
P(31)    [K+R]            1.2500      1.2488          1.2425
P(33)    [Torek]          1.2500      1.2500          1.2453
Vo's Fn                   1.2487      1.2471          1.2462
WAIS Fn                   1.2497      1.2519          1.2452
Weinberger's Fn(MatPak)   6.5169      7.2142          30.6864
Weinberger's Fn(24)       1.3222      1.2791          1.9732
Weinberger's Fn(28)       1.2530      1.2506          1.2439

Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function.

I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

Josh

Brebner answered 12/6, 2017 at 21:17 Comment(0)
L
7

Java String hashCode() and 31

This is because 31 has a nice property – it's multiplication can be replaced by a bitwise shift which is faster than the standard multiplication:

31 * i == (i << 5) - i
Lillian answered 18/7, 2019 at 18:5 Comment(1)
Simply changing 5 to a 7 (h << 7) + h is multiplication too (129), and results in fewer collisions. I wonder why they didn't use that. I suppose it was purely a prior art situation: simply trust what you've seen to work in the past rather than investigate what may be better. The 31 multiplier was making the rounds back then on Usenet.Phage
R
5

Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.

The numbers that make up using a hash are typically:

  • modulus of the data type you put it into (2^32 or 2^64)
  • modulus of the bucket count in your hashtable (varies. In java used to be prime, now 2^n)
  • multiply or shift by a magic number in your mixing function
  • The input value

You really only get to control a couple of these values, so a little extra care is due.

Reinforce answered 28/4, 2010 at 22:39 Comment(0)
S
5

In latest version of JDK, 31 is still used. https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/lang/String.html#hashCode()

The purpose of hash string is

  • unique (Let see operator ^ in hashcode calculation document, it help unique)
  • cheap cost for calculating

31 is max value can put in 8 bit (= 1 byte) register, is largest prime number can put in 1 byte register, is odd number.

Multiply 31 is <<5 then subtract itself, therefore need cheap resources.

Spanishamerican answered 18/2, 2019 at 2:3 Comment(0)
P
3

I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.

Pfeffer answered 18/11, 2008 at 16:58 Comment(0)
C
1

A big expectation from hash functions is that their result's uniform randomness survives an operation such as hash(x) % N where N is an arbitrary number (and in many cases, a power of two), one reason being that such operations are used commonly in hash tables for determining slots. Using prime number multipliers when computing the hash decreases the probability that your multiplier and the N share divisors, which would make the result of the operation less uniformly random.

Others have pointed out the nice property that multiplication by 31 can be done by a multiplication and a subtraction. I just want to point out that there is a mathematical term for such primes: Mersenne Prime

All mersenne primes are one less than a power of two so we can write them as:

p = 2^n - 1

Multiplying x by p:

x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x

Shifts (SAL/SHL) and subtractions (SUB) are generally faster than multiplications (MUL) on many machines. See instruction tables from Agner Fog

That's why GCC seems to optimize multiplications by mersenne primes by replacing them with shifts and subs, see here.

However, in my opinion, such a small prime is a bad choice for a hash function. With a relatively good hash function, you would expect to have randomness at the higher bits of the hash. However, with the Java hash function, there is almost no randomness at the higher bits with shorter strings (and still highly questionable randomness at the lower bits). This makes it more difficult to build efficient hash tables. See this nice trick you couldn't do with the Java hash function.

Some answers mention that they believe it is good that 31 fits into a byte. This is actually useless since:

(1) We execute shifts instead of multiplications, so the size of the multiplier does not matter.

(2) As far as I know, there is no specific x86 instruction to multiply an 8 byte value with a 1 byte value so you would have needed to convert "31" to a 8 byte value anyway even if you were multiplying. See here, you multiply entire 64bit registers.

(And 127 is actually the largest mersenne prime that could fit in a byte.)

Does a smaller value increase randomness in the middle-lower bits? Maybe, but it also seems to greatly increase the possible collisions :).

One could list many different issues but they generally boil down to two core principles not being fulfilled well: Confusion and Diffusion

But is it fast? Probably, since it doesn't do much. However, if performance is really the focus here, one character per loop is quite inefficient. Why not do 4 characters at a time (8 bytes) per loop iteration for longer strings, like this? Well, that would be difficult to do with the current definition of hash where you need to multiply every character individually (please tell me if there is a bit hack to solve this :D).

Colston answered 23/6, 2020 at 23:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.