Why @AutoValue annotation uses the specific integer 1000003 for calculating hash code?
Asked Answered
B

1

2

Java hash code generation code often uses prime numbers in its calculations. There are good reasons for this, as explained in Why use a prime number in hashCode? and elsewhere.

For example, AutoValue will generate the following hash code for a given value class:

@Override
public int hashCode() {
  int h = 1;
  h *= 1000003;
  h ^= this.firstName.hashCode();
  h *= 1000003;
  h ^= this.lastName.hashCode();
  h *= 1000003;
  h ^= this.age;
  return h;
}

What is the reason behind AutoValue using the specific integer 1000003 instead of some other prime number? If I use IntelliJ to create an overridden hashCode method, it uses integer 31. Is there some logical and mathematical reasoning behind using the integer 1000003 to calculate hash codes, rather than some other prime number? A Google search did not give me any answer to this.

Its curious to know what the authors were thinking.

Bimonthly answered 27/6, 2018 at 23:14 Comment(11)
Im pretty sure 10^6+3 is a prime, which makes it less likely for a hash collision to happen than a composite number.Idocrase
A prime number is needed to avoid hash collisions as @AJC mentions. My guess is someone might have decided to "use the 1st prime number over a million" for possibly arbitrary reasons. It would be interesting to know if there is a deeper reason.Hasin
yeah, that seems like a googly thing to do..Bimonthly
Note that Java's Object.hash, Arrays.hashCode and List.hashCode all use the same implementation with 31 as their prime number. If that's what Java is using, I suspect there's no benefit to 1000003 instead of 31.Credential
@M.Justin: "why was this chosen" questions are notoriously hard to do on SO, because they tend to not have "correct" answers unless the designers themselves show up. My suggestion is to contact the authors of the code directly and ask if anyone can remember the specific reasons for choosing that number.Sustainer
@JoachimSauer Fair enough. One thing that's not always apparent as a Stack Overflow curator is whether a question should be reopened because its close reason is invalid, or whether it should remain closed because although its close reason is invalid, it is still not a good fit for the site for other reasons. I don't think Stack Overflow currently has a good solution to this problem, since people don't want to waste their time reopening and reclosing a question closed for the wrong reason. Though maybe that's actually implemented now: meta.stackexchange.com/a/382937/308327.Credential
True, I've run into this issue myself a few times and usually simply stop interacting with the question ;-) Unrelated, I was curious if the history of autovalue gives a hint as to the answer of this question, but it seems that all public history of this project already contained that magic constant: github.com/google/auto/commit/…Sustainer
@JoachimSauer Yep, that's where my personal investigations stalled out as well. "An initial import of @autovalue"Credential
@JoachimSauer Your advice has been followed, authors have been asked, and an explanation has been given: github.com/google/auto/discussions/1516. Kevin Bourrillion: "use a hash calculation that was found by [another person] to perform far better than *31+"; "[…] I don't remember the details […] shift things over a golden-ratio fraction of the way. […] I guess I also thought it should be a nice simple number to the human eye. […] the idea of making it bigger was just to eat up all those initial zeros more quickly and get the bits to come back around and interfere with each other."Credential
@JoachimSauer Which means that in addition to not being a duplicate, this question now has a concrete answer that can be given. Which I'd say argues much more strongly for reopening the question.Credential
Just a guess: when using 31 as multiplier, the number pairs (0, 31) and (1, 0) produce the same hash code. By using a prime number that is larger than the typical numbers that are hashed, there is less chance of such a conflict.Haynes
C
1

According to an internal Google commit, 1000003 was chosen because a former Google employee found it to perform better than 31:

use a hash calculation that was found by [another person] to perform far better than *31+

When asked about this, AutoValue developer Kevin Bourrillion explained reasons the number might have been chosen:

Although I don't remember the details.... it is kind of exactly like me to shift things over a golden-ratio fraction of the way. Though that might argue for a multiplier of 898,459, so I guess I also thought it should be a nice simple number to the human eye.

But yeah, the idea of making it bigger was just to eat up all those initial zeros more quickly and get the bits to come back around and interfere with each other.

He also pointed out that a larger number reduces the chance of a hash collision in real-world scenarios:

Also: (using Integers for simplicity) with List.of(a, b) and List.of(c, d) you get a collision if b - d happens to be 31 times c - a. And it's just a little easier to imagine real-world circumstances where that might happen to be true than it is with a much larger multiplier. No exact science here, though. Object.hashCode is never exactly going to be a high-quality hash function in any case.

Credential answered 16/5, 2023 at 15:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.