Hash code non-zero initial value - note: I am not asking about primes
Asked Answered
M

4

6

This is kind of an academic point, but I feel I don't fully understand hash codes if I don't understand why this is recommended by books such as Effective Java and many SO questions.

Suppose:

public sealed class Point
{
    private readonly int x;
    private readonly int y;

    //constructor ommited

    //equals ommited
    
    public override int GetHashcode()
    {
       int hash = 17; //why should the initial value be non-zero?
       unchecked
       {
         hash = hash * 31 + x; //do not tell me why I should use primes - that is not the question
         hash = hash * 31 + y;
         return hash;
       }
    }
}

Now, supposedly, the reason for the initial value is that it reduces collisions where one of the components is zero.

I'm struggling to find any example where this helps.

Here is one example of a collision, but having an initial value makes no odds.

x   y   Hash Without initial value     Hash With initial value  
0   31  31                             16368                
1   0   31                             16368                

Ideally, I'm looking for a concrete example where the initial value prevents a collision.

My theory on why an initial value can never make a difference

//Given a prime p, initial value i, fields a,b,c, calculate hash h
h = i;
h = h*p + a;
h = h*p + b;
h = h*p + c;

Therefore:

h = ((i*p + a)*p + b)*p + c
  = (ipp + ap + b   )*p + c
  = ippp + app + bp + c

Therefore the inital value i will effect all hash codes in the same way by producing a constant value, in this case i*p3.

Manno answered 19/11, 2012 at 15:20 Comment(13)
#3613602 and in particular https://mcmap.net/q/30382/-why-use-a-prime-number-in-hashcodeGlenn
@Glenn thanks, but I'm not querying the wisdom of the prime, I'm querying the wisdom of having an arbitary initial value, the 17 here. And I get the word arbitary from Effective Java.Manno
I'm not sure I understand your implementation. I'd multiply x and y by two different prime numbers (that would indeed avoid trivial collisions since prime_1*y is never equal to prime_2*x). I suspect having 17 is a way to avoid that some particular collisions happen more often than other (i.e. it affects the distribution of the collisions). So indeed your two cases that generate collisions still do that, but adding 17 (or some other prime number) may make them less likelyAustreng
This seems to be a pretty thorough explanation: computinglife.wordpress.com/2008/11/20/…Matrices
@user1638891 It's pretty standard aproach from Effective Java see here (uses 23 where I use 31) #263900Manno
Why is this tagged as Java and C#? Where exactly does this Point class come from, considering there is a buil-in Point class in .NET this question makes zero sense because of the C# tag.Moorish
@JesseC.Slicer unless I missed it, that doesn't seem to mention the initial value.Manno
@Ramhound: point is purely used for illustrative purposes, i think we can all agree the main point of the question is what is the benefit of a prime seed value in a custom hash code, which is definitely applicable for both .NET and Java...Conchita
@Ramhound It's just an example, and getHashcode/GetHashcode is common to both langauges. Also, there is not a built-in Point class. There is a built-in Point struct. What's more, my Point class is immutable, so it's a different beast.Manno
@Manno - Point...Class you knew exacly what I meant. The 17 is arbitary which means there is no reason. Its only attempt to make more unique values possible.Moorish
It is not a duplicate! I am asking about initial value, not questioning use of prime!Manno
@weston, did you ever find an explanation for this initial value? Just noticed that my IDE uses the same (Java equivalent) code - just with 3 as the initial value—been trying to understand in which circumstances it helps...Orrery
No I did not. And I do not use an initial value now.Manno
M
0

The choice of initial value can never make a difference to the hash.

Example:

//Given a prime p, initial value i, fields a,b,c, calculate hash h
h = i;
h = h*p + a;
h = h*p + b;
h = h*p + c;
h = h % 2^32;

Therefore:

h = (((ip  + a) * p + b) * p + c) % 2^32
  = (( ip² + ap     + b) * p + c) % 2^32
  = (  ip³ + ap²    + bp     + c) % 2^32
  = ip³ % 2^32 + (ap² + bp + c) % 2^32

Therefore the initial value i will effect all hash codes in the same way by adding a constant value to the hash, in this case i*p³ % 2^32.

Manno answered 18/12, 2020 at 15:26 Comment(0)
H
2

The initial value must be a prime number. Why? Because say you are hashing to get an index for an array of length = 20: [object.getHash()%20] is the index of the array where you will want to store your object. If you had used an even number: half of the addresses of your data structure would never be used...this is why you need to use an initial value: to minimize collisions...and maximize data structure usage

Hunkers answered 19/11, 2012 at 15:31 Comment(11)
OK, give me an example where the initial value saves a collision then.Manno
in your algorithm with initial hash = 0, (0,1) and (0,21) would both hash to index 1 in an array of 20 having an initial value renders very hard to find a collisionHunkers
@weston: Not sure about the prime in particular for the seed value, but it's definitely better for it to be non-zero so you don't reduce the first hash component to just x.Conchita
@doctorkiller The initial value doesn't help in that example either.Manno
@JamesMichaelHare This is all I hear, it is better, but can anyone give me a concrete example of a case where it helps?Manno
@weston: you are right...the problem is that your hash computation should not end with "+y" ... it's the flaw of your algorythmHunkers
@Manno all I can think of is that if we have a constant there, the resulting numbers will be larger, and will overflow more easily, resulting in more "wildly" varying hash codes... But that is just a real wild guess, and really doesn't answer why it is a prime...Kingfisher
@weston: interesting, was reading where Jon Skeet suggested the same Effective Java-esque algorithm and in the comments he even states --- "I don't know the exact details of all the reasons behind it, but starting with 0 would mean that the hash would stay as zero if the individual field hashes were zero... and that's probably not uncommon (e.g. an integer of value zero will probably hash to zero). Just a guess, but I suspect that having a constant which gets propagated with each field is useful. This is really just copied from Effective Java though :)Conchita
That's from this post, BTW: #263900Conchita
@JamesMichaelHare I thought you understood my question, why did you vote to close it as duplicate?Manno
I had voted to close before a lot of the discussion got into the particular details in question. I have since voted to re-open.Conchita
E
1

Using prime numbers have shown via experiment and testing that have good properties for hash functions.
Also hard-coded numbers you see in existing libraries e.g. 31 in Java have been found during testing that they are good options. As far as I know there is no some proof behind the choices of these "magic" numbers.They were selected only after field testing

Update:
If you use zero as initial value then your hash will be affected by member variables also zero.
E.g. hash = hash * 31 + x; will be 0 if x is 0 and your initial value is also 0.
Then you end up with y which could be 0 as well or a number that could happen to be very common in your application domain and end up with collisions

Eyepiece answered 19/11, 2012 at 15:45 Comment(8)
Thanks, but the question is about the need for a non-zero initial value, not the use of primes.Manno
@weston:1)The initial value you use is prime 2)If the initial value was zero here:hash = hash * 31 + x; you would get hash=xEyepiece
Yes, so? x is a valid hash. My query is, we're told this reduces collisions. Under what circumstances? I believe I have shown this does not. See my theorum.Manno
thanks, but who cares if the hash can clash with objects of different type. When would that be a problem? Do you store multiple types in one hashmap/dictionary? Besides, people who implement this will just use the same initial value over and over, beause they don't appreciate what it does and don't change it each object type, so negating that argument somewhat.Manno
1)I don't see that this is an issue only for raw containers 2)Using an initial value (the same as you point out) would not negate the point, it would have collisions but a lot lessEyepiece
raw containers being hashmap/dictionary etc? If so, what else is getHashcode used for?Manno
I don't know C# but in Java raw container in this case hashmap/dictionary/hashtable are collections that are not parameterized and could contain different type of objects.This was the nor in prior Java1.5Eyepiece
Ok, well smelly as that is, using initial values to help you store multiple types in one collection. That's one possibility. Have a hard earned +1!Manno
S
0

Initial value can be useful to distinguish between objects of different classes.

The hash function that you show above is just not very good, resulting very easily in collisions for objects with different property values. The idea of a hash function is that it results in a unique, or almost unique, value depending on the public properties.

So to get values that are as unique as possible:

  • use a good hash function that results in a nice distribution
  • use a proper initial value that distinguishes even more so that the chance of a Point and a Line returning the same hash becomes smaller.
Slur answered 19/11, 2012 at 15:31 Comment(2)
Just so you can store lines and points in the same hashset/dictionary? Or am I missing another reason?Manno
So that you can distinguish them for whatever reason -- but keeping in mind that collisions are not always avoidable. If you don't want collisions you must use GUID identifiers.Slur
M
0

The choice of initial value can never make a difference to the hash.

Example:

//Given a prime p, initial value i, fields a,b,c, calculate hash h
h = i;
h = h*p + a;
h = h*p + b;
h = h*p + c;
h = h % 2^32;

Therefore:

h = (((ip  + a) * p + b) * p + c) % 2^32
  = (( ip² + ap     + b) * p + c) % 2^32
  = (  ip³ + ap²    + bp     + c) % 2^32
  = ip³ % 2^32 + (ap² + bp + c) % 2^32

Therefore the initial value i will effect all hash codes in the same way by adding a constant value to the hash, in this case i*p³ % 2^32.

Manno answered 18/12, 2020 at 15:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.