Default implementation for Object.GetHashCode()
Asked Answered
S

7

198

How does the default implementation for GetHashCode() work? And does it handle structures, classes, arrays, etc. efficiently and well enough?

I am trying to decide in what cases I should pack my own and in what cases I can safely rely on the default implementation to do well. I don't want to reinvent the wheel, if at all possible.

Subjectivism answered 6/4, 2009 at 3:25 Comment(5)
Have a look at the comment I left on the article: https://mcmap.net/q/49271/-gethashcode-extension-methodShaum
See also https://mcmap.net/q/49272/-object-gethashcodeColier
Aside: you can obtain the default hashcode (even when GetHashCode() has been overridden) by using System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode(obj)Burch
@MarcGravell thank you for contributing this, I was searching for exactly this answer.Tranche
@MarcGravell But how would I do this with other method?Conation
G
94
namespace System {
    public class Object {
        [MethodImpl(MethodImplOptions.InternalCall)]
        internal static extern int InternalGetHashCode(object obj);

        public virtual int GetHashCode() {
            return InternalGetHashCode(this);
        }
    }
}

InternalGetHashCode is mapped to an ObjectNative::GetHashCode function in the CLR, which looks like this:

FCIMPL1(INT32, ObjectNative::GetHashCode, Object* obj) {  
    CONTRACTL  
    {  
        THROWS;  
        DISABLED(GC_NOTRIGGER);  
        INJECT_FAULT(FCThrow(kOutOfMemoryException););  
        MODE_COOPERATIVE;  
        SO_TOLERANT;  
    }  
    CONTRACTL_END;  

    VALIDATEOBJECTREF(obj);  

    DWORD idx = 0;  

    if (obj == 0)  
        return 0;  

    OBJECTREF objRef(obj);  

    HELPER_METHOD_FRAME_BEGIN_RET_1(objRef);        // Set up a frame  

    idx = GetHashCodeEx(OBJECTREFToObject(objRef));  

    HELPER_METHOD_FRAME_END();  

    return idx;  
}  
FCIMPLEND

The full implementation of GetHashCodeEx is fairly large, so it's easier to just link to the C++ source code.

Greenfield answered 6/4, 2009 at 3:43 Comment(9)
That documentation quote must have come from a very early version. It is no longer written like this in current MSDN articles, probably because it is quite wrong.Brumbaugh
They changed the wording, yes, but it still says basically the same thing: "Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes."Greenfield
Why does the documentation claim that implementation is not particularly useful for hashing? If an object is equal to itself and nothing else, any hash code method which will always return the same value for a given object instance, and will generally return different values for different instances, what's the problem?Elviselvish
@Elviselvish Also, two objects that represent the same value have the same hash code only if they are the exact same object. Consider if you used this hash for strings (disregard string interning for the moment): new string('x', 5).GetHashCode() != new string('x', 5).GetHashCode() because these two strings are not the exact same object. Same value, different object. If you put one of them into a hash set (e.g. as a Dictionary key) you'd never be able to look them up again unless by coincidence: using the hashcode, myDictionary["xxxxx"] would likely look in the wrong hash bucket.Emmanuelemmeline
@ta.speot.is: If what you want is to determine whether a particular instance has already been added into a dictionary, reference equality is perfect. With strings, as you note, one is usually more interested in whether a string containing the same sequence of characters has already been added. That's why string overrides GetHashCode. On the other hand, suppose you want to keep a count of how many times various controls process Paint events. You could use a Dictionary<Object, int[]> (every int[] stored would hold exactly one item).Elviselvish
@ta.speot.is: When you get a Paint event from one of the controls you're watching, you could use MyControlCounts[Sender][0]++; (or some variation with TryGetValue). Even if the controls happened to define some form of value equality, that wouldn't be what you were interested in. You'd want to use reference equality along with the default (reference-based) hash code.Elviselvish
@Elviselvish If the reference is the value you want to track, I imagine the default implementation of GetHashCode is fine.Emmanuelemmeline
@It'sNotALie. Then thank Archive.org for having a copy ;-)Cabrales
matrix transform vectorZela
B
108

For a class, the defaults are essentially reference equality, and that is usually fine. If writing a struct, it is more common to override equality (not least to avoid boxing), but it is very rare you write a struct anyway!

When overriding equality, you should always have a matching Equals() and GetHashCode() (i.e. for two values, if Equals() returns true they must return the same hash-code, but the converse is not required) - and it is common to also provide ==/!= operators, and often to implement IEquatable<T> too.


These days, when generating a hash, the HashCode utility type is very useful; for example:

return HashCode.Combine(field1, field2); // multiple overloads available here

When that isn't available:

For generating the hash code, it is common to use a factored sum, as this avoids collisions on paired values - for example, for a basic 2 field hash:

unchecked // disable overflow, for the unlikely possibility that you
{         // are compiling with overflow-checking enabled
    int hash = 27;
    hash = (13 * hash) + field1.GetHashCode();
    hash = (13 * hash) + field2.GetHashCode();
    return hash;
}

This has the advantage that:

  • the hash of {1,2} is not the same as the hash of {2,1}
  • the hash of {1,1} is not the same as the hash of {2,2}

etc - which can be common if just using an unweighted sum, or xor (^), etc.

Burch answered 6/4, 2009 at 4:29 Comment(5)
Excellent point about the benefit of a factored-sum algorithm; something I had not realised before!Programmer
Won't the factored sum (as written above) cause overflow exceptions occasionally?Suzy
@Suzy yes, it should be performed unchecked. Fortunately, unchecked is the default in C#, but it would be better to make it explicit; editedBurch
Could someone elaborate on the choice of 27 and 13?Uttica
@Uttica pretty arbitrary, to be honest; these days, I would say "use HashCode.Combine instead"; in fact, I'll edit the answerBurch
G
94
namespace System {
    public class Object {
        [MethodImpl(MethodImplOptions.InternalCall)]
        internal static extern int InternalGetHashCode(object obj);

        public virtual int GetHashCode() {
            return InternalGetHashCode(this);
        }
    }
}

InternalGetHashCode is mapped to an ObjectNative::GetHashCode function in the CLR, which looks like this:

FCIMPL1(INT32, ObjectNative::GetHashCode, Object* obj) {  
    CONTRACTL  
    {  
        THROWS;  
        DISABLED(GC_NOTRIGGER);  
        INJECT_FAULT(FCThrow(kOutOfMemoryException););  
        MODE_COOPERATIVE;  
        SO_TOLERANT;  
    }  
    CONTRACTL_END;  

    VALIDATEOBJECTREF(obj);  

    DWORD idx = 0;  

    if (obj == 0)  
        return 0;  

    OBJECTREF objRef(obj);  

    HELPER_METHOD_FRAME_BEGIN_RET_1(objRef);        // Set up a frame  

    idx = GetHashCodeEx(OBJECTREFToObject(objRef));  

    HELPER_METHOD_FRAME_END();  

    return idx;  
}  
FCIMPLEND

The full implementation of GetHashCodeEx is fairly large, so it's easier to just link to the C++ source code.

Greenfield answered 6/4, 2009 at 3:43 Comment(9)
That documentation quote must have come from a very early version. It is no longer written like this in current MSDN articles, probably because it is quite wrong.Brumbaugh
They changed the wording, yes, but it still says basically the same thing: "Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes."Greenfield
Why does the documentation claim that implementation is not particularly useful for hashing? If an object is equal to itself and nothing else, any hash code method which will always return the same value for a given object instance, and will generally return different values for different instances, what's the problem?Elviselvish
@Elviselvish Also, two objects that represent the same value have the same hash code only if they are the exact same object. Consider if you used this hash for strings (disregard string interning for the moment): new string('x', 5).GetHashCode() != new string('x', 5).GetHashCode() because these two strings are not the exact same object. Same value, different object. If you put one of them into a hash set (e.g. as a Dictionary key) you'd never be able to look them up again unless by coincidence: using the hashcode, myDictionary["xxxxx"] would likely look in the wrong hash bucket.Emmanuelemmeline
@ta.speot.is: If what you want is to determine whether a particular instance has already been added into a dictionary, reference equality is perfect. With strings, as you note, one is usually more interested in whether a string containing the same sequence of characters has already been added. That's why string overrides GetHashCode. On the other hand, suppose you want to keep a count of how many times various controls process Paint events. You could use a Dictionary<Object, int[]> (every int[] stored would hold exactly one item).Elviselvish
@ta.speot.is: When you get a Paint event from one of the controls you're watching, you could use MyControlCounts[Sender][0]++; (or some variation with TryGetValue). Even if the controls happened to define some form of value equality, that wouldn't be what you were interested in. You'd want to use reference equality along with the default (reference-based) hash code.Elviselvish
@Elviselvish If the reference is the value you want to track, I imagine the default implementation of GetHashCode is fine.Emmanuelemmeline
@It'sNotALie. Then thank Archive.org for having a copy ;-)Cabrales
matrix transform vectorZela
J
12

Since I couldn't find an answer that explains why we should override GetHashCode and Equals for custom structs and why the default implementation "is not likely to be suitable for use as a key in a hash table", I'll leave a link to this blog post, which explains why with a real-case example of a problem that happened.

I recommend reading the whole post, but here is a summary (emphasis and clarifications added).

Reason the default hash for structs is slow and not very good:

The way the CLR is designed, every call to a member defined in System.ValueType or System.Enum types [may] cause a boxing allocation [...]

An implementer of a hash function faces a dilemma: make a good distribution of the hash function or to make it fast. In some cases, it's possible to achieve them both, but it is hard to do this generically in ValueType.GetHashCode.

The canonical hash function of a struct "combines" hash codes of all the fields. But the only way to get a hash code of a field in a ValueType method is to use reflection. So, the CLR authors decided to trade speed over the distribution and the default GetHashCode version just returns a hash code of a first non-null field and "munges" it with a type id [...] This is a reasonable behavior unless it's not. For instance, if you're unlucky enough and the first field of your struct has the same value for most instances, then a hash function will provide the same result all the time. And, as you may imagine, this will cause a drastic performance impact if these instances are stored in a hash set or a hash table.

[...] Reflection-based implementation is slow. Very slow.

[...] Both ValueType.Equals and ValueType.GetHashCode have a special optimization. If a type does not have "pointers" and is properly packed [...] then more optimal versions are used: GetHashCode iterates over an instance and XORs blocks of 4 bytes and Equals method compares two instances using memcmp. [...] But the optimization is very tricky. First, it is hard to know when the optimization is enabled [...] Second, a memory comparison will not necessarily give you the right results. Here is a simple example: [...] -0.0 and +0.0 are equal but have different binary representations.

Real-world issue described in the post:

private readonly HashSet<(ErrorLocation, int)> _locationsWithHitCount;
readonly struct ErrorLocation
{
    // Empty almost all the time
    public string OptionalDescription { get; }
    public string Path { get; }
    public int Position { get; }
}

We used a tuple that contained a custom struct with default equality implementation. And unfortunately, the struct had an optional first field that was almost always equals to [empty string]. The performance was OK until the number of elements in the set increased significantly causing a real performance issue, taking minutes to initialize a collection with tens of thousands of items.

So, to answer the question "in what cases I should pack my own and in what cases I can safely rely on the default implementation", at least in the case of structs, you should override Equals and GetHashCode whenever your custom struct might be used as a key in a hash table or Dictionary.
I would also recommend implementing IEquatable<T> in this case, to avoid boxing.

As the other answers said, if you're writing a class, the default hash using reference equality is usually fine, so I wouldn't bother in this case, unless you need to override Equals (then you would have to override GetHashCode accordingly).

Jemimah answered 20/7, 2018 at 4:14 Comment(0)
P
7

The documentation for the GetHashCode method for Object says "the default implementation of this method must not be used as a unique object identifier for hashing purposes." and the one for ValueType says "If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table.".

The basic data types like byte, short, int, long, char and string implement a good GetHashCode method. Some other classes and structures, like Point for example, implement a GetHashCode method that may or may not be suitable for your specific needs. You just have to try it out to see if it's good enough.

The documentation for each class or structure can tell you if it overrides the default implementation or not. If it doesn't override it you should use your own implementation. For any classes or structs that you create yourself where you need to use the GetHashCode method, you should make your own implementation that uses the appropriate members to calculate the hash code.

Phosphide answered 6/4, 2009 at 4:2 Comment(7)
I'd disagree that you should routinely add your own implementation. Simply, the vast majority of classes (in particular) will never be tested for equality - or where they are, the inbuilt reference equality is fine. In the (already rare) occasion of writing a struct, it would be more common, true.Burch
@Marc Gravel: That's of course not what I meant it to say. I will adjust the last paragraph. :)Phosphide
Basic data types do not implement a good GetHashCode method, at least in my case. For example, GetHashCode for int returns the number itself: (123).GetHashCode() returns 123.Officeholder
@user502144 And what's wrong with that? It's a perfect unique identifier that's easy to calculate, with no false positives on equality ...Yvonneyvonner
@Richard Rast: It's OK except keys can be badly distributed when used in a Hashtable. Take a look at this answer: https://mcmap.net/q/49273/-hashtables-dictionary-etc-with-integer-keysOfficeholder
@user502144 Unless your hash table users an array that's over 2 trillion parts long you're gonna have to process the hash anyway... probably take the modulo of some prime number. Anyway, in the real world you're unlikely to need a hash table that's different than Dictionary<TKey, TValue which already handles this.Winding
the default implementation of this method must not be used as a unique object identifier for hashing purposes and yet, WPF uses it #42579206 fairlySeraphim
D
4

Until now, the default GetHashCode implementation for object is unrelated to the object itself and should be unique for each object. And here's the code:

    inline DWORD GetNewHashCode()
    {
        LIMITED_METHOD_CONTRACT;
        // Every thread has its own generator for hash codes so that we won't get into a situation
        // where two threads consistently give out the same hash codes.
        // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A).
        DWORD multiplier = GetThreadId()*4 + 5;
        m_dwHashCodeSeed = m_dwHashCodeSeed*multiplier + 1;
        return m_dwHashCodeSeed;
    }

Here's the call stack:

Thread::GetNewHashCode

Object::ComputeHashCode

Object::GetHashCodeEx

Drainage answered 9/10, 2022 at 9:32 Comment(0)
Z
2

Generally speaking, if you're overriding Equals, you want to override GetHashCode. The reason for this is because both are used to compare equality of your class/struct.

Equals is used when checking Foo A, B;

if (A == B)

Since we know the pointer isn't likely to match, we can compare the internal members.

Equals(obj o)
{
    if (o == null) return false;
    MyType Foo = o as MyType;
    if (Foo == null) return false;
    if (Foo.Prop1 != this.Prop1) return false;

    return Foo.Prop2 == this.Prop2;
}

GetHashCode is generally used by hash tables. The hashcode generated by your class should always be the same for a classes give state.

I typically do,

GetHashCode()
{
    int HashCode = this.GetType().ToString().GetHashCode();
    HashCode ^= this.Prop1.GetHashCode();
    etc.

    return HashCode;
}

Some will say that the hashcode should only be calculated once per object lifetime, but I don't agree with that (and I'm probably wrong).

Using the default implementation provided by object, unless you have the same reference to one of your classes, they will not be equal to each other. By overriding Equals and GetHashCode, you can report equality based on internal values rather than the objects reference.

Zedoary answered 6/4, 2009 at 3:42 Comment(3)
The ^= approach is not a particularly good approach for generating a hash - it tends to lead to a lot of common/predictable collisions - for example if Prop1 = Prop2 = 3.Burch
If the values are the same, I don't see a problem with the collision as the objects are equal. The 13 * Hash + NewHash seems interesting though.Zedoary
Ben: try it for Obj1 { Prop1 = 12, Prop2 = 12 } and Obj2 { Prop1 = 13, Prop2 = 13 }Commonage
S
1

If you're just dealing with POCOs you can use this utility to simplify your life somewhat:

var hash = HashCodeUtil.GetHashCode(
           poco.Field1,
           poco.Field2,
           ...,
           poco.FieldN);

...

public static class HashCodeUtil
{
    public static int GetHashCode(params object[] objects)
    {
        int hash = 13;

        foreach (var obj in objects)
        {
            hash = (hash * 7) + (!ReferenceEquals(null, obj) ? obj.GetHashCode() : 0);
        }

        return hash;
    }
}
Strobile answered 20/3, 2019 at 5:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.