Object.GetHashCode

Asked 16/7, 2009 at 19:28 Answered 16/7, 2009 at 20:38

My question may duplicate Default implementation for Object.GetHashCode() but I'm asking again because I didn't understand the accepted answer to that one.

To begin with I have three questions about the accepted answer to the previous question, which quotes some documentation as follows:

"However, because this index can be reused after the object is reclaimed during garbage collection, it is possible to obtain the same hash code for two different objects."

Is this true? It seems to me that two objects won't have the same hash code, because an object's code isn't reused until the object is garbage collected (i.e. no longer exists).

"Also, two objects that represent the same value have the same hash code only if they are the exact same object."

Is this a problem? For example, I want to associate some data with each of the node instances in a DOM tree. To do this, the 'nodes' must have an identity or hash code, so that I can use them as keys in a dictionary of the data. Isn't a hash code which identities whether it's "the exact same object", i.e. "reference equality rather than "value equality", what I want?

"This implementation is not particularly useful for hashing; therefore, derived classes should override GetHashCode"

Is this true? If it's not good for hashing, then what if anything is it good for, and why is it even defined as a method of Object?

My final (and perhaps most important to me) question is, if I must invent/override a GetHashCode() implementation for an arbitrary type which has "reference equality" semantics, is the following a reasonable and good implementation:

class SomeType
{
  //create a new value for each instance
  static int s_allocated = 0;
  //value associated with this instance
  int m_allocated;
  //more instance data
  ... plus other data members ...
  //constructor
  SomeType()
  {
    allocated = ++s_allocated;
  }
  //override GetHashCode
  public override int GetHashCode()
  {
    return m_allocated;
  }
}

Edit

FYI I tested it, using the following code:

    class TestGetHash
    {
        //default implementation
        class First
        {
            int m_x;
        }
        //my implementation
        class Second
        {
            static int s_allocated = 0;
            int m_allocated;
            int m_x;
            public Second()
            {
                m_allocated = ++s_allocated;
            }
            public override int GetHashCode()
            {
                return m_allocated;
            }
        }
        //stupid worst-case implementation
        class Third
        {
            int m_x;
            public override int GetHashCode()
            {
                return 0;
            }
        }

        internal static void test()
        {
            testT<First>(100, 1000);
            testT<First>(1000, 100);
            testT<Second>(100, 1000);
            testT<Second>(1000, 100);
            testT<Third>(100, 100);
            testT<Third>(1000, 10);
        }

        static void testT<T>(int objects, int iterations)
            where T : new()
        {
            System.Diagnostics.Stopwatch stopWatch =
                System.Diagnostics.Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
            {
                Dictionary<T, object> dictionary = new Dictionary<T, object>();
                for (int j = 0; j < objects; ++j)
                {
                    T t = new T();
                    dictionary.Add(t, null);
                }
                for (int k = 0; k < 100; ++k)
                {
                    foreach (T t in dictionary.Keys)
                    {
                        object o = dictionary[t];
                    }
                }
            }
            stopWatch.Stop();
            string stopwatchMessage = string.Format(
                "Stopwatch: {0} type, {1} objects, {2} iterations, {3} msec",
                typeof(T).Name, objects, iterations,
                stopWatch.ElapsedMilliseconds);
            System.Console.WriteLine(stopwatchMessage);
        }
    }

On my machine the results/output are as follows:

First type, 100 objects, 1000 iterations, 2072 msec
First type, 1000 objects, 100 iterations, 2098 msec
Second type, 100 objects, 1000 iterations, 1300 msec
Second type, 1000 objects, 100 iterations, 1319 msec
Third type, 100 objects, 100 iterations, 1487 msec
Third type, 1000 objects, 10 iterations, 13754 msec

My implementation takes half the time of the default implementation (but my type is bigger by the size of my m_allocated data member).

My implementation and the default implementation both scale linearly.

In comparison and as a sanity check, the stupid implementation starts bad and scales worse.

Incarnate answered 16/7, 2009 at 19:28 Comment(3)

Perhaps using volatile declaration and interlocked incrementors, for thread-safety. – Dixon 16/7, 2009 at 19:43

Would it even matter at all if there were an occasional accidental collision due to its not being thread-safe? I thought that the implementation of GetHashCode does not need to guarantee unique return values for different objects, and that instead it's sufficient if they're just mostly unique. – Incarnate 16/7, 2009 at 19:56

Very interesting to see that your simple implementation significantly increases lookup performance, at the cost of 4B / instance. An interesting trade-off. – Schram 9/2, 2018 at 22:52

The most important property a hash code implementation must have is this:

If two objects compare as equal then they must have identical hash codes.

If you have a class where instances of the class compare by reference equality, then you do not need to override GetHashCode; the default implementation guarantees that two objects that are the same reference have the same hash code. (You're calling the same method twice on the same object, so of course the result is the same.)

If you have written a class which implements its own equality that is different from reference equality then you are REQUIRED to override GetHashCode such that two objects that compare as equal have equal hash codes.

Now, you could do so by simply returning zero every time. That would be a lousy hash function, but it would be legal.

Other properties of good hash functions are:

GetHashCode should never throw an exception
Mutable objects which compare for equality on their mutable state, and therefore hash on their mutable state, are dangerously bug-prone. You can put an object into a hash table, mutate it, and be unable to get it out again. Try to never hash or compare for equality on mutable state.
GetHashCode should be extremely fast -- remember, the purpose of a good hash algorithm is to improve the performance of lookups. If the hash is slow then the lookups can't be made fast.
Objects which do not compare as equal should have dissimilar hash codes, well distributed over the whole range of a 32 bit integer

Pectize answered 16/7, 2009 at 20:21 Comment(3)

Would it be true, or too optimistic, to read that as, "the default object.GetHashCode is a good implemention of GetHashCode for types which use reference equality, i.e. for classes and/or interfaces (but not structs) for which you have not overriden object.Equals; where 'good implementation' means good/speedy performance when using these types as the key of a Dictionary." – Incarnate 16/7, 2009 at 20:43

Yes, the default implementation of GetHashCode that we provide for you is a pretty good one. – Pectize 16/7, 2009 at 21:16

"the purpose of a good hash algorithm is to improve the performance of lookups". This alone could be the best answer the OP can get. – Pursuant 3/8, 2011 at 15:3

Question:

Is this true? It seems to me that two objects won't have the same hash code, because an object's code isn't reused until the object is garbage collected (i.e. no longer exists).

Two objects may share the same hash code, if it is generated by default GetHashCode implementation, because:

Default GetHashCode result shouldn't be changed during object's lifetime, and default implementation ensures this. If it could change, such types as Hashtable couldn't deal with this implementation. That's because it's expected that default hash code is a hash code of unique instance identifier (even although there is no such identifier :) ).
Range of GetHashCode values is range of integer (2^32).

Conclusion: It's enough to allocate 2^32 strongly-referenced objects to (must be easy on Win64) to reach the limit.

Finally, there is an explicit statement in object.GetHashCode reference in MSDN: The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.

Bradfordbradlee answered 16/7, 2009 at 20:38 Comment(0)

You do not actually need to modify anything on a class which requires only reference equality.

Also, formally, that is not a good implementation since it has poor distribution. A hash function should have a reasonable distribution since it improves hash bucket distribution, and indirectly, performance in collections which use hash tables. As I said, that is a formal answer, one of the guidelines when designing a hash function.

Vashtee answered 16/7, 2009 at 19:44 Comment(2)

What's poor about the distribution? If, to map hashes to buckets, you divide the hash by the number of buckets and use the remainder, it seems to me that my implementation would distribute the objects equally/evenly over all buckets. – Incarnate 16/7, 2009 at 19:51

That would be true if that was the hash bucket mapping algorithm. See concentric.net/~Ttwang/tech/inthash.htm for example. Quote: "For a hash function, the distribution should be uniform. This implies when the hash result is used to calculate hash bucket address, all buckets are equally likely to be picked. In addition, similar hash keys should be hashed to very different hash results. Ideally, a single bit change in the hash key should influence all bits of the hash result." – Vashtee 16/7, 2009 at 20:5

Recommended topics

Hot tags