Implementation of Object.GetHashCode()
Asked Answered
C

1

16

I'm reading Effective C# and there is a comment about Object.GetHashCode() that I didn't understand:

Object.GetHashCode() uses an internal field in the System.Object class to generate the hash value. Each object created is assigned a unique object key, stored as an integer, when it is created.
These keys start at 1 and increment every time a new object of any type gets created. The object identity field is set in the System.Object constructor and cannot be modified later. Object.GetHashCode() returns this value as the hash code for a given object.

I tried to look at the documentation of Object.GetHashCode() and didn't find any information about this.

I wrote the simple piece of code to print the hash code of newly generated objects:

using System;

namespace TestGetHashCode
{
    class Program
    {
        static void Main(string[] args)
        {
            for (int i = 0; i < 100; i++)
            {
                object o = new object();
                Console.WriteLine(o.GetHashCode());
            }
        }
    }
}

The first few numbers that were printed were:

37121646,
45592480,
57352375,
2637164,
41014879,
3888474,
25209742,
26966483,
31884011

Which didn't seem to fit that

These keys start at 1 and increment every time a new object of any type gets created...Object.GetHashCode() returns this value

Then, in order to find this "internal field in the System.Object" I tried using ReSharper decompiled sources but the code I found was

[TargetedPatchingOptOut("Performance critical to inline across NGen image boundaries")]
[__DynamicallyInvokable]
public virtual int GetHashCode()
{
  return RuntimeHelpers.GetHashCode(this);
}

and again using decompiled sources I found that RuntimeHelpers.GetHashCode was implemented as

[SecuritySafeCritical]
[__DynamicallyInvokable]
[MethodImpl(MethodImplOptions.InternalCall)]
public static int GetHashCode(object o);

following the MethodImpl attribute it seems that I can't view the implementation and this is a dead end for me.

Can someone please explain the comment by the author (the first quote) ?

What is the internal field within the Object class and how it is used for the implementation of the Object.GetHashCode()?

Camembert answered 28/11, 2014 at 20:53 Comment(11)
This can be usefull: #720677Ecstatic
@Yuval Itzchakov: is it irony? Otherwise, why do you think so?Calctufa
@Yuval Itzchakov: why do you think GetHashCode() has to return a unique identifier? It's called Get*Hash*Code() not Get*UniqueIdentifier*() with purposeCalctufa
@Yuval Itzchakov: "that's not how you accomplish such a task" --- it is exactly how you do that. For a type that does not have any changeable members you don't have other choice. So if you like you may even implement it as a return 42; and everything will work (not as efficient as it could be, but nothing will break).Calctufa
You're right. For some reason i overlooked the fact that this is object. Although implementing it as return 42 will generate equality for all returned objects. Not sure what you ment by thatDonatist
@Yuval Itzchakov: "will generate equality for all returned objects" --- that's incorrect. You should never treat objects equal if their hash codes match. It actually is the opposite: the equal objects MUST return equal hash codes, but equal hash codes does not mean the objects are equal. msdn.microsoft.com/en-us/library/… "You should not assume that equal hash codes imply object equality."Calctufa
But that way you're leaving room for mistakes, as you can't control all underlying implementations who look solely at hashcode equality.Donatist
@Yuval Itzchakov: "who look solely at hashcode equality" --- no one should do that. If someone does something stupid - it's not the reason to encourage them for continuing doing that. It's not the case for .net. If it's the case for a 3rd party library - it must be reported as a bug and fixed.Calctufa
No one should, you're right. But you still wouldn't want to leave it up to them to make such mistakes by simply generating different hash codes.Donatist
@Yuval Itzchakov: I'm not sure what we are discussing here. There is a language specification that states how it should be implemented, whether one likes that or not.Calctufa
@YuvalItzchakov Its not a choice to decide wether you should implement hash codes without "false positives", you simply can't. A hash code is a finite number, an Int32 to be precise which means that Int64, for example, can not have a unique hash code for every posible value (there is bound to be at least two long with the same hash). String is another obvious examples, you can create practically infinite different strings which evidently can not have unique hashes...and the list goes on and on...Tyr
C
21

Okay, I'd better write this up. The book is very inaccurate. The value for Object.GetHashCode() is generated inside the CLR and is calculated on demand, whenever GetHashCode() is called the first time. I'll quote the code from the SSCLI20 distribution, clr/src/vm/thread.h has the function that produces the number, it looks like this (edited for readability):

inline DWORD GetNewHashCode()
{
    // Every thread has its own generator for hash codes so that we won't get into a 
    // situation where two threads consistently give out the same hash codes.
    // Choice of multiplier guarantees period of 2**32
    // see Knuth Vol 2 p16 (3.2.1.2 Theorem A).
    DWORD multiplier = m_ThreadId*4 + 5;
    m_dwHashCodeSeed = m_dwHashCodeSeed*multiplier + 1;
    return m_dwHashCodeSeed;
}

After which it is stored in the so-called sync block of the object so subsequent calls return the same value. Only 26 of the generated 32 bits are actually stored, the sync block needs space for some status bits. Still plenty good enough to generate a very high quality hash code, collisions are quite rare.

The presence of the m_ThreadId variable in that code can use an explanation. The random number generator seed is stored for each individual thread. A trick to avoid having to take a lock.

The m_dwHashCodeSeed is initialized in the Thread constructor like this:

   // Initialize this variable to a very different start value for each thread
   // Using linear congruential generator from Knuth Vol. 2, p. 102, line 24
   dwHashCodeSeed = dwHashCodeSeed * 1566083941 + 1;
   m_dwHashCodeSeed = dwHashCodeSeed;

with:

   static  DWORD dwHashCodeSeed = 123456789;
Clotho answered 28/11, 2014 at 22:33 Comment(3)
Thank you for the answer. From this code it seems that GetNewHashCode is a function that depends only on m_ThreadId so for the same thread it generates the same hash code in every call. What am I missing here ?Camembert
No, it depends on m_dwHashCodeSeed, updated each time a new hash code is generated. Every thread starts out with a different seed, bottom snippet. m_ThreadId just adds an extra level of randomness, Donald Knuth helped with that. Don't focus too much on threads being involved with this, it is just a lock-less code trick. Locks are expensive. There is some trickery involved with two threads calling GetHashCode at the same time, I left that out to keep it semi-comprehensible.Clotho
I understand how this works and thank you for the write up, but why in the world does it work like this? "Hoping" for no conflicts like they're doing is not needed right? Why not generate a hash for the object based on that object's unique identifier? Or, if none exist, use a UID that increments by 1 every time and store that in the SYNC block? It's the same as the above solution with just the absolute minimum rate of conflicts possible (conflicts only when the UID overflows)Equivocate

© 2022 - 2024 — McMap. All rights reserved.