Why does C# not implement GetHashCode for Collections?
Asked Answered
H

7

22

I am porting something from Java to C#. In Java the hashcode of a ArrayList depends on the items in it. In C# I always get the same hashcode from a List...

Why is this?

For some of my objects the hashcode needs to be different because the objects in their list property make the objects non-equal. I would expect that a hashcode is always unique for the object's state and only equals another hashcode when the object is equal. Am I wrong?

Harpp answered 25/5, 2010 at 18:27 Comment(4)
You mean .Net, not C#.Elsaelsbeth
Having equal hashcodes is not a guarantee of equality. That's not what hashcode() is for.Fleabite
if it's true that Lists don't change their hashcode at all, then that seems like a sub-standard implementation. It's half-way reasonable to expect that changing contents of a list would in general change the hashcode to improve hashing. Otherwise putting in a dozen lists into a hashtable will cause a dozen collisions, which is far from ideal. I'm not saying that the hashes need to be unique, but at least have some distribution.Lyontine
It seems to me that all most answers here were sidetracked into talking about immutability or hashcode guarantees, instead of answering the original question. In C# collections, GetHashCode does not use the items in it because it matches collection equality, that is, instance equality, not structural equality.Benedikt
E
19

In order to work correctly, hashcodes must be immutable – an object's hash code must never change.

If an object's hashcode does change, any dictionaries containing the object will stop working.

Since collections are not immutable, they cannot implement GetHashCode.
Instead, they inherit the default GetHashCode, which returns a (hopefully) unique value for each instance of an object. (Typically based on a memory address)

Elsaelsbeth answered 26/5, 2010 at 13:16 Comment(8)
+1 - only answer to mention dictionaries and immutable hashcodesArtery
Wish I could -1. If the Hashcode of a mutable object does not change, it will violate the equals contract, if two objects change from different (=hashcode doesn't matter) to equal (=hashcode absolutely must be the same, though it was most likely not the same before the change). It is in the responsibility of the coder to make sure that a mutable object does not change at all, while its hashcode is in use. If an object that is contained in a hashed dictionary is changed, this is a bug - regardless of the hashcode implementation.Hog
@Alex: I meant that a mutable object should not have a hashcode, period.Elsaelsbeth
I did understand, but that's not necessary - a mutable object may have a hashcode, but then it is necessary to never change it while the hashcode is used. And that's easy, too: readonly flag in "GetHashcode", ClearReadOnly method, in all writing methods and properties: "if (readonly) { throw new ReadonlyException(); }". This way, you are forced to think and explicitly call ClearReadOnly(), before an object can be changed. It is easier to have hashed objects immutable, but not necessary. Haviong to list mutable hashless objects is a very good performance killer.Hog
This is very much not true. It's close to the opposite. It is true that an object must not be mutated in a way that affects its equality definition while it is being used as a key, but this does not make immutability a requirement. And, since a mutable object with an equals override could be stored in a hash-based structure for a while, removed and mutated, and then put in another again, the implementation of GetHashCode() must mutate with it. Identity-based hashcodes don't mutate because identity by definition doesn't mutate, but non-identity based must mutate.Galipot
@JonHanna: Wrong. The hashcode must not change while the object is in a dictionary, or the dictionary won't be able to find it (since it put it in the bucket for the original hashcode). IIRC, this is stated somewhere in MSDN. You're right that proper behavior for GetHashCode() is to follow Equals().Elsaelsbeth
Yes, but the hashcode must not change, because the object itself must not change during this period. If we (A) put the object in a dictionary, (B) remove it, (C) change it in a way reflected by a different hashcode, (D) put it back in a dictionary, then the hashcode must change at (C). Since GetHashCode() has no information on whether the object is stored in such a structure it must always assume it isn't (if it is, that's the other guy's bug) and change it.Galipot
@JonHanna your last comment is to the point and very clear! Hope people realise that.Pomegranate
G
10

Hashcodes must depend upon the definition of equality being used so that if A == B then A.GetHashCode() == B.GetHashCode() (but not necessarily the inverse; A.GetHashCode() == B.GetHashCode() does not entail A == B).

By default, the equality definition of a value type is based on its value, and of a reference type is based on it's identity (that is, by default an instance of a reference type is only equal to itself), hence the default hashcode for a value type is such that it depends on the values of the fields it contains* and for reference types it depends on the identity. Indeed, since we ideally want the hashcodes for non-equal objects to be different particularly in the low-order bits (most likely to affect the value of a re-hashing), we generally want two equivalent but non-equal objects to have different hashes.

Since an object will remain equal to itself, it should also be clear that this default implementation of GetHashCode() will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).

Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example "ABC" == "AB" + "C". Though there are two different instances of string compared, they are considered equal. In this case GetHashCode() must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).

While it is more common to do this with types that also are immutable, for a variety of reasons, GetHashCode() does not depend upon immutability. Rather, GetHashCode() must remain consistent in the face of mutability - change a value that we use in determining the hash, and the hash must change accordingly. Note though, that this is a problem if we are using this mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored, without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value - e.g. if we sort a list and then mutate one of the items in the list, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather it means that we must not mutate an object that is in such a structure, and making it immutable is a clear way to guarantee this.

Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).

Hence immutability of key values is one of those cases where something is possible, but generally a idea. To the person defining the hashcode algorithm though, it's not for them to assume any such case will always be a bad idea (they don't even know the mutation happened while the object was stored in such a structure); it's for them to implement a hashcode defined on the current state of the object, whether calling it in a given point is good or not. Hence for example, a hashcode should not be memoised on a mutable object unless the memoisation is cleared on every mutate. (It's generally a waste to memoise hashes anyway, as structures that hit the same objects hashcode repeatedly will have their own memoisation of it).

Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:

ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for(int i = 0; i != 10; ++i)
{
  a.Add(i);
  b.Add(i);
}
return a == b;//returns false

Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.

What's more, it's much easier to redefine equality from identity-based to value-based, than from value-based to identity-based. Finally, there are more than one value-based definitions of equality for many objects (classic case being the different views on what makes a string equal), so there isn't even a one-and-only definition that works. For example:

ArrayList c = new ArrayList();
for(short i = 0; i != 10; ++i)
{
  c.Add(i);
}

If we considered a == b above, should we consider a == c aslo? The answer depends on just what we care about in the definition of equality we are using, so the framework could't know what the right answer is for all cases, since all cases don't agree.

Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and over-ride equality:

public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
  /*.. most methods left out ..*/
  public Equals(ValueEqualList other)//optional but a good idea almost always when we redefine equality
  {
    if(other == null)
      return false;
    if(ReferenceEquals(this, other))//identity still entails equality, so this is a good shortcut
      return true;
    if(Count != other.Count)
      return false;
    for(int i = 0; i != Count; ++i)
      if(this[i] != other[i])
        return false;
    return true;
  }
  public override bool Equals(object other)
  {
    return Equals(other as ValueEqualList);
  }
  public override int GetHashCode()
  {
    int res = 0x2D2816FE;
    foreach(var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
  }
}

This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:

public class ArrayListEqComp : IEqualityComparer<ArrayList>
{//we might also implement the non-generic IEqualityComparer, omitted for brevity
  public bool Equals(ArrayList x, ArrayList y)
  {
    if(ReferenceEquals(x, y))
      return true;
    if(x == null || y == null || x.Count != y.Count)
      return false;
    for(int i = 0; i != x.Count; ++i)
      if(x[i] != y[i])
        return false;
    return true;
  }
  public int GetHashCode(ArrayList obj)
  {
    int res = 0x2D2816FE;
    foreach(var item in obj)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
  }
}

In summary:

  1. The default equality definition of a reference type is dependant upon identity alone.
  2. Most of the time, we want that.
  3. When the person defining the class decides that this isn't what is wanted, they can override this behaviour.
  4. When the person using the class wants a different definition of equality again, they can use IEqualityComparer<T> and IEqualityComparer so their that dictionaries, hashmaps, hashsets, etc. use their concept of equality.
  5. It's disastrous to mutate an object while it is the key to a hash-based structure. Immutability can be used of ensure this doesn't happen, but is not compulsory, nor always desirable.

All in all, the framework gives us nice defaults and detailed override possibilities.

*There is a bug in the case of a decimal within a struct, because there is a short-cut used in some cases with stucts when it is safe and not othertimes, but while a struct containing a decimal is one case when the short-cut is not safe, it is incorrectly identified as a case where it is safe.

Galipot answered 14/11, 2011 at 11:24 Comment(2)
Your second point is valid in case of collections, but otherwise for simple pocos, I never wanted a reference equality!Pomegranate
@nawfal, I've wanted it plenty of times for non-collections.Galipot
K
8

Yes, you are wrong. In both Java and C#, being equal implies having the same hash-code, but the converse is not (necessarily) true.

See GetHashCode for more information.

Kinross answered 25/5, 2010 at 18:38 Comment(0)
C
3

It is not possible for a hashcode to be unique across all variations of most non-trivial classes. In C# the concept of List equality is not the same as in Java (see here), so the hash code implementation is also not the same - it mirrors the C# List equality.

Countershading answered 25/5, 2010 at 18:32 Comment(0)
I
2

You're only partly wrong. You're definitely wrong when you think that equal hashcodes means equal objects, but equal objects must have equal hashcodes, which means that if the hashcodes differ, so do the objects.

Isagogics answered 25/5, 2010 at 18:52 Comment(0)
L
2

The core reasons are performance and human nature - people tend to think about hashes as something fast but it normally requires traversing all elements of an object at least once.

Example: If you use a string as a key in a hash table every query has complexity O(|s|) - use 2x longer strings and it will cost you at least twice as much. Imagine that it was a full blown tree (just a list of lists) - oops :-)

If full, deep hash calculation was a standard operation on a collection, enormous percentage of progammers would just use it unwittingly and then blame the framework and the virtual machine for being slow. For something as expensive as full traversal it is crucial that a programmer has to be aware of the complexity. The only was to achieve that is to make sure that you have to write your own. It's a good deterrent as well :-)

Another reason is updating tactics. Calculating and updating a hash on the fly vs. doing the full calculation every time requires a judgement call depending on concrete case in hand.

Immutabilty is just an academic cop out - people do hashes as a way of detecting a change faster (file hashes for example) and also use hashes for complex structures which change all the time. Hash has many more uses beyong the 101 basics. The key is again that what to use for a hash of a complex object has to be a judgement call on a case by case basis.

Using object's address (actually a handle so it doesn't change after GC) as a hash is actually the case where the hash value remains the same for arbitrary mutable object :-) The reason C# does it is that it's cheap and again nudges people to calculate their own.

Languorous answered 31/7, 2010 at 0:50 Comment(0)
N
-1

Why is too philosophical. Create helper method (may be extension method) and calculate hashcode as you like. May be XOR elements' hashcodes

Nullipore answered 25/5, 2010 at 18:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.