Why is ValueType.GetHashCode() implemented like it is?

From ValueType.cs

**Action: Our algorithm for returning the hashcode is a little bit complex. We look 
**        for the first non-static field and get it's hashcode.  If the type has no 
**        non-static fields, we return the hashcode of the type. We can't take the
**        hashcode of a static member because if that member is of the same type as 
**        the original type, we'll end up in an infinite loop.

I got bitten by this today when I was using a KeyValuePair as a key in a Dictionary (it stored an XML attribute name (enum) and its value (string)), and expected its hash code to be computed from all of its fields, but according to the implementation it only considered the Key part.

Example (copy/pasted from LINQPad):

void Main()
{
    var kvp1 = new KeyValuePair<string, string>("foo", "bar");
    var kvp2 = new KeyValuePair<string, string>("foo", "baz");

    // true
    (kvp1.GetHashCode() == kvp2.GetHashCode()).Dump();
}

The first non-static field, I guess, means the first field in declaration order, which could also cause trouble if you reorder the fields in the source for whatever reason, believing it doesn't change the code semantically.

Oxazine answered 1/10, 2010 at 17:28 Comment(0)

UPDATE: This answer was (in part) the basis of a blog article I wrote which goes into more detail about the design characteristics of GetHashCode. Thanks for the interesting question!


I didn't implement it and I haven't talked to the people who did. But I can point out a few things.

(Before I go on, note that here I am specifically talking about hash codes for the purposes of balancing hash tables where the contents of the table are chosen by non-hostile users. The problems of hash codes for digital signing, redundancy checking, or ensuring good performance of a hash table when some of the users are mounting denial-of-service attacks against the table provider are beyond the scope of this discussion.)

First, as Jon correctly notes, the given algorithm does implement the required contract of GetHashCode. It might be sub-optimal for your purposes, but it is legal. All that is required is that things that compare equal have equal hash codes.

So what are the "nice to haves" in addition to that contract? A good hash code implementation should be:

1) Fast. Very fast! Remember, the whole point of the hash code in the first place is to rapidly find a relatively empty slot in a hash table. If the O(1) computation of the hash code is in practice slower than the O(n) time taken to do the lookup naively then the hash code solution is a net loss.

2) Well distributed across the space of 32 bit integers for the given distribution of inputs. The worse the distribution across the ints, the more like a naive linear lookup the hash table is going to be.

So, how would you make a hash algorithm for arbitrary value types given those two conflicting goals? Any time you spend on a complex hash algorithm that guarantees good distribution is time poorly spent.

A common suggestion is "hash all of the fields and then XOR together the resulting hash codes". But that is begging the question; XORing two 32 bit ints only gives good distribution when the inputs themselves are extremely well-distributed and not related to each other, and that is an unlikely scenario:

// (Updated example based on good comment!)
struct Control
{
    string name;
    int x;
    int y;
}

What is the likelihood that x and y are well-distributed over the entire range of 32 bit integers? Very low. Odds are much better that they are both small and close to each other, in which case XORing their hash codes together makes things worse, not better: XORing together integers that are close to each other zeros out most of the bits.
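
As a rough illustration (Int32.GetHashCode() simply returns the value itself, so the int hashes here are just the coordinates):

    // XORing "close" coordinates cancels the shared high bits and leaves
    // only a few low bits, so very different points collide easily.
    int x1 = 100, y1 = 103;
    int x2 = 200, y2 = 203;
    Console.WriteLine(x1 ^ y1);   // 3
    Console.WriteLine(x2 ^ y2);   // 3 - a different point, same combined value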

Furthermore, this is O(n) in the number of fields! A value type with a lot of small fields would take a comparatively long time to compute the hash code.

Basically the situation we're in here is that the user didn't provide a hash code implementation themselves; either they don't care, or they don't expect this type to ever be used as a key in a hash table. Given that you have no semantic information whatsoever about the type, what's the best thing to do? The best thing to do is whatever is fast and gives good results most of the time.

Most of the time, two struct instances that differ will differ in most of their fields, not just one of their fields, so just picking one of them and hoping that it's the one that differs seems reasonable.

Most of the time, two struct instances that differ will have some redundancy in their fields, so combining the hash values of many fields together is likely to decrease, not increase, the entropy in the hash value, even as it consumes the time that the hash algorithm is designed to save.

Compare this with the design of anonymous types in C#. With anonymous types we do know that it is highly likely that the type is being used as a key to a table. We do know that it is highly likely that there will be redundancy across instances of anonymous types (because they are results of a cartesian product or other join). And therefore we do combine the hash codes of all of the fields into one hash code. If that gives you bad performance due to the excess number of hash codes being computed, you are free to use a custom nominal type rather than the anonymous type.
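
As a small sketch of that design (the data here is made up), grouping by an anonymous-type key relies on exactly that all-fields hash:

    using System;
    using System.Linq;

    var orders = new[]
    {
        new { Customer = "foo", Year = 2010 },
        new { Customer = "foo", Year = 2010 },
        new { Customer = "bar", Year = 2010 },
    };

    // Both "foo"/2010 entries hash and compare as the same composite key.
    foreach (var g in orders.GroupBy(o => new { o.Customer, o.Year }))
        Console.WriteLine(g.Key.Customer + "/" + g.Key.Year + ": " + g.Count());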

Reproachless answered 1/10, 2010 at 18:4 Comment(3)
Your Point example wasn't a good one. You'll actually get a 'good' hash out of that one. I left a post to describe why it is different.Fringe
Assuming you're not just doing a XOR, combining multiple correlated hashcodes will not generally decrease the entropy - and there's no particularly pressing reason to do a XOR.Paramagnet
@EamonNerbonne: The only advantage I can see for XOR is that in languages which have no means of requesting "unchecked" signed arithmetic, it allows the full range of output values without manual masking. Otherwise, it's in every way I can see inferior to +. Addition is just as fast as xor in languages which can support unchecked signed arithmetic, and performs adequately in the common cases where two fields are always equal to each other or differ by a constant.Saffren

The actual implementation of ValueType.GetHashCode() doesn't quite match the comment. It has two versions of the algorithm, fast and slow. It first checks whether the struct contains any members of a reference type and whether there is any padding between the fields. Padding is empty space in a structure value, created when the JIT compiler aligns the fields. There's padding in a struct that contains bool and int (3 bytes), but no padding when it contains int and int; they fit snugly together.
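
For example (a sketch, assuming the default sequential layout for these hypothetical structs):

    // bool occupies 1 byte, but the following int must be 4-byte aligned,
    // so the JIT inserts 3 bytes of padding; two ints need no padding at all.
    struct Padded   { public bool Flag; public int Number; }   // 1 + 3 (padding) + 4 bytes
    struct Unpadded { public int A;     public int B;      }   // 4 + 4 bytes, no gaps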

Without a reference and without padding, it can do the fast version, since every bit in the structure value is a bit that belongs to a field value. It simply XORs 4 bytes at a time. You'll get a 'good' hash code that considers all the members. Many simple structure types in the .NET framework behave this way, like Point and Size.
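
Roughly speaking (this is an illustration of the idea, not the actual CLR code), for a Point-shaped struct the fast path amounts to XORing the fields' raw 32-bit words together:

    struct PointLike { public int X; public int Y; }   // no references, no padding

    // Every 32-bit word of the value contributes, so both fields affect the hash.
    static int FastStyleHash(PointLike p)
    {
        return p.X ^ p.Y;
    }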

Failing that test, it does the slow version, the moral equivalent of reflection. That's what you get: your KeyValuePair<> contains references. And this one only checks the first candidate field, like the comment says. This is surely a perf optimization, avoiding burning too much time.

Yes, nasty detail and not that widely known. It is usually discovered when somebody notices that their collection code sucks mud.

One more excruciating detail: the fast version has a bug that bites when the structure contains a field of type decimal. The values 12m and 12.0m are logically equal, but they don't have the same bit pattern, so GetHashCode() will say that they are not equal. Ouch.
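
The mismatch is easy to see: Decimal stores a scale, so 12 and 12.0 are encoded with different bits even though they compare equal:

    decimal a = 12m, b = 12.0m;
    Console.WriteLine(a == b);                                 // True
    Console.WriteLine(string.Join(",", decimal.GetBits(a)));   // 12,0,0,0
    Console.WriteLine(string.Join(",", decimal.GetBits(b)));   // 120,0,0,65536  (scale of 1 in the flags word)

Decimal's own GetHashCode() normalizes before hashing, so equal decimals get equal hash codes on their own; it's a raw bytes-at-a-time XOR over the enclosing struct that does not.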

Fringe answered 1/10, 2010 at 19:43 Comment(6)
More generally, the bug you mention applies to any valuetype for which equality does not entail bitwise equality, of which decimal is just one example. Really, the optimised fast-track version should not be used where a struct member overrides equals or gethashcode. Probably just not using it whenever a member is a struct would be better (rather than spending more time seeing if one could do the fast version, than doing the fast version saves).Estimate
@JonHanna: I would argue that the so-called "fast-track" version is the way Object.Equals should work. If one has two immutable containers whose corresponding elements all report Equal, replacing references to the second with references to the first (or vice versa) should be a safe way to save memory and expedite future comparisons. Such substitution is safe for types where Equals only returns true for equivalent items, but unsafe for types where non-equivalent items may report themselves as "equal" to each other. The fact that 12m.Equals(12.0m) doesn't really mean it should.Saffren
@Hans: Given two instances of arbitrary type T, is there any way to compare them in the fashion that the "fast-track comparison" used to use? For example, suppose one has some immutable sequences of Decimal, and wishes to consolidate references to identical lists. An immutable sequence with the single item 12.0m should not in general be considered substitutable for one with the single item 12m. In the absence of a type-independent comparison method, what would be the best way to compare float, double, or Decimal values for equivalence?Saffren
just curious, can I find this info anywhere else? Google doesn't help me (or maybe my keywords are stupid)Ritualism
And why version 1 is faster? Is it because it doesn't call gethashcode for all it's fields, but uses their values instead? To me version 2 is not slow at all - just call gethashcode onceRitualism
@Hans What happens if the field has padding and no reference types? Jon Skeet's answer to this question suggests that KeyValuePair<ushort, uint> (I assume this is padded) considers no fields at all as it gives the same hash code for all instances!Heteroplasty

It should still obey the contract of GetHashCode even if the field order changes: equal values will have equal hash codes, within the lifetime of that process.

In particular:

  • Non-equal values don't have to have non-equal hash codes
  • Hash codes don't have to be consistent across processes (you can change an implementation, rebuild, and everything should still work - you shouldn't be persisting hash codes, basically)

Now I'm not saying that ValueType's implementation is a great idea - it will cause performance suckage in various ways... but I don't think it's actually broken.

Gleeman answered 1/10, 2010 at 17:32 Comment(0)

Well, there are pros and cons to any implementation of GetHashCode(). These are of course the things we weigh up when implementing our own, but in the case of ValueType.GetHashCode() there is a particular difficulty: the implementers don't have much information about what the concrete type will actually look like. Of course, this often happens to us when we are creating an abstract class, or one intended to be the base of classes that will add a lot more state, but in those cases we have an obvious solution of just using the default implementation of object.GetHashCode() unless a derived class cares to override it.

With ValueType.GetHashCode() they don't have this luxury, because the primary difference between a value type and a reference type is, despite the popularity of talking about implementation details of stack vs. heap, that for a value type equivalence relates to value while for a reference type equivalence relates to identity (even when an object defines a different form of equivalence by overriding Equals() and GetHashCode(), the concept of reference equality still exists and is still useful).

So, for the Equals() method the implementation is obvious: check that the two objects are the same type, and if they are, check that all fields are equal (actually there's an optimisation which does a bitwise comparison in some cases, but that's an optimisation on the same basic idea).

What to do for GetHashCode()? There simply is no perfect solution. One thing they could do is some sort of multiply-then-add or shift-then-XOR on every field. That would likely give a pretty good hash code, but could be expensive if there were a lot of fields (never mind that it's not recommended to have value types with lots of fields; the implementer has to consider that they still can, and indeed there might even be times when it makes sense, though I honestly can't imagine a time where it both makes sense and it also makes sense to hash it). If they knew some fields were rarely different between instances, they could ignore those fields and still have a pretty good hash code, while also being quite fast. Finally, they can ignore most fields, and hope that the one(s) they don't ignore vary in value most of the time. They went for the most extreme version of the latter.
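
A minimal sketch of that first option (the combiner below is hypothetical, not anything in the framework):

    // Hypothetical multiply-then-add combiner: fold every field's hash into one value.
    // Note that it is O(n) in the number of fields being combined.
    static int CombineHashes(params object[] fieldValues)
    {
        unchecked
        {
            int hash = 17;
            foreach (var f in fieldValues)
                hash = hash * 31 + (f == null ? 0 : f.GetHashCode());
            return hash;
        }
    }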

(What is done when there are no instance fields is a separate matter, and a pretty good choice was made: such value types compare equal to all other instances of the same type, and they have a hash code that matches that.)

So, it's an implementation that sucks if you are hashing lots of values where the first field is the same (or otherwise returns the same hash code), but other implementations would suck in other cases (Mono goes for XORing all the fields' hash codes together; better in your case, worse in others).

Changing field order doesn't matter, as a hash code is quite clearly stated as only remaining valid for the lifetime of a process, and as not being suitable for most cases where it could be persisted beyond that (it can be useful in some caching situations where it doesn't hurt if things aren't found correctly after a code change).

So, not great, but nothing would be perfect. It goes to show that one must always consider both sides of what "equality" means when using an object as a key. It's easily fixed in your case with:

using System;
using System.Collections;
using System.Collections.Generic;

public class KVPCmp<TKey, TValue> : IEqualityComparer<KeyValuePair<TKey, TValue>>, IEqualityComparer
{
    bool IEqualityComparer.Equals(object x, object y)
    {
        if (x == null)
            return y == null;
        if (y == null)
            return false;
        if (!(x is KeyValuePair<TKey, TValue>) || !(y is KeyValuePair<TKey, TValue>))
            throw new ArgumentException("Comparison of KeyValuePairs only.");
        return Equals((KeyValuePair<TKey, TValue>)x, (KeyValuePair<TKey, TValue>)y);
    }

    public bool Equals(KeyValuePair<TKey, TValue> x, KeyValuePair<TKey, TValue> y)
    {
        return x.Key.Equals(y.Key) && x.Value.Equals(y.Value);
    }

    public int GetHashCode(KeyValuePair<TKey, TValue> obj)
    {
        // Hash the Key (not the pair itself), rotate it so that ("a", "b") and
        // ("b", "a") don't collide, then mix in the Value's hash.
        int keyHash = obj.Key.GetHashCode();
        return ((keyHash << 16) | (int)((uint)keyHash >> 16)) ^ obj.Value.GetHashCode();
    }

    public int GetHashCode(object obj)
    {
        if (obj == null)
            return 0;
        if (!(obj is KeyValuePair<TKey, TValue>))
            throw new ArgumentException();
        return GetHashCode((KeyValuePair<TKey, TValue>)obj);
    }
}

Use this as your comparer when creating your dictionary, and all should be well (you only need the generic comparer methods really, but leaving the rest in does no harm and can be useful to have sometimes).
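
For example, assuming the string/string pair from the question:

    // Supplying the comparer makes the dictionary hash both Key and Value.
    var dict = new Dictionary<KeyValuePair<string, string>, string>(
        new KVPCmp<string, string>());

    dict[new KeyValuePair<string, string>("foo", "bar")] = "first";
    dict[new KeyValuePair<string, string>("foo", "baz")] = "second";   // distinct hash now, not forced into one bucket

    Console.WriteLine(dict.Count);   // 2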

Estimate answered 1/10, 2010 at 17:59 Comment(0)

Thank you all for the very, very informative answers. I knew there had to be some rationale behind that decision, but I wish it were documented better. I'm not able to use v4 of the framework, so there is no Tuple<>, and that was the main reason I decided to piggyback on the KeyValuePair struct. But I guess there's no cutting corners and I'll have to roll my own. Once more, thank you all.

Oxazine answered 2/10, 2010 at 14:15 Comment(0)
