Using GetHashCode to test equality in Equals override
Asked Answered
N

8

10

Is it ok to call GetHashCode as a method to test equality from inside the Equals override?

For example, is this code acceptable?

public class Class1
{
  public string A
  {
    get;
    set;
  }

  public string B
  {
    get;
    set;
  }

  public override bool Equals(object obj)
  {
    Class1 other = obj as Class1;
    return other != null && other.GetHashCode() == this.GetHashCode();
  }

  public override int GetHashCode()
  {
    int result = 0;
    result = (result ^ 397) ^ (A == null ? 0 : A.GetHashCode());
    result = (result ^ 397) ^ (B == null ? 0 : B.GetHashCode());
    return result;
  }
}
Nikaniki answered 22/11, 2010 at 18:50 Comment(5)
As a developer, you owe it to yourself to fully understand what hashes are used for and how they relate to hash tables (as implemented by Dictionary and HashSet, among others). The wikipedia article for hashtable is a good start: en.wikipedia.org/wiki/Hash_tableYonyona
@Yonyona - that is exactly what this question has explained to me in more detail than I originally understood or could call to mind.Nikaniki
Not only is the equality check wrong, the code is strange. Why are you multiplying zero by 397? I can tell you right now, the answer is going to be zero, so why make the machine compute it? Why xor zero with a value; that's an identity operation.Anglophobe
Yeah, that was dumb. I corrected it, hopefully it's correct now.Nikaniki
Should see this too why-use-gethashcode-over-equalsHilaria
A
15

The others are right; your equality operation is broken. To illustrate:

public static void Main()
{
    var c1 = new Class1() { A = "apahaa", B = null };
    var c2 = new Class1() { A = "abacaz", B = null };
    Console.WriteLine(c1.Equals(c2));
}

I imagine you want the output of that program to be "false" but with your definition of equality it is "true" on some implementations of the CLR.

Remember, there are only about four billion possible hash codes. There are way more than four billion possible six letter strings, and therefore at least two of them have the same hash code. I've shown you two; there are infinitely many more.

In general you can expect that if there are n possible hash codes then the odds of getting a collision rise dramatically once you have about the square root of n elements in play. This is the so-called "birthday paradox". For my article on why you shouldn't rely upon hash codes for equality, click here.

Anglophobe answered 22/11, 2010 at 21:29 Comment(0)
P
7

No, it is not ok, because it's not

equality <=> hashcode equality.

It's just

equality => hashcode equality.

or in the other direction:

hashcode inequality => inequality.

Quoting http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx:

If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

Political answered 22/11, 2010 at 18:55 Comment(0)
D
2

I would say, unless you want for Equals to basically mean "has the same hash code as" for your type, then no, because two strings may be different but share the same hash code. The probability may be small, but it isn't zero.

Denham answered 22/11, 2010 at 18:56 Comment(0)
C
1

No this is not an acceptable way to test for equality. It is very possible for 2 non-equal values to have the same hash code. This would cause your implementation of Equals to return true when it should return false

Crudden answered 22/11, 2010 at 18:55 Comment(0)
T
1

You can call GetHashCode to determine if the items are not equal, but if two objects return the same hash code, that doesn't mean they are equal. Two items can have the same hash code but not be equal.

If it's expensive to compare two items, then you can compare the hash codes. If they are unequal, then you can bail. Otherwise (the hash codes are equal), you have to do the full comparison.

For example:

public override bool Equals(object obj)
  {
    Class1 other = obj as Class1;
    if (other == null || other.GetHashCode() != this.GetHashCode())
        return false;
    // the hash codes are the same so you have to do a full object compare.
  }
Teeny answered 22/11, 2010 at 18:58 Comment(2)
With many objects, this will tend to be slower than using the built in comparison. If the objects are equal, you end up doing a full comparison and a GetHashCode. If they are not equal, you end up doing a call to GetHashCode, which probably reads in the entire object. Equals, on the other hand, probably only reads enough of the object to determine that the objects are not equal. That being said, in the case of complicated objects which are slow to compare but have a fast GetHashCode method (e.g. because it is computed in advance), this optimization will help a lot.Chameleon
@Brian, I agree that it's rarely useful for the reasons you say. I also don't think precomputed GetHashCode is often useful (as its so rarely used, esp. if you are using an IEqualityComparer implementation rather than the default GetHashCode). However, see my answer for a case where the fact that the hashcode is stored anyway (for other reasons) can make Jim's approach make sense.Gleanings
M
1

You cannot say that just because the hash codes are equal then the objects must be equal.

The only time you would call GetHashCode inside of Equals was if it was much cheaper to compute a hash value for an object (say, because you cache it) than to check for equality. In that case you could say if (this.GetHashCode() != other.GetHashCode()) return false; so that you could quickly verify that the objects were not equal.

So when would you ever do this? I wrote some code that takes screenshots at periodic intervals and tries to find how long it's been since the screen changed. Since my screenshots are 8MB and have relatively few pixels that change within the screenshot interval it's fairly expensive to search a list of them to find which ones are the same. A hash value is small and only has to be computed once per screenshot, making it easy to eliminate known non-equal ones. In fact, in my application I decided that having identical hashes was close enough to being equal that I didn't even bother to implement the Equals overload, causing the C# compiler to warn me that I was overloading GetHashCode without overloading Equals.

Midinette answered 22/11, 2010 at 19:10 Comment(0)
G
0

There is one case where using hashcodes as a short-cut on equality comparisons makes sense.

Consider the case where you are building a hashtable or hashset. In fact, let's just consider hashsets (hashtables extend that by also holding a value, but that isn't relevant).

There are various different approaches one can take, but in all of them you have a small number of slots the hashed values can be placed in, and we take either the open or closed approach (which just for fun, some people use the opposite jargon for to others); if we collide on the same slot for two different objects we can either store them in the same slot (but having a linked list or such for where the objects are actually stored) or by re-probing to pick a different slot (there are various strategies for this).

Now, with either approach, we're moving away from the O(1) complexity we want with a hashtable, and towards an O(n) complexity. The risk of this is inversely proportional to the number of slots available, so after a certain size we resize the hashtable (even if everything was ideal, we'd eventually have to do this if the number of items stored were greater than the number of slots).

Re-inserting the items on a resize will obviously depend on the hash codes. Because of this, while it rarely makes sense to memoise GetHashCode() in an object (it just isn't called often enough on most objects), it certainly does make sense to memoise it within the hash table itself (or perhaps, to memoise a produced result, such as if you re-hashed with a Wang/Jenkins hash to reduce the damage caused by bad GetHashCode() implementations).

Now, when we come to insert our logic is going to be something like:

  1. Get hash code for object.
  2. Get slot for object.
  3. If slot is empty, place object in it and return.
  4. If slot contains equal object, we're done for a hashset and have the position to replace the value for a hashtable. Do this, and return.
  5. Try next slot according to collision strategy, and return to item 3 (perhaps resizing if we loop this too often).

So, in this case we have to get the hash code before we compare for equality. We also have the hash code for existing objects already pre-computed to allow for resize. The combination of these two facts means that it makes sense to implement our comparison for item 4 as:

private bool IsMatch(KeyType newItem, KeyType storedItem, int newHash, int oldHash)
{
  return ReferenceEquals(newItem, storedItem) // fast, false negatives, no false positives (only applicable to reference types)
    ||
    (
      newHash == oldHash // fast, false positives, no fast negatives
      &&
      _cmp.Equals(newItem, storedItem) // slow for some types, but always correct result.
    );
}

Obviously, the advantage of this depends on the complexity of _cmp.Equals. If our key type was int then this would be a total waste. If our key type where string and we were using case-insensitive Unicode-normalised equality comparisons (so it can't even shortcut on length) then the saving could well be worth it.

Generally memoising hash codes doesn't make sense because they aren't used often enough to be a performance win, but storing them in the hashset or hashtable itself can make sense.

Gleanings answered 29/11, 2010 at 18:35 Comment(0)
H
0
  1. It's wrong implementation, as others have stated why.

  2. You should short circuit the equality check using GetHashCode like:

    if (other.GetHashCode() != this.GetHashCode()
        return false;
    

    in the Equals method only if you're certain the ensuing Equals implementation is much more expensive than GetHashCode which is not vast majority of cases.

  3. In this one implementation you have shown (which is 99% of the cases) its not only broken, its also much slower. And the reason? Computing the hash of your properties would almost certainly be slower than comparing them, so you're not even gaining in performance terms. The advantage of implementing a proper GetHashCode is when your class can be the key type for hash tables where hash is computed only once (and that value is used for comparison). In your case GetHashCode will be called multiple times if it's in a collection. Even though GetHashCode itself should be fast, it's not mostly faster than equivalent Equals.

    To benchmark, run your Equals (a proper implementation, taking out the current hash based implementation) and GetHashCode here

    var watch = Stopwatch.StartNew();
    for (int i = 0; i < 100000; i++) 
    {
        action(); //Equals and GetHashCode called here to test for performance.
    }
    watch.Stop();
    Console.WriteLine(watch.Elapsed.TotalMilliseconds);
    
Hilaria answered 15/12, 2013 at 19:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.