How should I go about implementing Object.GetHashCode() for complex equality?

G

2

10

Basically, I have the following so far:

class Foo {
    public override bool Equals(object obj)
    {
        Foo d = obj as Foo ;
        if (d == null)
            return false;

        return this.Equals(d);
    }

    #region IEquatable<Foo> Members

    public bool Equals(Foo other)
    {
        if (this.Guid != String.Empty && this.Guid == other.Guid)
            return true;
        else if (this.Guid != String.Empty || other.Guid != String.Empty)
            return false;

        if (this.Title == other.Title &&
            this.PublishDate == other.PublishDate &&
            this.Description == other.Description)
            return true;

        return false;
    }
}

So, the problem is this: I have a non-required field Guid, which is a unique identifier. If this isn't set, then I need to try to determine equality based on less accurate metrics as an attempt at determining if two objects are equal. This works fine, but it make GetHashCode() messy... How should I go about it? A naive implementation would be something like:

public override int GetHashCode() {
    if (this.Guid != String.Empty)
        return this.Guid.GetHashCode();

    int hash = 37;
    hash = hash * 23 + this.Title.GetHashCode();
    hash = hash * 23 + this.PublishDate.GetHashCode();
    hash = hash * 23 + this.Description.GetHashCode();
    return hash;
}

But what are the chances of the two types of hash colliding? Certainly, I wouldn't expect it to be 1 in 2 ** 32. Is this a bad idea, and if so, how should I be doing it?

Gluttonous answered 2/7, 2009 at 1:9 Comment(3)

It is more important that your hash algorithm agrees with your equality algorithm than the distribution be uniform. Remember, the purpose of the hash is solely to get a decent distribution in a hash table; as long as you're not massively skewed to one particular bucket, odds are good you'll be fine. If you're concerned, pick a reasonable scenario that the consumer of your object is likely to encounter -- say, putting a few hundred of them in a dictionary, if that's reasonable -- and do some perf testing to see if you get acceptable results. – Prana 2/7, 2009 at 2:15

The most I've ever seen in actual use was ~200, but typical use is <30, so you're probably right. – Gluttonous 2/7, 2009 at 3:2

Heck, with under 30 items, a linear search in a linked list is probably reasonably performant. You could return a hash code of zero always, have 100% chance of collision, and still get acceptable performance. The point of having a good distribution of hash codes is to make performance scale when the dictionary size gets big. You can have a lousy distribution and still get good results if you're only going to put a tiny number of items in the table. – Prana 2/7, 2009 at 18:20

W

5

I don't think there is a problem with the approach you have chosen to use. Worrying 'too much' about hash collisions is almost always an indication of over-thinking the problem; as long as the hash is highly likely to be different you should be fine.

Ultimately you may even want to consider leaving out the Description from your hash anyway if it is reasonable to expect that most of the time objects can be distinguished based on their title and publication date (books?).

You could even consider disregarding the GUID in your hash function altogether, and only use it in the Equals implementation to disambiguate the unlikely(?) case of hash clashes.

Wyn answered 2/7, 2009 at 2:10 Comment(4)

Altho, obviously, the GUID if present, is likely to hash a lot quicker than an arbitrary title string... so it could be a feasible performance optimisation. – Wyn 2/7, 2009 at 2:11

Description needs to be included in equality (and hence in the hash code) – Gluttonous 2/7, 2009 at 2:31

Oh, and for the record, RSS items. – Gluttonous 2/7, 2009 at 2:38

Note that it is perfectly valid for 'Description' to be in 'Equals', but not in 'GetHashCode' ... Your hashing function is allowed to be weaker than the equality test, as long as the hashes of equal items can never be different – Wyn 2/7, 2009 at 5:6

G

10

A very easy hash code method for custom classes is to bitwise XOR each of the fields' hash codes together. It can be as simple as this:

int hash = 0;
hash ^= this.Title.GetHashCode();
hash ^= this.PublishDate.GetHashCode();
hash ^= this.Description.GetHashCode();
return hash;

From the link above:

XOR has the following nice properties:

It does not depend on order of computation.

It does not “waste” bits. If you change even one bit in one of the components, the final value will change.

It is quick, a single cycle on even the most primitive computer.

It preserves uniform distribution. If the two pieces you combine are uniformly distributed so will the combination be. In other words, it does not tend to collapse the range of the digest into a narrower band.

XOR doesn't work well if you expect to have duplicate values in your fields as duplicate values will cancel each other out when XORed. Since you're hashing together three unrelated fields that should not be a problem in this case.

Granulation answered 2/7, 2009 at 1:53 Comment(1)

XOR not depending on the order of computation is a two-edged sword... if you have objects with multiple fields of the same type (for example, two dates), then when these are swapped around the objects will 'look the same' to the hash. – Wyn 2/7, 2009 at 2:4

W

5