On the importance of GetHashCode
Others have already commented on the fact that any custom IEqualityComparer<T>
implementation should really include a GetHashCode
method; but nobody's bothered to explain why in any detail.
Here's why. Your question specifically mentions the LINQ extension methods; nearly all of these rely on hash codes to work properly, because they utilize hash tables internally for efficiency.
Take Distinct
, for example. Consider the implications of this extension method if all it utilized were an Equals
method. How do you determine whether an item's already been scanned in a sequence if you only have Equals
? You enumerate over the entire collection of values you've already looked at and check for a match. This would result in Distinct
using a worst-case O(N2) algorithm instead of an O(N) one!
Fortunately, this isn't the case. Distinct
doesn't just use Equals
; it uses GetHashCode
as well. In fact, it absolutely does not work properly without an IEqualityComparer<T>
that supplies a proper GetHashCode
. Below is a contrived example illustrating this.
Say I have the following type:
class Value
{
public string Name { get; private set; }
public int Number { get; private set; }
public Value(string name, int number)
{
Name = name;
Number = number;
}
public override string ToString()
{
return string.Format("{0}: {1}", Name, Number);
}
}
Now say I have a List<Value>
and I want to find all of the elements with a distinct name. This is a perfect use case for Distinct
using a custom equality comparer. So let's use the Comparer<T>
class from Aku's answer:
var comparer = new Comparer<Value>((x, y) => x.Name == y.Name);
Now, if we have a bunch of Value
elements with the same Name
property, they should all collapse into one value returned by Distinct
, right? Let's see...
var values = new List<Value>();
var random = new Random();
for (int i = 0; i < 10; ++i)
{
values.Add("x", random.Next());
}
var distinct = values.Distinct(comparer);
foreach (Value x in distinct)
{
Console.WriteLine(x);
}
Output:
x: 1346013431
x: 1388845717
x: 1576754134
x: 1104067189
x: 1144789201
x: 1862076501
x: 1573781440
x: 646797592
x: 655632802
x: 1206819377
Hmm, that didn't work, did it?
What about GroupBy
? Let's try that:
var grouped = values.GroupBy(x => x, comparer);
foreach (IGrouping<Value> g in grouped)
{
Console.WriteLine("[KEY: '{0}']", g);
foreach (Value x in g)
{
Console.WriteLine(x);
}
}
Output:
[KEY = 'x: 1346013431']
x: 1346013431
[KEY = 'x: 1388845717']
x: 1388845717
[KEY = 'x: 1576754134']
x: 1576754134
[KEY = 'x: 1104067189']
x: 1104067189
[KEY = 'x: 1144789201']
x: 1144789201
[KEY = 'x: 1862076501']
x: 1862076501
[KEY = 'x: 1573781440']
x: 1573781440
[KEY = 'x: 646797592']
x: 646797592
[KEY = 'x: 655632802']
x: 655632802
[KEY = 'x: 1206819377']
x: 1206819377
Again: didn't work.
If you think about it, it would make sense for Distinct
to use a HashSet<T>
(or equivalent) internally, and for GroupBy
to use something like a Dictionary<TKey, List<T>>
internally. Could this explain why these methods don't work? Let's try this:
var uniqueValues = new HashSet<Value>(values, comparer);
foreach (Value x in uniqueValues)
{
Console.WriteLine(x);
}
Output:
x: 1346013431
x: 1388845717
x: 1576754134
x: 1104067189
x: 1144789201
x: 1862076501
x: 1573781440
x: 646797592
x: 655632802
x: 1206819377
Yeah... starting to make sense?
Hopefully from these examples it's clear why including an appropriate GetHashCode
in any IEqualityComparer<T>
implementation is so important.
Original answer
Expanding on orip's answer:
There are a couple of improvements that can be made here.
- First, I'd take a
Func<T, TKey>
instead of Func<T, object>
; this will prevent boxing of value type keys in the actual keyExtractor
itself.
- Second, I'd actually add a
where TKey : IEquatable<TKey>
constraint; this will prevent boxing in the Equals
call (object.Equals
takes an object
parameter; you need an IEquatable<TKey>
implementation to take a TKey
parameter without boxing it). Clearly this may pose too severe a restriction, so you could make a base class without the constraint and a derived class with it.
Here's what the resulting code might look like:
public class KeyEqualityComparer<T, TKey> : IEqualityComparer<T>
{
protected readonly Func<T, TKey> keyExtractor;
public KeyEqualityComparer(Func<T, TKey> keyExtractor)
{
this.keyExtractor = keyExtractor;
}
public virtual bool Equals(T x, T y)
{
return this.keyExtractor(x).Equals(this.keyExtractor(y));
}
public int GetHashCode(T obj)
{
return this.keyExtractor(obj).GetHashCode();
}
}
public class StrictKeyEqualityComparer<T, TKey> : KeyEqualityComparer<T, TKey>
where TKey : IEquatable<TKey>
{
public StrictKeyEqualityComparer(Func<T, TKey> keyExtractor)
: base(keyExtractor)
{ }
public override bool Equals(T x, T y)
{
// This will use the overload that accepts a TKey parameter
// instead of an object parameter.
return this.keyExtractor(x).Equals(this.keyExtractor(y));
}
}
IEqualityComparer<T>
that leavesGetHashCode
out is just straight-up broken. – Aurlie