Good GetHashCode() override for List of Foo objects respecting the order
Asked Answered
T

4

46

EnumerableObject : IEnumerable<Foo>

wraps a List<Foo>

If EnumerableObject a.SequenceEquals( EnumerableObject b), then they are equal.

Therefore, a GetHashCode must be implemented. The problem is XORing each element in the list will return the same hash code for any list with all and only the same elements, regardless of order. This is Okay in terms of it working, but will result in many collisions, which will slow down retrieval, etc.

What is a good, fast GetHashCode method for lists of objects that is order dependent?

Torch answered 11/11, 2011 at 13:49 Comment(5)
Are you sure you want to re-define Equality for a List? It's very rarely a good idea.Orlan
I didn't find where the documentation says that GetHashCode is to be implemented if SequenceEqual is implemented. Are you sure this is needed at all?Irrawaddy
@Irrawaddy It doesn't. I read the question as saying they've defined equality as SequenceEquals entails equality, and want to match that in their hashcode.Vichyssoise
My previous comment assumes that the list and the items in the list don't change after the list is created.Surefire
consider List<T> wouldnt this produce the same value for two lists which just differ by the order of the elements ?Lignite
G
79

I'd do it the same way I normally combine hash codes - with an addition and a multiplication:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 19;
        foreach (var foo in foos)
        {
            hash = hash * 31 + foo.GetHashCode();
        }
        return hash;
    }
}

(Note that you shouldn't add anything to the list after this has been used for the key in a hash table of any description, as the hash will change. This also assumes that there are no null entries - if there could be, you need to take account of that.)

Griddle answered 11/11, 2011 at 13:54 Comment(19)
Strange enough, List<T>'s implementation of GetHashCode is just inherited from object.Irrawaddy
Because List<T> equality is not sequence equality.Torch
@MK_Dev: Well, they're both primes, which generally helps. I'll confess to not understanding the maths behind why this hash generally works well. Multiplication by 31 is easily optimizable to a shift and a subtraction, which makes it attractive. There's a page explaining this kind of hash somewhere around, but I can't remember where, I'm afraid :(Griddle
@Jon You're probably referring to this page, which you mentioned in this answer.Letta
@DavidSchwartz: Hmm - I think I've seen a longer page about Bernstein, but that's definitely better than nothing :)Griddle
@JonSkeet a null check in the hash function will add to completeness :)Louanne
@nawfal: I'd rather not distract from the general approach, to be honest. Hopefully anyone with a collection including null references can figure it out...Griddle
@JonSkeet does that solution produce consistent result with List of string across multiple AppDomains? I have a WCF Server where multiple application from various platforms (XP, Vista, Win7 or different .Net Frameworks 3.5 and higher) can connect and in all those situations I need a consistent hashcode from a list of strings is that the case? If not how would I achieve that?Oneman
@RandRandom: You shouldn't be using GetHashCode for that - it's not meant to be consistent across different processes. You should use something like SHA-256 for that.Griddle
@JonSkeet can you share your idea in that question: #37881011Oneman
What about null vs empty collections? In this case, both will have hash codes of 19. should they be different? Why would you want that to be so?Contrasty
@MuhammadRehanSaeed: What do you mean by a "null collection"? You can't call GetHashCode on a null collection. If foos is null, the code in the answer will throw an exception. If it can be null, then the method should take that into account. Whether null and empty should be considered equal is a context-specific decision.Griddle
@JonSkeet I meant if foos could be null. Why would you differentiate null from an empty collection. Under what context is such a differentiation valid? A colleague is proposing to do so, and I have no answer. #56557628Contrasty
@MuhammadRehanSaeed: You should ask your colleague rather than us. But if both states are valid, it seems perfectly reasonable to differentiate between them. (Someone carrying an empty box isn't the same as someone not carrying a box at all...)Griddle
why uncheked?Requisite
@JobaDiniz: Because it's entirely normal for hash code operations to overflow, and you really don't want that to throw an exception, even if the rest of the application is in a checked context.Griddle
Sorry if this is a stupid question, but is it safe to assume that the values 19 and 31 are completely arbitrary?Pole
@MikeBruno: No, not completely arbitrary. In particular, multiplying by 31 can be optimized to a "left shift and subtract" - and I believe that both being prime is important for the hash distribution. I confess I don't know much about the maths though.Griddle
@JonSkeet Thanks, that makes sense! Yeah, I'm not so knowledgeable about the mathiplications eitherPole
V
13

Firstly, double-check that you need a hashcode at all. Are you going to be putting these lists into a hash-mapped structure (e.g. dictionary, hashset, etc)? If not, forget about it.

Now, assuming that you mean that EnumerableObject already overrides Equals(object) (and hopefully therefore also implements IEquatable<EnumerableObject>) for some reason, then this is indeed necessary. You want to balance speed versus bit distribution.

A good starting point is a mult+add or a shift+xor like:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    foreach(var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

(This assumes that you are using item.Equals() for your sequence equality comparison, if you're using an IEqualityComparer's equals you'll need to call into its hashcode).

From there we can optimise.

If null items are disallowed, remove the null-check (be careful, this will make the code throw if it ever does find a null).

If very large lists are common we need to reduce the number examined, while trying not to result in lots of collisions. Compare the following different implementations:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int max = Math.Min(Count, 16);
    for(int i = 0, i != max; ++i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int min = Math.Max(-1, Count - 16);
    for(int i = Count -1, i != min; --i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int step = Count / 16 + 1;
    for(int i = 0, i < Count; i += step)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

Each of these restrict the total number of items examined, which speeds execution but risks poorer quality hashes. Which (if any) is best depends on whether collections with the same start or the same end are more likely.

Changing the number 16 above adjusts the balance; smaller is faster but higher is better hash quality with a lower risk of hash collisions.

Edit: And now you can use my implementation of SpookyHash v. 2:

public override int GetHashCode()
{
  var hasher = new SpookyHash();//use methods with seeds if you need to prevent HashDos
  foreach(var item in this)
    hasher.Update(item.GetHashCode());//or relevant feeds of item, etc.
  return hasher.Final().GetHashCode();
}

This will create a much better distribution than mult+add or shift+xor, while also being particularly fast (especially in 64-bit processes as the algorithm is optimised for that, though it works well on 32-bit too).

Vichyssoise answered 11/11, 2011 at 16:1 Comment(8)
Can you explain where 0x2D2816FE came from? Google suggests this is .Net's hash code for an empty string, but I am not sure why that would be a good starting value. Jon Skeet's answer used 19 in a similar way, but I don't understand that either.Recaption
@ChrisNielsen it's arbitary, and yes I just borrowed from that use. The main thing is that it shouldn't be zero so a to have a different code for an empty list and for null (most uses will assign zero to null objects) as those are two particularly common cases.Vichyssoise
@JonHanna you would want a unchecked scope for GetHashCode there.Louanne
If the list will never be mutated, the time to hash everything once will be a fixed multiple of the time spent building the list. I would favor hashing every item but caching the hash value. If the list will support limited forms of mutation (e.g. Add only), it may be possible to maintain an incrementally-updated hash value as the list grows. Also, in many cases I'd suggest keeping more than 32 bits of state within the hashing loop, and rendering the result to 32 bits at the end. That will help reduce the danger of "systematic" collisions.Lineman
@Lineman almost all uses of GetHashCode() will memoise the hash obtained, making the value of caching it within the method limited.Vichyssoise
@JonHanna: It's true that things like Dictionary will memoise the hash values of objects that are added; since adding an item requires calling GetHashCode() to figure out where it goes, storing the value poses minimal extra overhead. On the other hand, if an object gets stored in or checked against more than one dictionary, or if a dictionary is asked repeatedly for an item it does not contain, each such action will result in another GetHashCode call. Unless one knows the list will be small and repeated calls to nested items' GetHashCode() will be fast, caching would seem wise.Lineman
@Lineman that's assuming it is in fact the same instance that is hashed each time, when if anything the point of overriding is that it might not be, and often won't. I'd also consider that if calling GetHashCode() is particularly expensive, the problem is elsewhere, which funnily enough has been a scratch I've been itching of late (I wouldn't suggest the answer above any more, but something based on bitbucket.org/JonHanna/spookilysharp or at least on what it'll soon be after the initial ironing-out-the-wrinkles period.Vichyssoise
@JonHanna: The point of overriding is that it won't always be the same instance. The point of caching is that GetHashCode() may sometimes end up getting called multiple times on the same instance. That doesn't have to happen very often for caching to be a "win", and in some cases it can make a really huge difference (if the items cache their hash codes, it may not matter so much if the collection does, but if caching isn't considered worthwhile for the collection, why should one expect that it was considered worthwhile for the items)?Lineman
F
9

The .GetHashCode() method usually just returns a hash based on the object reference (pointer address). This is because calculating the hash code of every item in an enumerable list can be very time intensive. Instead of overwriting the existing behaviour, I prefer to use an extension method and use it only where the hash code needs to be deterministically determined:

public static class EnumerableExtensions
{
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        return list.Aggregate(seedValue, (current, item) => (current * primeNumber) + (Equals(item, default(TItem)) ? 0 : item.GetHashCode()));
    }
}
Febrific answered 10/1, 2018 at 16:55 Comment(0)
C
2

My extension method with null handling based on Jon Skeet answer:

#region UTILS
/// <summary>
/// Utils
/// </summary>
internal static class UTILS
{
    #region GetHashCodeByItems
    /// <summary>
    /// Hash code depending on the content and order of the elements of the collection
    /// </summary>
    /// <param name="lst">Collection</param>
    /// <typeparam name="T">The type of items in the collection</typeparam>
    /// <returns>Hash code</returns>
    internal static int GetHashCodeByItems<T>(this IEnumerable<T> lst)
    {
        unchecked
        {
            int hash = 19;
            foreach (T item in lst)
            {
                hash = hash * 31 + (item != null ? item.GetHashCode() : 1);
            }
            return hash;
        }
    }
    #endregion
}
#endregion
Crumble answered 5/7, 2022 at 10:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.