Good GetHashCode() override for List of Foo objects respecting the order

Asked 11/11, 2011 at 13:49 Answered 5/7, 2022 at 10:38

EnumerableObject : IEnumerable<Foo>

wraps a List<Foo>

If EnumerableObject a.SequenceEquals( EnumerableObject b), then they are equal.

Therefore, a GetHashCode must be implemented. The problem is XORing each element in the list will return the same hash code for any list with all and only the same elements, regardless of order. This is Okay in terms of it working, but will result in many collisions, which will slow down retrieval, etc.

What is a good, fast GetHashCode method for lists of objects that is order dependent?

Torch answered 11/11, 2011 at 13:49 Comment(5)

Are you sure you want to re-define Equality for a List? It's very rarely a good idea. – Orlan 11/11, 2011 at 14:3

I didn't find where the documentation says that GetHashCode is to be implemented if SequenceEqual is implemented. Are you sure this is needed at all? – Irrawaddy 11/11, 2011 at 14:24

@Irrawaddy It doesn't. I read the question as saying they've defined equality as SequenceEquals entails equality, and want to match that in their hashcode. – Vichyssoise 11/11, 2011 at 16:7

My previous comment assumes that the list and the items in the list don't change after the list is created. – Surefire 20/6, 2014 at 2:36

consider List<T> wouldnt this produce the same value for two lists which just differ by the order of the elements ? – Lignite 14/7, 2018 at 8:51

I'd do it the same way I normally combine hash codes - with an addition and a multiplication:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 19;
        foreach (var foo in foos)
        {
            hash = hash * 31 + foo.GetHashCode();
        }
        return hash;
    }
}

(Note that you shouldn't add anything to the list after this has been used for the key in a hash table of any description, as the hash will change. This also assumes that there are no null entries - if there could be, you need to take account of that.)

Griddle answered 11/11, 2011 at 13:54 Comment(19)

Strange enough, List<T>'s implementation of GetHashCode is just inherited from object. – Irrawaddy 11/11, 2011 at 13:57

Because List<T> equality is not sequence equality. – Torch 11/11, 2011 at 14:20

@MK_Dev: Well, they're both primes, which generally helps. I'll confess to not understanding the maths behind why this hash generally works well. Multiplication by 31 is easily optimizable to a shift and a subtraction, which makes it attractive. There's a page explaining this kind of hash somewhere around, but I can't remember where, I'm afraid :( – Griddle 27/11, 2012 at 6:44

@Jon You're probably referring to this page, which you mentioned in this answer. – Letta 2/12, 2013 at 20:18

@DavidSchwartz: Hmm - I think I've seen a longer page about Bernstein, but that's definitely better than nothing :) – Griddle 2/12, 2013 at 20:22

@JonSkeet a null check in the hash function will add to completeness :) – Louanne 14/12, 2013 at 18:0

@nawfal: I'd rather not distract from the general approach, to be honest. Hopefully anyone with a collection including null references can figure it out... – Griddle 14/12, 2013 at 20:40

@JonSkeet does that solution produce consistent result with List of string across multiple AppDomains? I have a WCF Server where multiple application from various platforms (XP, Vista, Win7 or different .Net Frameworks 3.5 and higher) can connect and in all those situations I need a consistent hashcode from a list of strings is that the case? If not how would I achieve that? – Oneman 17/6, 2016 at 8:28

@RandRandom: You shouldn't be using GetHashCode for that - it's not meant to be consistent across different processes. You should use something like SHA-256 for that. – Griddle 17/6, 2016 at 10:29

@JonSkeet can you share your idea in that question: #37881011 – Oneman 17/6, 2016 at 11:32

What about null vs empty collections? In this case, both will have hash codes of 19. should they be different? Why would you want that to be so? – Contrasty 12/6, 2019 at 8:6

@MuhammadRehanSaeed: What do you mean by a "null collection"? You can't call GetHashCode on a null collection. If foos is null, the code in the answer will throw an exception. If it can be null, then the method should take that into account. Whether null and empty should be considered equal is a context-specific decision. – Griddle 12/6, 2019 at 8:45

@JonSkeet I meant if foos could be null. Why would you differentiate null from an empty collection. Under what context is such a differentiation valid? A colleague is proposing to do so, and I have no answer. #56557628 – Contrasty 12/6, 2019 at 9:41

@MuhammadRehanSaeed: You should ask your colleague rather than us. But if both states are valid, it seems perfectly reasonable to differentiate between them. (Someone carrying an empty box isn't the same as someone not carrying a box at all...) – Griddle 12/6, 2019 at 10:45

why uncheked? – Requisite 19/7, 2021 at 23:31

@JobaDiniz: Because it's entirely normal for hash code operations to overflow, and you really don't want that to throw an exception, even if the rest of the application is in a checked context. – Griddle 20/7, 2021 at 6:6

Sorry if this is a stupid question, but is it safe to assume that the values 19 and 31 are completely arbitrary? – Pole 3/8, 2022 at 15:37

@MikeBruno: No, not completely arbitrary. In particular, multiplying by 31 can be optimized to a "left shift and subtract" - and I believe that both being prime is important for the hash distribution. I confess I don't know much about the maths though. – Griddle 3/8, 2022 at 16:0

@JonSkeet Thanks, that makes sense! Yeah, I'm not so knowledgeable about the mathiplications either – Pole 3/8, 2022 at 17:2

Firstly, double-check that you need a hashcode at all. Are you going to be putting these lists into a hash-mapped structure (e.g. dictionary, hashset, etc)? If not, forget about it.

Now, assuming that you mean that EnumerableObject already overrides Equals(object) (and hopefully therefore also implements IEquatable<EnumerableObject>) for some reason, then this is indeed necessary. You want to balance speed versus bit distribution.

A good starting point is a mult+add or a shift+xor like:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    foreach(var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

(This assumes that you are using item.Equals() for your sequence equality comparison, if you're using an IEqualityComparer's equals you'll need to call into its hashcode).

From there we can optimise.

If null items are disallowed, remove the null-check (be careful, this will make the code throw if it ever does find a null).

If very large lists are common we need to reduce the number examined, while trying not to result in lots of collisions. Compare the following different implementations:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int max = Math.Min(Count, 16);
    for(int i = 0, i != max; ++i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int min = Math.Max(-1, Count - 16);
    for(int i = Count -1, i != min; --i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int step = Count / 16 + 1;
    for(int i = 0, i < Count; i += step)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

Each of these restrict the total number of items examined, which speeds execution but risks poorer quality hashes. Which (if any) is best depends on whether collections with the same start or the same end are more likely.

Changing the number 16 above adjusts the balance; smaller is faster but higher is better hash quality with a lower risk of hash collisions.

Edit: And now you can use my implementation of SpookyHash v. 2:

public override int GetHashCode()
{
  var hasher = new SpookyHash();//use methods with seeds if you need to prevent HashDos
  foreach(var item in this)
    hasher.Update(item.GetHashCode());//or relevant feeds of item, etc.
  return hasher.Final().GetHashCode();
}

This will create a much better distribution than mult+add or shift+xor, while also being particularly fast (especially in 64-bit processes as the algorithm is optimised for that, though it works well on 32-bit too).

Vichyssoise answered 11/11, 2011 at 16:1 Comment(8)

Can you explain where 0x2D2816FE came from? Google suggests this is .Net's hash code for an empty string, but I am not sure why that would be a good starting value. Jon Skeet's answer used 19 in a similar way, but I don't understand that either. – Recaption 16/9, 2013 at 20:31

@ChrisNielsen it's arbitary, and yes I just borrowed from that use. The main thing is that it shouldn't be zero so a to have a different code for an empty list and for null (most uses will assign zero to null objects) as those are two particularly common cases. – Vichyssoise 17/9, 2013 at 8:36

@JonHanna you would want a unchecked scope for GetHashCode there. – Louanne 9/11, 2013 at 8:44

If the list will never be mutated, the time to hash everything once will be a fixed multiple of the time spent building the list. I would favor hashing every item but caching the hash value. If the list will support limited forms of mutation (e.g. Add only), it may be possible to maintain an incrementally-updated hash value as the list grows. Also, in many cases I'd suggest keeping more than 32 bits of state within the hashing loop, and rendering the result to 32 bits at the end. That will help reduce the danger of "systematic" collisions. – Lineman 7/1, 2014 at 19:38

@Lineman almost all uses of GetHashCode() will memoise the hash obtained, making the value of caching it within the method limited. – Vichyssoise 7/1, 2014 at 20:33

@JonHanna: It's true that things like Dictionary will memoise the hash values of objects that are added; since adding an item requires calling GetHashCode() to figure out where it goes, storing the value poses minimal extra overhead. On the other hand, if an object gets stored in or checked against more than one dictionary, or if a dictionary is asked repeatedly for an item it does not contain, each such action will result in another GetHashCode call. Unless one knows the list will be small and repeated calls to nested items' GetHashCode() will be fast, caching would seem wise. – Lineman 7/1, 2014 at 20:50

@Lineman that's assuming it is in fact the same instance that is hashed each time, when if anything the point of overriding is that it might not be, and often won't. I'd also consider that if calling GetHashCode() is particularly expensive, the problem is elsewhere, which funnily enough has been a scratch I've been itching of late (I wouldn't suggest the answer above any more, but something based on bitbucket.org/JonHanna/spookilysharp or at least on what it'll soon be after the initial ironing-out-the-wrinkles period. – Vichyssoise 7/1, 2014 at 21:15

@JonHanna: The point of overriding is that it won't always be the same instance. The point of caching is that GetHashCode() may sometimes end up getting called multiple times on the same instance. That doesn't have to happen very often for caching to be a "win", and in some cases it can make a really huge difference (if the items cache their hash codes, it may not matter so much if the collection does, but if caching isn't considered worthwhile for the collection, why should one expect that it was considered worthwhile for the items)? – Lineman 7/1, 2014 at 22:31

The .GetHashCode() method usually just returns a hash based on the object reference (pointer address). This is because calculating the hash code of every item in an enumerable list can be very time intensive. Instead of overwriting the existing behaviour, I prefer to use an extension method and use it only where the hash code needs to be deterministically determined:

public static class EnumerableExtensions
{
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        return list.Aggregate(seedValue, (current, item) => (current * primeNumber) + (Equals(item, default(TItem)) ? 0 : item.GetHashCode()));
    }
}

Febrific answered 10/1, 2018 at 16:55 Comment(0)

My extension method with null handling based on Jon Skeet answer:

#region UTILS
/// <summary>
/// Utils
/// </summary>
internal static class UTILS
{
    #region GetHashCodeByItems
    /// <summary>
    /// Hash code depending on the content and order of the elements of the collection
    /// </summary>
    /// <param name="lst">Collection</param>
    /// <typeparam name="T">The type of items in the collection</typeparam>
    /// <returns>Hash code</returns>
    internal static int GetHashCodeByItems<T>(this IEnumerable<T> lst)
    {
        unchecked
        {
            int hash = 19;
            foreach (T item in lst)
            {
                hash = hash * 31 + (item != null ? item.GetHashCode() : 1);
            }
            return hash;
        }
    }
    #endregion
}
#endregion

Crumble answered 5/7, 2022 at 10:38 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags