Quick and Simple Hash Code Combinations
Asked Answered
H

10

112

Can people recommend quick and simple ways to combine the hash codes of two objects? I am not too worried about collisions, since I have a hash table which will handle them efficiently; I just want something that generates a code as quickly as possible.

Reading around SO and the web there seem to be a few main candidates:

  1. XORing
  2. XORing with Prime Multiplication
  3. Simple numeric operations like multiplication/division (with overflow checking or wrapping around)
  4. Building a String and then using the String class's hash code method

What would people recommend and why?

Hanson answered 29/10, 2009 at 21:53 Comment(0)
K
162

I would personally avoid XOR - it means that any two equal values will result in 0 - so hash(1, 1) == hash(2, 2) == hash(3, 3) etc. Also hash(5, 0) == hash(0, 5) etc which may come up occasionally. I have deliberately used it for set hashing - if you want to hash a sequence of items and you don't care about the ordering, it's nice.
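A quick illustration of those collisions (just a demonstration, not a recommended hash):

```csharp
using System;

class XorDemo
{
    static void Main()
    {
        // XOR of equal values is always zero...
        int h11 = 1 ^ 1;  // 0
        int h22 = 2 ^ 2;  // 0 - collides with h11
        // ...and XOR is symmetric, so swapped pairs collide too.
        int h50 = 5 ^ 0;  // 5
        int h05 = 0 ^ 5;  // 5 - collides with h50
        Console.WriteLine(h11 == h22 && h50 == h05); // True
    }
}
```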

I usually use:

unchecked
{
    int hash = 17;
    hash = hash * 31 + firstField.GetHashCode();
    hash = hash * 31 + secondField.GetHashCode();
    return hash;
}

That's the form that Josh Bloch suggests in Effective Java. Last time I answered a similar question I managed to find an article where this was discussed in detail - IIRC, no-one really knows why it works well, but it does. It's also easy to remember, easy to implement, and easy to extend to any number of fields.

Kimmi answered 29/10, 2009 at 22:11 Comment(24)
Yeah, that was my concern about XORing; in the type of data I'm pairing it's fairly unlikely to be pairing two equal items, but not impossibleHanson
Looks like Dan Bernstein's (or Chris Torek's) hash, just with different constants. Nobody knows why that works well either.Flunky
@RobV: I prefer not to have to think if I don't have to. I use this hash even when I could get away with XOR, just to avoid having to wonder whether it's safe or not :)Kimmi
what are the firstField and secondField that you talk about?Causal
@SuperString: the first and second fields you want to hash. You then keep going for the other fields.Kimmi
A word of warning: this is (a variation of) the Bernstein hash, and because nobody knows why it does well in tests, it is not advisable when hashing is critical. See eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx. Also, you should wrap this code in an unchecked { } block, since GetHashCode() should not throw any exceptions.Dissimilate
This is so mysterious - why 17? why 31? is it important that they are primes?Ear
@tofutim: The 31 is a good choice as multiplication by 31 can be optimized to a shift and subtract. Whether it is optimized that way depends on the platform. As for why those numbers work well for hashing - as Henk says, it's a bit of a mystery.Kimmi
Why do you prime (no pun intended) the pump with 17 and 31? Why isn't it just return firstField.GetHashCode() * 31 + secondField.GetHashCode();? Or is that part of what no one knows?Capitate
@PatrickM: I follow Josh Bloch (or some variant), basically. It does mean that it's harder to end up with a hash code of 0...Kimmi
But is a hash code of 0 any worse than a hash code of 16337? (17 * 31 * 31, which is amusingly close to 3L337...) I suppose having more multiplications means you have bigger hash codes, on the whole. Is there a similar improvement in distribution?Capitate
@PatrickM: I think hash codes of 0 are more likely to occur between different types, raising the chance of a collision. For example, two objects with different numbers of fields, but where all the fields give a hash code of 0, would have different hashes under this scheme. But this is a pretty tenuous argument, I'm willing to admit :)Kimmi
The Bernstein hash (DJB, DJB2) uses 33, not 31. And 5381, not 17. But yeah, otherwise very similar. I've used DJB2 for both string hashing and hashcode combination and it's worked fine for me. DJB2 and other algorithms can be found at cse.yorku.ca/~oz/hash.htmlPrintery
@Jon can you please comment on Special Sauce's answer? I'd like to know what your thoughts are on it.Devastating
@rory.ap: I think it's an excellent piece of work, and I'd be perfectly happy to use those numbers. While I hate to admit to using constants "because someone else said to" that's basically what the 17/31 pair is about.Kimmi
"nobody knows" feels like higher beings know it and we can't comprehend it!Juvenescent
Starting with .NET Core 2.1 you can use the System.HashCode type's Combine method to do this learn.microsoft.com/en-us/dotnet/api/system.hashcode.combineBring
Hi, that seems to be a great way to get a good distribution of values. The only thing that bothers me: the process is order-dependent, which means you will get a different hash depending on the order in which you feed the values in. That is prone to error. You can add a comparison of the two inner hashes at the start to solve the problem (to ensure the hash is calculated based on the smaller or bigger inner hash first), but keep in mind that this only works for a pair, not more.Forsake
@EricOuellet: I would say that being order-dependent is a good thing. You don't want (1, 0) and (0, 1) to give the same hash code, for example, as that gives more hash collisions.Kimmi
@JonSkeet, I partly agree. I would say that for some cases it is a good thing, but it really depends on the situation. I think it is important to draw a clear distinction between the two very different behaviours a combining scheme can have (order-dependent versus order-independent); many people seem to underestimate this. It is very case-specific.Forsake
@EricOuellet: It would only be a good thing for it to be order-independent if equality was also defined in an order-independent way, which is rare in my experience.Kimmi
I'm still unable to find System.HashCode - I get a compiler error when I try to use it.Jota
@mushi: Which platform are you targeting?Kimmi
Android/Unity. I later realised github.com/GlitchEnzo/NuGetForUnity exists, so I now have what I need! Except I don't know if it compiles to Android or not.Jota
C
103

If you are using .NET Core 2.1 or later or .NET Framework 4.6.1 or later, consider using the System.HashCode struct to help with producing composite hash codes. It has two modes of operation: Add and Combine.

An example using Combine, which is usually simpler and works for up to eight items:

public override int GetHashCode()
{
    return HashCode.Combine(object1, object2);
}

An example of using Add:

public override int GetHashCode()
{
    var hash = new HashCode();
    hash.Add(this.object1);
    hash.Add(this.object2);
    return hash.ToHashCode();
}

Pros:

  • Part of .NET itself, as of .NET Core 2.1/.NET Standard 2.1 (though, see con below)
    • For .NET Framework 4.6.1 and later, the Microsoft.Bcl.HashCode NuGet package can be used to backport this type.
  • Looks to have good performance and mixing characteristics, based on the work the author and the reviewers did before merging this into the corefx repo
  • Handles nulls automatically
  • Overloads that take IEqualityComparer instances

Cons:

  • Not available out of the box on .NET Framework or .NET Standard 2.0; those targets need the Microsoft.Bcl.HashCode NuGet package
Cooe answered 6/8, 2018 at 22:35 Comment(9)
You can reference nuget.org/packages/Microsoft.Bcl.HashCode to make it work on .NET Framework 4.6.1 or .NET Standard 2.0, officially.Prefiguration
System.HashCode uses xxHash32 (github.com/Cyan4973/xxHash)Fat
Does Bcl.HashCode generate stable hashes for strings?Snowstorm
@RedRidingHood, HashCode is not the right type to use for a hash code that gets persisted, which is what I assume you mean by "stable". HashCode values are only stable within the same instance of a process. This has been the documented behavior of .NET GetHashCode and related values for a very, very long time. Though, .NET Core started using a per-process seed to enforce this. If you have more questions, I suggest you post them as a stand-alone question.Cooe
Thanks, I was aware of the .NET Core behavior, I just wasn't sure if the Bcl package was different from what .NET Core is using.Snowstorm
So i've used this before fine for 2 fields, 3 fields... What about 10? 30?Inexpressible
@James: follow the example using Add for lots of fields. You "accumulate" values in each call to Add and produce a final hashcode by calling ToHashCode.Cooe
The question is more about effectiveness, and if this simple method breaks down or could be improved when combining many hashesInexpressible
@James, the current implementation uses xxHash for mixing the input hashes together. xxHash has good mixing characteristics. I doubt you'll notice anything untoward as the number of hashes you combine increases. This is in contrast to something like a naive XOR hash, which will "lose" an input value if the same value is added later. I also doubt that any future implementations would regress this.Cooe
S
67

While the template outlined in Jon Skeet's answer works well in general as a hash function family, the choice of the constants is important and the seed of 17 and factor of 31 as noted in the answer do not work well at all for common use cases. In most use cases, the hashed values are much closer to zero than int.MaxValue, and the number of items being jointly hashed are a few dozen or less.

For hashing an integer tuple {x, y} where -1000 <= x <= 1000 and -1000 <= y <= 1000, it has an abysmal collision rate of almost 98.5%. For example, {1, 0} collides with {0, 31}, {1, 1} with {0, 32}, etc. If we expand the coverage to also include n-tuples where 3 <= n <= 25, it does less terribly, with a collision rate of about 38%. But we can do much better.

public static int CustomHash(int seed, int factor, params int[] vals)
{
    int hash = seed;
    unchecked // let the multiplication wrap instead of throwing in checked builds
    {
        foreach (int i in vals)
        {
            hash = (hash * factor) + i;
        }
    }
    return hash;
}

I wrote a Monte Carlo sampling search loop that tested the method above with various values for seed and factor over various random n-tuples of random integers i. Allowed ranges were 2 <= n <= 25 (where n was random but biased toward the lower end of the range) and -1000 <= i <= 1000. At least 12 million unique collision tests were performed for each seed and factor pair.

After about 7 hours running, the best pair found (where the seed and factor were both limited to 4 digits or less) was: seed = 1009, factor = 9176, with a collision rate of 0.1131%. In the 5- and 6-digit areas, even better options exist. But I selected the top 4-digit performer for brevity, and it performs quite well in all common int and char hashing scenarios. It also seems to work fine with integers of much greater magnitudes.

It is worth noting that "being prime" did not seem to be a general prerequisite for good performance as a seed and/or factor, although it likely helps. 1009 noted above is in fact prime, but 9176 is not. I explicitly tested variations on this where I changed factor to various primes near 9176 (while leaving seed = 1009) and they all performed worse than the above solution.

Lastly, I also compared against the generic ReSharper-recommended function family of hash = (hash * factor) ^ i;, and the original CustomHash() above seriously outperforms it. The ReSharper XOR style seems to have collision rates in the 20-30% range for common use case assumptions and should not be used in my opinion.
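To make the methodology concrete, here is a small, deterministic sketch (not the original Monte Carlo search loop) that counts distinct hashes over all 2-tuples with coordinates in [-100, 100]. With seed 17 / factor 31 many pairs collide, while 1009 / 9176 keeps every pair distinct on this grid:

```csharp
using System;
using System.Collections.Generic;

class CollisionCount
{
    static int CustomHash(int seed, int factor, params int[] vals)
    {
        int hash = seed;
        unchecked
        {
            foreach (int i in vals)
                hash = (hash * factor) + i;
        }
        return hash;
    }

    // Count distinct hash values over all 201 * 201 = 40401 pairs.
    static int DistinctHashes(int seed, int factor)
    {
        var seen = new HashSet<int>();
        for (int x = -100; x <= 100; x++)
            for (int y = -100; y <= 100; y++)
                seen.Add(CustomHash(seed, factor, x, y));
        return seen.Count;
    }

    static void Main()
    {
        Console.WriteLine(DistinctHashes(17, 31));     // far fewer than 40401
        Console.WriteLine(DistinctHashes(1009, 9176)); // 40401 - no collisions on this grid
    }
}
```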

Sacci answered 30/11, 2015 at 19:25 Comment(8)
Wow. I love the work that went into this answer. Impressive, well done!Love
Seems to be the best, but I have two remarks. First, the easy one: why not move "seed" and "factor" to the end and give them default values (1009 and 9176), which should do the job for most people? Second: like Jon Skeet's algorithm, this is order-dependent, and you can get a different hash if you feed values in a different order. I wonder if it would not be safer to sort the array first, to ensure you get the same final hash regardless of the order in which you feed the algorithm.Forsake
@EricOuellet Because the params int[] vals must come at the end of all function arguments, I was not able to make the seed and factor defaulted parameters. If you don't care about the params syntax convenience, you could remove it and then reorder the parameters to allow the defaults as you suggested.Sacci
@EricOuellet The default hashing for an array should be to consider permutations (this is the more general case) and thus the hash would be different for different orderings (just as the hash for string "abc" is different than the hash for "acb"). If you specifically wanted a hash function for combinations only, you should probably accept a HashSet<int> argument (assuming no duplicates). Otherwise you could rename the function to CustomHashCombination() to prevent confusion, and do the internal pre-sort as suggested.Sacci
I like this answer, but I wouldn't use params because it has to allocate an array on every call. So it may be faster computations-wise, but it creates GC pressure for later.Noddle
What were the 5 and 6-digit numbers that got generated? Will 1009 and 9176 work well for hashing in a table of 2^28 (int.MaxValue / 16) values?Jota
Not sure if I can trust these constants considering this test has a hard limit [-1000..1000] range for i. Wouldn't it be better, if possible, to run it allowing all int values, but using random ints from a bell-curve distribution centered in 0? You could also add ints coming from GetHashCode of floats and doubles in a bell-curve distribution centered in 0.0. It'd be great if you published your testing algorithm in a gist or something so people can understand it better (and maybe tweak it for specific cases).Dickey
@Dickey Larger int values are not really a problem because they tend to mix even faster (over the unchecked int.MaxValue modulo). You could make seed and factor very large random constants, and get great mixing / low collisions immediately with almost any seed & factor values you tried. The whole point of my exercise was to find a seed & factor combination that worked well with a low number of digits to make a nice readable, convenient snippet.Sacci
I
26

Use the combination logic built into tuples. This example uses C# 7 tuples:

(field1, field2).GetHashCode();
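In a typical Equals/GetHashCode pair (the type and field names here are just placeholders), that looks like:

```csharp
using System;

public class Point
{
    public int X { get; }
    public int Y { get; }
    public Point(int x, int y) { X = x; Y = y; }

    public override bool Equals(object obj) =>
        obj is Point p && p.X == X && p.Y == Y;

    // Delegate hashing to ValueTuple's built-in combination logic.
    public override int GetHashCode() => (X, Y).GetHashCode();
}
```

Equal points are guaranteed to produce equal hash codes, since their tuples contain the same values.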
Interlineate answered 24/11, 2017 at 22:38 Comment(6)
Nice idea, though I suspect that this might have issues with GC churn since you are implicitly creating a short-lived objectHanson
@Hanson Tuples are value types, so they are stack allocated and exert no GC pressure.Shagbark
One problem... (0,1,2).GetHashCode() and (0,0,1,2).GetHashCode() both yield the same value: 35. Whereas the method in the most upvoted answer yields unique values 0, 1, 2: 506480 and 0, 0, 1, 2: 15699890Improvvisatore
Hash codes aren't guaranteed to be unique. You found a case where they aren't... That doesn't make it a bad choice unless there are a lot of collisions (in that case it would be a good idea to submit a bug). I personally prefer to use something from the framework instead of implementing something different.Interlineate
It's actually the ValueTuple type, which is a struct (MSDN). Be careful: the Tuple type is a class, and it creates GC pressure. I like this way. Internally it's similar to @Stipo's post but very easy to understand and to review. It would be a good choice in most cases.Drawstring
Best answer, works on both .net 4.5 and on core. Use the System.ValueTuple nuget package on .net 4.5, and on core it's built in.Pahari
T
21

I presume that the .NET Framework team did a decent job of testing their System.String.GetHashCode() implementation, so I would use it:

// System.String.GetHashCode(): http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4
// System.Web.Util.StringUtil.GetStringHashCode(System.String): http://referencesource.microsoft.com/#System.Web/Util/StringUtil.cs,c97063570b4e791a
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    int hash1 = (5381 << 16) + 5381;
    int hash2 = hash1;

    int i = 0;
    foreach (var hashCode in hashCodes)
    {
        if (i % 2 == 0)
            hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hashCode;
        else
            hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ hashCode;

        ++i;
    }

    return hash1 + (hash2 * 1566083941);
}

Another implementation is from the System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32) and System.Array.CombineHashCodes(System.Int32, System.Int32) methods. This one is simpler, but probably doesn't have as good a distribution as the method above:

// System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#System.Web/Util/HashCodeCombiner.cs,21fb74ad8bb43f6b
// System.Array.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#mscorlib/system/array.cs,87d117c8cc772cca
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    int hash = 5381;

    foreach (var hashCode in hashCodes)
        hash = ((hash << 5) + hash) ^ hashCode;

    return hash;
}
Thunderstorm answered 11/12, 2015 at 17:55 Comment(0)
M
3

This is a repackaging of Special Sauce's brilliantly researched solution. It makes use of value tuples (via the ITuple interface), which allows defaults for the seed and factor parameters.

public static int CombineHashes(this ITuple tupled, int seed=1009, int factor=9176)
{
    var hash = seed;

    for (var i = 0; i < tupled.Length; i++)
    {
        unchecked
        {
            hash = hash * factor + tupled[i].GetHashCode();
        }
    }

    return hash;
}

Usage:

var hash1 = ("Foo", "Bar", 42).CombineHashes();    
var hash2 = ("Jon", "Skeet", "Constants").CombineHashes(seed: 17, factor: 31);
Madgemadhouse answered 20/1, 2020 at 11:37 Comment(0)
P
0

If you're looking for speed and don't have too many collisions, then XOR is fastest. To prevent a clustering around zero, you could do something like this:

int finalHash = hash1 ^ hash2;
return finalHash != 0 ? finalHash : hash1;

Of course, some prototyping ought to give you an idea of performance and clustering.

Phineas answered 30/10, 2009 at 0:34 Comment(0)
D
-1

If your input hashes are the same size, evenly distributed and not related to each other then an XOR should be OK. Plus it's fast.

The situation I'm suggesting this for is where you want to do

H = hash(A) ^ hash(B); // A and B are different types, so there's no way A == B.

Of course, if A and B can be expected to hash to the same value with a reasonable (non-negligible) probability, then you should not use XOR in this way.

Dermato answered 29/10, 2009 at 21:57 Comment(1)
how would I tell whether my hash codes are evenly distributed, is there an easy benchmark to do for this? I know the collision rate is pretty low but does that necessarily correspond to an even distribution?Hanson
D
-4

Assuming you have a relevant ToString() override (in which your different fields appear), I would just return its hash code:

this.ToString().GetHashCode();

This is not very fast, but it should avoid collisions quite well.

Decastere answered 29/12, 2018 at 12:57 Comment(0)
K
-13

I would recommend using the built-in hash functions in System.Security.Cryptography rather than rolling your own.

Kussell answered 29/10, 2009 at 23:5 Comment(1)
No, they have a very different purpose and break the rule that GetHashCode should be fast.Dissimilate

© 2022 - 2024 — McMap. All rights reserved.