Order-independent Hash Algorithm
Asked Answered
B

4

22

I am currently working on a collection library for my custom programming language. I already have several data types (Collection, List, Map, Set) and implementations for them (mutable and immutable), but what I was missing so far was hashCode and equals. While these are no problem for Lists as they are ordered collections, the play a special role for Sets and Maps. Two Sets are considered equal if they have the same size and the same elements, and the order in which the Sets maintain them should not make a difference in their equality. Because of the equals-hashCode-contract, the hashCode implementation also has to reflect this behavior, meaning that two sets with the same elements but different ordering should have the same hash code. (The same applies for Maps, which are technically a Set of Key-Value-Pairs)

Example (Pseudocode):

let set1: Set<String> = [ "a", "b", "c" ]
let set2: Set<String> = [ "b", "c", "a" ]
set1 == set2       // should return true
set1.hashCode == set2.hashCode // should also return true

How would I implement a reasonably good hash algorithm for which the hashCodes in the above example return the same value?

Baylor answered 9/6, 2015 at 14:23 Comment(15)
How about a pair (sum,product) of the terms in the set? Both of them together would not be common for different sets of numbers (as far as I have seen).Underclassman
For example something like (e1.hashCode() + e2.hashCode() + ... + en.hashCode()) ^ (e1.hashCode() * e2.hashCode() * ... * en.hashCode())?Baylor
Did you try to have a look at how Java implements this?Confirmatory
Just did, it sums the hashCode of the elementsBaylor
...and for equals uses size checking followed by a call to containsAll(). Do you have an issue with this sort of implementation?Confirmatory
Nope, but I wanted to know if there is a better solution, since this collides pretty easily. [1, 1].hashCode == [2].hashCode.Baylor
Can you not use the size of the collection as a salt? To avoid above mentioned collisions I mean.Ancheta
Yes, but then [1, 3].hashCode == [2, 2].hashCode. Since when is summing a good way to generate a hash code?Baylor
@Clashsoft: Wait... you allow duplicate elements in the arrays? Seems to be a easy way to complicate hashcode calculation. Otherwise you could use any operation that is both commutative and associative to combine the hash codes of the individual elements.Tribunal
Seeing as a Set is order independent, you're going to have to use binary operations that are commutative. I would suggest something like what you said above, except perhaps mod the answer by a non-changing large prime number so as to avoid incredibly large values.Prosaism
@Tribunal no, I don't o.O. That was just a bad example. But it makes no difference. A proper example would be [1, 4].hashCode == [2, 3].hashCodeBaylor
Must your mutable collections implement a hashcode by default? Seems like you'd generally want a hashcode to be constant for any given object. Some interesting reading here: artima.com/lejava/articles/equality.htmlPius
There is a big difference between "hash of the sum" and "sum of the hashes". The former, as your examples indicate, is problematic. The latter has fewer problems provided the individual hashes are well-distributed over a large range.Karame
run the same sort on both lists, then hash them.Gamesmanship
@iliketocode you’d have to make a copy and sorting would also increase time complexity from O(n) to O(n log n)Baylor
A
13

The JDK itself proposes the following solution to this problem. The contract of the java.util.Set interface states:

Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode().

An alternative to using the sum of the entries' hash codes would be to use, for example, the ^ (XOR) operator.

The Scala language uses an ordering-invariant version of the Murmurhash algorithm (cf. the private scala.util.hashing.MurmurHash3 class) to implement the hashCode (or ##) method of its immutable sets and similar collections.

Antler answered 9/6, 2015 at 14:44 Comment(5)
As I stated in the comments, I already found the JDK solution for this problem, but I want to know about more useful unordered collection hash algorithms with less collision potential.Baylor
@Baylor What collision potential? If just one of the individual hash codes works well, the entire hash algorithm will be evenly distributed.Hardness
@Hardness The distribution of a sum of uniform random variables is not uniform!Richert
@Richert There are very important differences between real numbers and 32-bit signed integers. This is one of them. The people writing java.set.Util knew what they were doing and came up with a good strategy here.Hardness
Scala's ordering-invariant hash is only based on the sum, product (excluding zeros), count, and XOR of the element hashes, though, so it's not as collision-resistant as "ordering-invariant version of the Murmurhash algorithm" makes it sound. It's better than just summing the hashes, but still not that great.Nada
A
0

Here's the pseudocode for a possible implementation:

String hashCode = null;
for(element : elements){
    hashCode = xor(hashCode, getHashCode(element));
}
return hashCode;

The xor function should return a string that is as long as the longest of the two arguments. It will XOR the bits in each until it gets to the end of one of the arguments. It will then take the remaining bits from the longer string and append those on.

This implementation will mean that the hashCode of a set will be as long as the hashCode of its longest element. Because you are XORing the bits, at the end the hashcode will be the same regardless of the order of your elements. However, as with any hashing implementation, there will be the chance for collisions.

Arnie answered 9/6, 2015 at 14:43 Comment(5)
But what would I do with a String when I need an int hashCode? This seems like a very resourceful solution.Baylor
@Baylor I wasn't sure if you wanted an int or a String. If it's just an int, then taking the sum of the hashCodes of the individual elements will get you what you need, as long as overflows wrap around instead of causing errors. If overflows cause errors, then you'll need to handle that case explicitly and wrap manually. Same concept though.Arnie
Thanks for the answer, but I want to find a different solution other than summing the hash codes of the elements (see comments).Baylor
@Baylor - Note that it really depends on your implementation of hashing numbers whether [1,4] will collide with [2,3]. Remember, you are summing the hash of the number and not the number. Also, though the string implementation is more resource intensive, that also means it will have more bits in the hash and thus less of a chance for collisions.Arnie
Typically, the hash code of an integer (if you are not going super-security) is the integer itself. So 1 .hashCode == 1. Also, since hashCode is enforced to be an int, I have to use string.hashCode in the end anyway.Baylor
S
0

You can calculate the hash sum sorting your collection in alphabetical order.

There is the C# sample - I hope you can translate it in Java :)

static String GetHash(List<String> l)
{
    using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
    {
        return BitConverter.ToString(md5.ComputeHash(l.OrderBy(p => p).SelectMany(s => System.Text.Encoding.ASCII.GetBytes(s + (char)0)).ToArray())).Replace("-", "");
    }
}
Suppositive answered 24/10, 2017 at 11:15 Comment(0)
C
0

If you're looking for an out-of-the box solution, Guava could be utilized:

import java.util.Set;

import com.google.common.base.Charsets;
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

...

public String hash(final Set<String> strings) {
    final HashFunction function = Hashing.murmur3_128();

    // Hashing.combineUnordered will throw an exception if input is empty.
    if (strings.isEmpty()) {
        return function.newHasher()
            .hash()
            .toString();
    }

    final List<HashCode> stringsHashes = strings.stream()
            .map(string -> function.newHasher()
                    .putString(string, Charsets.UTF_8)
                    .hash())
            .toList();

    return Hashing.combineUnordered(stringsHashes).toString();
}

To address authors concern:

... hash algorithms with less collision potential.

sha256 could be used instead of murmur3_128.

Also, you can adapt it to any input type, not just strings, using Funnels.

Carsoncarstensz answered 24/11, 2023 at 5:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.