Python frozenset hashing algorithm / implementation
I'm currently trying to understand the mechanism behind the hash function defined for Python's built-in frozenset data type. The implementation is shown at the bottom for reference. What I'm interested in particular is the rationale for the choice of this scattering operation:

lambda h: (h ^ (h << 16) ^ 89869747) * 3644798167

where h is the hash of each element. Does anyone know where these came from? (That is, was there any particular reason to pick these numbers?) Or were they simply chosen arbitrarily?


Here is the snippet from the official CPython implementation,

static Py_hash_t
frozenset_hash(PyObject *self)
{
    PySetObject *so = (PySetObject *)self;
    Py_uhash_t h, hash = 1927868237UL;
    setentry *entry;
    Py_ssize_t pos = 0;

    if (so->hash != -1)
        return so->hash;

    hash *= (Py_uhash_t)PySet_GET_SIZE(self) + 1;
    while (set_next(so, &pos, &entry)) {
        /* Work to increase the bit dispersion for closely spaced hash
           values.  This is important because some use cases have many
           combinations of a small number of elements with nearby
           hashes so that many distinct combinations collapse to only
           a handful of distinct hash values. */
        h = entry->hash;
        hash ^= (h ^ (h << 16) ^ 89869747UL)  * 3644798167UL;
    }
    hash = hash * 69069U + 907133923UL;
    if (hash == -1)
        hash = 590923713UL;
    so->hash = hash;
    return hash;
}

and an equivalent implementation in Python:

import sys

def _hash(self):
    MAX = sys.maxint  # Python 2; sys.maxsize is the Python 3 counterpart
    MASK = 2 * MAX + 1
    n = len(self)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in self:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747)  * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h
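
For experimenting in Python 3 (where sys.maxint no longer exists), the method above can be restated as a standalone function; assuming a 64-bit build, sys.maxsize plays the old sys.maxint role, and the constants are exactly those from the original:

```python
import sys

def fs_hash(xs):
    # Python 3 restatement of _hash above; sys.maxsize replaces
    # Python 2's sys.maxint (assumes a 64-bit build).
    MAX = sys.maxsize
    MASK = 2 * MAX + 1
    n = len(xs)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in xs:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h

print(fs_hash([10, 20]) == fs_hash([20, 10]))  # True: order-independent
```

Because elements are combined with xor-equal, the result is the same whatever order the elements arrive in.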
Cyperaceous answered 30/12, 2013 at 1:54 Comment(1)
Edit: I removed the question asking why the hash-combining function is associative and commutative: since each scattered element hash is XORed into the accumulator, it has to be, because XOR is! Silly me.Cyperaceous

Unless Raymond Hettinger (the code's author) chimes in, we'll never know for sure ;-) But there's usually less "science" in these things than you might expect: you take some general principles, and a test suite, and fiddle the constants almost arbitrarily until the results look "good enough".

Some general principles "obviously" at work here:

  1. To get the desired quick "bit dispersion", you want to multiply by a large integer. Since CPython's hash result has to fit in 32 bits on many platforms, an integer that requires 32 bits is best for this. And, indeed, (3644798167).bit_length() == 32.

  2. To avoid systematically losing the low-order bit(s), you want to multiply by an odd integer. 3644798167 is odd.

  3. More generally, to avoid compounding patterns in the input hashes, you want to multiply by a prime. And 3644798167 is prime.

  4. And you also want a multiplier whose binary representation doesn't have obvious repeating patterns. bin(3644798167) == '0b11011001001111110011010011010111'. That's pretty messed up, which is a good thing ;-)
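
Those claims about the multiplier are easy to check (a Python 3 sketch; the trial-division primality test is mine, not from the original answer):

```python
def is_prime(n):
    # simple trial division; fine for a single 32-bit candidate
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

MULT = 3644798167
print(MULT.bit_length())  # 32: needs a full 32 bits
print(MULT % 2)           # 1: odd
print(is_prime(MULT))     # True: prime
print(bin(MULT))          # no obvious repeating pattern
```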

The other constants look utterly arbitrary to me. The

if h == -1:
    h = 590923713

part is needed for another reason: internally, CPython takes a -1 return value from an integer-valued C function as meaning "an exception needs to be raised"; i.e., it's an error return. So you'll never see a hash code of -1 for any object in CPython. The value returned instead of -1 is wholly arbitrary - it just needs to be the same value (instead of -1) each time.

EDIT: playing around

I don't know what Raymond used to test this. Here's what I would have used: look at hash statistics for all subsets of a set of consecutive integers. Those are problematic because hash(i) == i for a great many integers i.

>>> all(hash(i) == i for i in range(1000000))
True

Simply xor'ing hashes together will yield massive cancellation on inputs like that.
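
Concretely (Python 3; CPython hashes small non-negative ints to themselves):

```python
from functools import reduce
from operator import xor

# 1 ^ 2 ^ 3 == 0, and xoring in hash(0) == 0 changes nothing,
# so these two distinct sets collapse to the same xor "hash":
a = reduce(xor, map(hash, [1, 2, 3]))
b = reduce(xor, map(hash, [0, 1, 2, 3]))
print(a, b)  # 0 0
```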

So here's a little function to generate all subsets, and another to do a dirt-simple xor across all hash codes:

def hashxor(xs):
    h = 0
    for x in xs:
        h ^= hash(x)
    return h

def genpowerset(xs):
    from itertools import combinations
    for length in range(len(xs) + 1):
        for t in combinations(xs, length):
            yield t

Then a driver, and a little function to display collision statistics:

def show_stats(d):
    total = sum(d.values())
    print("total", total, "unique hashes", len(d),
          "collisions", total - len(d))

def drive(n, hasher=hashxor):
    from collections import defaultdict
    d = defaultdict(int)

    for t in genpowerset(range(n)):
        d[hasher(t)] += 1
    show_stats(d)

Using the dirt-simple hasher is disastrous:

>>> drive(20)
total 1048576 unique hashes 32 collisions 1048544

Yikes! OTOH, using the _hash() designed for frozensets does a perfect job in this case:

>>> drive(20, _hash)
total 1048576 unique hashes 1048576 collisions 0

Then you can play with that to see what does - and doesn't - make a real difference in _hash(). For example, it still does a perfect job on these inputs if

    h = h * 69069 + 907133923

is removed. And I have no idea why that line is there. Similarly, it continues to do a perfect job on these inputs if the ^ 89869747 in the inner loop is removed - don't know why that's there either. And initialization can be changed from:

    h = 1927868237 * (n + 1)

to:

    h = n

without harm here too. That all jibes with what I expected: it's the multiplicative constant in the inner loop that's crucial, for reasons already explained. For example, add 1 to it (use 3644798168) and then it's no longer prime or odd, and the stats degrade to:

total 1048576 unique hashes 851968 collisions 196608

Still quite usable, but definitely worse. Change it to a small prime, like 13, and it's worse:

total 1048576 unique hashes 483968 collisions 564608

Use a multiplier with an obvious binary pattern, like 0b01010101010101010101010101010101, and worse again:

total 1048576 unique hashes 163104 collisions 885472

Play around! These things are fun :-)

Foy answered 30/12, 2013 at 4:18 Comment(1)
Thank you and I appreciate your analysis of the algorithm! :)Cyperaceous

The problem being solved is that the previous hash algorithm in Lib/sets.py had horrendous performance on datasets that arise in a number of graph algorithms (where nodes are represented as frozensets):

# Old-algorithm with bad performance

def _compute_hash(self):
    result = 0
    for elt in self:
        result ^= hash(elt)
    return result

def __hash__(self):
    if self._hashcode is None:
        self._hashcode = self._compute_hash()
    return self._hashcode

A new algorithm was designed to perform much better on those datasets. Here is an overview of its salient parts:

1) The xor-equal in h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167 is necessary so that the algorithm is commutative (the hash does not depend on the order in which set elements are encountered). Since sets have an unordered equality test, the hash for frozenset([10, 20]) needs to be the same as for frozenset([20, 10]).

2) The xor with 89869747 was chosen for its interesting bit pattern 101010110110100110110110011, which is used to break up sequences of nearby hash values prior to multiplying by 3644798167, a randomly chosen large prime with another interesting bit pattern.

3) The xor with hx << 16 was included so that the lower bits had two chances to affect the outcome (resulting in better dispersion of nearby hash values). In this, I was inspired by how CRC algorithms shuffled bits back on to themselves.

4) If I recall correctly, the only one of the constants that is special is 69069. It had some history from the world of linear congruential random number generators. See https://www.google.com/search?q=69069+rng for some references.

5) The final step of computing hash = hash * 69069U + 907133923UL was added to handle cases with nested frozensets and to make the algorithm disperse in a pattern orthogonal to the hash algorithms for other objects (strings, tuples, ints, etc).

6) Most of the other constants were randomly chosen large prime numbers.
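
The per-element scattering step from points 2 and 3 can be sketched in Python 3; the 64-bit mask is an assumption standing in for C's unsigned Py_uhash_t wraparound on a 64-bit build:

```python
MASK = (1 << 64) - 1  # assumed 64-bit Py_uhash_t wraparound

def scatter(h):
    # h ^ (h << 16) gives the low bits two chances to matter (point 3),
    # ^ 89869747 breaks up runs of nearby values (point 2),
    # and the multiply disperses the result across the whole word.
    return ((h ^ (h << 16) ^ 89869747) * 3644798167) & MASK

# closely spaced inputs come out widely separated and all distinct
for h in (1, 2, 3):
    print(h, hex(scatter(h)))
```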

Though I would like to claim divine inspiration for the hash algorithm, the reality was that I took a bunch of badly performing datasets, analyzed why their hashes weren't dispersing, and then toyed with the algorithm until the collision statistics stopped being so embarrassing.

For example, here is an efficacy test from Lib/test/test_set.py that failed for algorithms with less diffusion:

def test_hash_effectiveness(self):
    n = 13
    hashvalues = set()
    addhashvalue = hashvalues.add
    elemmasks = [(i+1, 1<<i) for i in range(n)]
    for i in xrange(2**n):
        addhashvalue(hash(frozenset([e for e, m in elemmasks if m&i])))
    self.assertEqual(len(hashvalues), 2**n)

Other failing examples included powersets of strings and small integer ranges as well as the graph algorithms in the test suite: See TestGraphs.test_cuboctahedron and TestGraphs.test_cube in Lib/test/test_set.py.

Cohesion answered 5/1, 2014 at 8:13 Comment(4)
I provided some test code to Raymond while he was working on this. For various nesting patterns of sets, it randomly generates thousands of sets and compares the number of collisions against the expected number for an ideal hash. I also provided an initial hash implementation. It was similar to Raymond's, but with multiple steps and less carefully chosen constants. Raymond's final result improves significantly on speed and simplicity, and performs almost as well on the randomness tests.Ocreate
For a multiset, would this algorithm fail if used in the form Set._hash(['a', 'a', 'b'])? Will duplicate elements mess with it?Pat
@Pat You're right that Set._hash isn't appropriate for a multiset. However, it's an easy modification. For a multiset, just change the element hash to include each element's count: hx = hash(x) -> hx = hash((x, cnt)).Cohesion
@RaymondHettinger Would it be appropriate to mix in a constant as well to prevent inter-type collisions (like hash(frozenset({('a', 2), ('b', 1)})) == hash(FrozenMultiset(['a', 'a', 'b'])))? Something like Set._hash({self.__HASH_CONSTANT}.union(self._impl_dict.items())), where the constant is a dumb one-off object(), seems to produce usable results, and I can't see any obvious ways for that to go wrong...Pat

In

(h ^ (h << 16) ^ 89869747) * 3644798167

the multiplier is a large prime, chosen to reduce collisions. This matters especially because the arithmetic wraps modulo the machine word size.
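
One way to see why oddness matters here (a Python 3.8+ sketch, not from the answer): an odd multiplier is invertible modulo a power of two, so the multiply can't merge any values on its own; it only spreads bits around.

```python
M = 2 ** 64                 # assumed word size for the modulo
MULT = 3644798167
inv = pow(MULT, -1, M)      # inverse exists because MULT is odd (Python 3.8+)
x = 123456789
y = (x * MULT) % M
print((y * inv) % M == x)   # True: the multiply is fully reversible
```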

The rest is probably arbitrary; I see no reason for 89869747 in particular. Its most important effect is enlarging the hashes of small numbers (most small integers hash to themselves), which prevents heavy collisions for sets of small integers.

That's all I can think of. What do you need this for?

Attendance answered 30/12, 2013 at 3:51 Comment(1)
Curiosity mostly. Needed a way to hash trees with associative and commutative branches (i.e. each node is an unordered multiset: Data.HashMap) and wanted to look at how Python handles the hashing of frozensets.Cyperaceous
