Unless Raymond Hettinger (the code's author) chimes in, we'll never know for sure ;-) But there's usually less "science" in these things than you might expect: you take some general principles, and a test suite, and fiddle the constants almost arbitrarily until the results look "good enough".
Some general principles "obviously" at work here:
To get the desired quick "bit dispersion", you want to multiply by a large integer. Since CPython's hash result has to fit in 32 bits on many platforms, an integer that requires 32 bits is best for this. And, indeed, (3644798167).bit_length() == 32.
To avoid systematically losing the low-order bit(s), you want to multiply by an odd integer. 3644798167 is odd.
More generally, to avoid compounding patterns in the input hashes, you want to multiply by a prime. And 3644798167 is prime.
And you also want a multiplier whose binary representation doesn't have obvious repeating patterns. bin(3644798167) == '0b11011001001111110011010011010111'. That's pretty messed up, which is a good thing ;-)
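None of this is hard to verify in a few lines (the is_prime here is a naive trial-division helper I'm adding just for this check - it's not part of CPython):

```python
def is_prime(n):
    # Naive trial division - plenty fast for a single 32-bit number.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

MULT = 3644798167
print(MULT.bit_length())  # 32: uses the full 32 bits
print(MULT % 2)           # 1: odd
print(is_prime(MULT))     # True: prime
print(bin(MULT))          # no obvious repeating pattern
```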
The other constants look utterly arbitrary to me. The

    if h == -1:
        h = 590923713

part is needed for another reason: internally, CPython takes a -1 return value from an integer-valued C function as meaning "an exception needs to be raised"; i.e., it's an error return. So you'll never see a hash code of -1 for any object in CPython. The value returned instead of -1 is wholly arbitrary - it just needs to be the same value (instead of -1) each time.
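You can see the same dodge from pure Python - this is CPython-specific behavior, but easy to demonstrate:

```python
# CPython never returns -1 as a hash; for ints it quietly maps -1 to -2.
print(hash(-1))  # -2
print(hash(-2))  # -2, so -1 and -2 hash alike in CPython
```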
EDIT: playing around
I don't know what Raymond used to test this. Here's what I would have used: look at hash statistics for all subsets of a set of consecutive integers. Those are problematic because hash(i) == i for a great many integers i.
>>> all(hash(i) == i for i in range(1000000))
True
Simply xor'ing hashes together will yield massive cancellation on inputs like that.
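To make the cancellation concrete: since hash(i) == i for small ints, the subsets {1, 2} and {3} already collide under a plain xor:

```python
# hash(1) ^ hash(2) == 1 ^ 2 == 3 == hash(3), so a plain xor can't
# tell the subset {1, 2} apart from the subset {3}.
print(hash(1) ^ hash(2))  # 3
print(hash(3))            # 3
```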
So here's a little function to generate all subsets, and another to do a dirt-simple xor across all hash codes:
def hashxor(xs):
    h = 0
    for x in xs:
        h ^= hash(x)
    return h

def genpowerset(xs):
    from itertools import combinations
    for length in range(len(xs) + 1):
        for t in combinations(xs, length):
            yield t
Then a driver, and a little function to display collision statistics:
def show_stats(d):
    total = sum(d.values())
    print("total", total, "unique hashes", len(d),
          "collisions", total - len(d))

def drive(n, hasher=hashxor):
    from collections import defaultdict
    d = defaultdict(int)
    for t in genpowerset(range(n)):
        d[hasher(t)] += 1
    show_stats(d)
Using the dirt-simple hasher is disastrous:
>>> drive(20)
total 1048576 unique hashes 32 collisions 1048544
Yikes! OTOH, using the _hash() designed for frozensets does a perfect job in this case:
>>> drive(20, _hash)
total 1048576 unique hashes 1048576 collisions 0
Then you can play with that to see what does - and doesn't - make a real difference in _hash(). For example, it still does a perfect job on these inputs if
h = h * 69069 + 907133923
is removed. And I have no idea why that line is there. Similarly, it continues to do a perfect job on these inputs if the ^ 89869747 in the inner loop is removed - don't know why that's there either. And initialization can be changed from:
h = 1927868237 * (n + 1)
to:
h = n
without harm here too. That all jibes with what I expected: it's the multiplicative constant in the inner loop that's crucial, for reasons already explained. For example, add 1 to it (use 3644798168) and then it's no longer prime or odd, and the stats degrade to:
total 1048576 unique hashes 851968 collisions 196608
Still quite usable, but definitely worse. Change it to a small prime, like 13, and it's worse:
total 1048576 unique hashes 483968 collisions 564608
Use a multiplier with an obvious binary pattern, like 0b01010101010101010101010101010101, and worse again:
total 1048576 unique hashes 163104 collisions 885472
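If you'd like to run these variations yourself, here's a self-contained, parameterized sketch of the same hash structure (the function and keyword names are mine, purely for experimenting - not CPython API):

```python
from itertools import combinations

MASK = 0xFFFFFFFF  # imitate a 32-bit C long

def set_hash(xs, mult=3644798167, inner_xor=89869747,
             init=1927868237, post_mult=69069, post_add=907133923):
    # Same structure as the frozenset-style hash, with every constant
    # swappable so its effect on collisions can be measured.
    h = (init * (len(xs) + 1)) & MASK
    for x in xs:
        hx = hash(x) & MASK
        h ^= ((hx ^ (hx << 16) ^ inner_xor) * mult) & MASK
    return (h * post_mult + post_add) & MASK

def collisions(n, **kwds):
    # Hash every subset of range(n) and count duplicate hash values.
    seen = set()
    total = 0
    for length in range(n + 1):
        for t in combinations(range(n), length):
            seen.add(set_hash(t, **kwds))
            total += 1
    return total - len(seen)

print(collisions(16))           # the real multiplier
print(collisions(16, mult=13))  # a small prime instead
```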
Play around! These things are fun :-)