What's a correct and good way to implement __hash__()?
Asked Answered
T

8

270

What's a correct and good way to implement __hash__()?

I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.

As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions). What's a good practice to get such values? Are collisions a problem? In my case I have a small class which acts as a container class holding some ints, some floats and a string.

Thurstan answered 25/5, 2010 at 22:56 Comment(0)
S
304

An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.

Here's an example of using a key for hash and equality:

class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented

Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.

Steel answered 25/5, 2010 at 22:59 Comment(6)
Aside from the minor overhead from factoring out the __key function, this is about as fast as any hash can be. Sure, if the attributes are known to be integers, and there aren't too many of them, I suppose you could potentially run slightly faster with some home-rolled hash, but it likely wouldn't be as well distributed. hash((self.attr_a, self.attr_b, self.attr_c)) is going to be surprisingly fast (and correct), as creation of small tuples is specially optimized, and it pushes the work of getting and combining hashes to C builtins, which is typically faster than Python level code.Fulmis
Let's say an object of class A is being used as a key for a dictionary and if an attribute of class A changes, its hash value will change as well. Wouldn't that create a problem ?Petitioner
As @loved.by.Jesus's answer below mentions, hash method should not be defined/overridden for a mutable object (defined by default and uses id for equality and comparison).Petitioner
@Miguel, I ran into the exact problem, what happens is the dictionary returns None once the key changes. The way I solved it was by storing the id of the object as a key instead of just the object.Jae
@JaswantP Python by default uses id of the object as the key for any hashable object.Petitioner
True that, but if the __hash__ method is redefined, then None would be returned if the object's attributes were changedJae
A
36

John Millikin proposed a solution similar to this:

class A(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))

The problem with this solution is that the hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?

Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states

The only required property is that objects which compare equal have the same hash value

Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.

Outdated/bad solution

The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:

class B(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))

Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).

Alexandrite answered 29/9, 2013 at 0:3 Comment(6)
You usually don't want to do a straight XOR the attributes together, as that will give you collisions if you change the order of the values. That is, hash(A(1, 2, 3)) will be equal to hash(A(3, 1, 2)) (and they'll both hash equal to any other A instance with a permutation of 1, 2 and 3 as its values). If you want to avoid your instance having the same hash as a tuple of their arguments, simply create a sentinel value (as either a class variable, or as a global) then include it in the tuple to be hashed: return hash((_sentinel, self._a, self._b, self._c))Instate
Your use of isinstance could be problematic, since an object of a subclass of type(self) can now be equal to an object of type(self). So you may find that adding a Car and a Ford to a set() may result in only one object inserted, depending on the order of insertion. Additionally, you may run into a situation where a == b is True but b == a is False.Doug
If you're subclassing B, you may want to change that to isinstance(othr, B)Alexandrite
A thought: the key tuple could include the class type, which would prevent other classes with the same key set of attributes from being shown to be equal: hash((type(self), self._a, self._b, self._c)).Ganges
Besides the point about using B instead of type(self), it's also often considered better practice to return NotImplemented when encountering an unexpected type in __eq__ instead of False. That allows other user-defined types to implement an __eq__ that knows about B and can compare equal to it, if they wish to.Hexosan
The XOR recommendation was removed from the docs, because it's a terrible way to combine hashes. It now recommends hashing a tuple, because it's by far the safest way to make a hash.Fulmis
S
18

Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that

for c in some_string:
    hash = 101 * hash  +  ord(c)

worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.

Springfield answered 26/5, 2010 at 1:5 Comment(5)
Apparently Java does it the same way but using 31 instead of 101Thurstan
What's the rationale behind using these numbers? Is there a reason to pick 101, or 31?Linkoski
Here's an explanation for prime multipliers: #3613602. 101 seems to work particularly well, based on Paul Larson's experiments.Springfield
Python uses (hash * 1000003) XOR ord(c) for strings with 32-bit wraparound multiplication. [Citation]Unpile
Even if this is true it is of no practical use in this context because the builtin Python string types already provide a __hash__ method; we don't need to roll our own. The question is how to implement __hash__ for a typical user-defined class (with a bunch of properties pointing to built-in types or perhaps to other such user-defined classes), which this answer doesn't address at all.Hexosan
D
6

A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:

class A:

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c

    def __hash__(self):
        return hash(tuple(self))

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))

(here __eq__ is not required for hash, but it's easy to implement).

Now add some mutable members to see how it works:

a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c))  # -5279839567404192660
dict(A(a, b, c))  # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c))  # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))

things only fall apart if you try to put non-hashable members in the object model:

hash(A(a, b, [1]))  # TypeError: unhashable type: 'list'
Dimerous answered 30/3, 2021 at 7:35 Comment(0)
B
3

I can try to answer the second part of your question.

The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.

In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!

Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try and branch different colliding entries out so that entries to cluster in one particular spot.

Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.

In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.

Here are some links if you're interested more:

coalesced hashing on wikipedia

Wikipedia also has a summary of various collision resolution methods:

Also, "File Organization And Processing" by Tharp covers alot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.

Beefburger answered 26/5, 2010 at 0:58 Comment(0)
I
3

A very good explanation on when and how implement the __hash__ function is on programiz website:

Just a screenshot to provide an overview: (Retrieved 2019-12-13)

Screenshot of https://www.programiz.com/python-programming/methods/built-in/hash 2019-12-13

As for a personal implementation of the method, the above mentioned site provides an example that matches the answer of millerdev.

class Person:
def __init__(self, age, name):
    self.age = age
    self.name = name

def __eq__(self, other):
    return self.age == other.age and self.name == other.name

def __hash__(self):
    print('The hash is:')
    return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))
Intertwine answered 13/12, 2019 at 11:14 Comment(0)
U
2

@dataclass(frozen=True) (Python 3.7)

This awesome new feature, among other good things, automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets without any need for tedious repetition:

dataclass_cheat.py

from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyClass1:
    n: int
    s: str

@dataclass(frozen=True)
class MyClass2:
    n: int
    my_class_1: MyClass1

d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6

assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6

# Due to `frozen=True` we can't modify objects.
o = MyClass1(n=1, s='a')
try:
    o.n = 2
except FrozenInstanceError as e:
    pass
else:
    raise 'error'

As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:

d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1

works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.

frozen=True is mandatory, otherwise the class is not hashable, otherwise it would make it possible for users to inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html

Tested on Python 3.10.7, Ubuntu 22.10.

Unchartered answered 2/2, 2023 at 13:54 Comment(2)
what overhead will decorating with dataclass add?Lurid
@NaveenReddyMarthala as opposed to manually defining __init__, __hash__, __eq__? Basically none, as it produces the same class I believe, except one single function call to create the class itself, which normally only runs once.Unchartered
A
0

Depends on the size of the hash value you return. It's simple logic that if you need to return a 32bit int based on the hash of four 32bit ints, you're gonna get collisions.

I would favor bit operations. Like, the following C pseudo code:

int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0 | (d & 0x000F000F);

Such a system could work for floats too, if you simply took them as their bit value rather than actually representing a floating-point value, maybe better.

For strings, I've got little/no idea.

Ameliaamelie answered 25/5, 2010 at 23:11 Comment(1)
I know that there will be collisions. But I have no clue how these are handled. And furhermore my attribute values in combination are very sparsely distributed so I was looking for a smart solution. And somehow I expected there to be a best practice somewhere.Thurstan

© 2022 - 2024 — McMap. All rights reserved.