What is the default __hash__ in Python?

I quite often use funky stuff as keys for dictionaries, and therefore I am wondering what the right way to do it is - and this goes through implementing good hash methods for my objects. I am aware of other questions asked here, like good way to implement hash, but I'd like to understand how the default __hash__ works for custom objects, and whether it is possible to rely on it.

I have noticed that mutables are explicitly unhashable, since hash({}) raises an error... but strangely, custom classes are hashable:

>>> class Object(object): pass
>>> o = Object()
>>> hash(o)

So, does anybody know how this default hash function works? By understanding this, I'd like to know:

Can I rely on this default hash if I put objects of the same type as keys of a dictionary? e.g.:

key1 = MyObject()
key2 = MyObject()
key3 = MyObject()
{key1: 1, key2: 'blabla', key3: 456}

Can I rely on it if I use objects of different types as keys in a dictionary? e.g.:

{int: 123, MyObject(10): 'bla', 'plo': 890}

And in the last case, how can I make sure that my custom hashes don't clash with the built-in hashes? e.g.:

{int: 123, MyObject(10): 'bla', MyObjectWithCustomHash(123): 890}
Feeler answered 4/7, 2012 at 7:24 Comment(4)
https://mcmap.net/q/108582/-what-39-s-a-correct-and-good-way-to-implement-__hash__ – Contravention
@gnibbler: got that already - see the link in the question – Feeler
Unrelated, but a good point to note: "If you're going to override __hash__, override __eq__ as well." – Oriental
User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y). docs.python.org/3/reference/datamodel.html#object.__hash__ – Paolapaolina

What you can rely on: custom objects have a default hash() that is based in some way on the identity of the object. That is, any object using the default hash will have a constant hash value over its lifetime, and different objects may or may not have different hash values.

You cannot rely on any particular relationship between the value returned by id() and the value returned by hash(). In the standard C implementation of Python 2.6 and earlier they were the same; in Python 2.7-3.2, hash(x) == id(x)/16.

Edit: originally I wrote that in releases 3.2.3 and later, or 2.7.3 and later, the hash value may be randomised, and that in Python 3.3 the relationship would always be randomised. In fact that randomisation at present only applies to hashing strings, so the divide-by-16 relationship may continue to hold for now, but don't bank on it.
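
You can see the string-only randomisation from Python itself; a small sketch, assuming CPython 3.3 or later and that PYTHONHASHSEED is not pinned in the environment:

import subprocess, sys

# Two fresh interpreters will almost always disagree on a string's hash,
# because string hashing is seeded per process.
runs = [subprocess.check_output([sys.executable, '-c', "print(hash('abc'))"])
        for _ in range(2)]
print(runs[0] != runs[1])   # typically True for strings; hash(42) would print 42 in every run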

Hash collisions don't usually matter: for a dictionary lookup to find an object, it must have the same hash and must also compare equal. Collisions only matter if you get a very high proportion of them, such as in the denial of service attack that led to recent versions of Python being able to randomise the hash calculation.
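
A minimal sketch illustrating those two guarantees, reusing the names from the question (the foo attribute is just for illustration):

class MyObject(object):
    pass

key1, key2 = MyObject(), MyObject()
d = {key1: 1, key2: 'blabla'}

h = hash(key1)
key1.foo = 'mutated'        # changing the object's state...
assert hash(key1) == h      # ...does not change its default hash
assert d[key1] == 1         # so lookups keep working
assert key1 != key2         # distinct instances always compare unequal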

Cameron answered 4/7, 2012 at 7:57 Comment(1)
FWIW: the low-order bits are rolled rather than truncated in 3.3 and up - at least as far as 3.12. – Eberhardt

In Python 3, the following function (from pyhash.c) is applied to the id() of the object for instances of classes that inherit the default __hash__ from object:

Py_hash_t
_Py_HashPointer(void *p)
{
    Py_hash_t x;
    size_t y = (size_t)p;
    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
    x = (Py_hash_t)y;
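    /* -1 is reserved to signal an error to callers, so it must never
       be returned as a valid hash value */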
    if (x == -1)
        x = -2;
    return x;
}

SIZEOF_VOID_P is 8 for 64-bit Python and 4 for 32-bit Python.

>>> class test: pass
...
>>> a = test()
>>> id(a)
4325845928
>>> hash(a)
-9223372036584410438

You can see that the hash is calculated from id(a) using the formula (id(a) >> 4) | (id(a) << (8 * SIZEOF_VOID_P - 4)), where the bitwise operations are performed on the unsigned pointer value (a size_t) and the result is then reinterpreted as a signed C integer (Py_hash_t). For example, for the a defined above:

>>> import numpy
>>> y = numpy.array([4325845928], dtype='int64')
>>> SIZEOF_VOID_P = 8
>>> (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4))
array([-9223372036584410438])

Note that I am using numpy.array(dtype='int64') so that the bitwise operations act the same way they would in C (if you perform the same operations on Python ints you get different behavior, because they don't overflow). See https://mcmap.net/q/260604/-bitwise-operations-in-python.
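
If you would rather avoid numpy, here is a pure-Python sketch of the same computation; the function name and the masking trick are mine, and it assumes a 64-bit build (SIZEOF_VOID_P == 8):

def default_hash_from_id(obj_id, pointer_size=8):
    bits = 8 * pointer_size            # 64 on a 64-bit build
    mask = (1 << bits) - 1
    # Rotate the unsigned pointer value right by 4 bits, as
    # _Py_HashPointer does; masking emulates C's fixed-width arithmetic.
    y = ((obj_id >> 4) | (obj_id << (bits - 4))) & mask
    # Reinterpret the unsigned result as a signed Py_hash_t.
    x = y - (1 << bits) if y >= (1 << (bits - 1)) else y
    return -2 if x == -1 else x        # -1 is reserved for errors

print(default_hash_from_id(4325845928))   # -9223372036584410438, as above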

Wing answered 6/11, 2015 at 17:31 Comment(5)
According to Duncan: "In Python 3.3 there won't even be a fixed relationship between id() and hash()." – Uncinus
@PiotrDobrogost there is a fixed relationship. It's (id(x) >> 4) | (id(x) << (8 * SIZEOF_VOID_P - 4)). The code I pasted here is taken from the Python 3 source. p (the input to the _Py_HashPointer function) is the memory address of the object, i.e., its id(). Run SIZEOF_VOID_P = 8; y = numpy.array([4325845928], dtype='int64'); print((y >> 4) | (y << (8 * SIZEOF_VOID_P - 4))). The result is -9223372036584410438, which corresponds to the example I showed above. – Wing
I think Duncan meant the hash randomisation introduced in Python 3.3. However, it's currently only active for strings, and the code you show is presumably for the general case. – Uncinus
size_t is unsigned in C (not signed), so as the comment remarks, this is a roll operation. – Kaddish
As of 3.12 it still works this way, although it's been refactored. This is the default implementation found in the base object type; any other type (such as str) is free to override it. – Eberhardt

The documentation states that custom objects rely on id() as their hash() implementation:

CPython implementation detail: This is the address of the object in memory.

If you mix custom objects with builtin types like int, there might be hash collisions, but that's no problem at all if they are evenly distributed. Don't investigate too much unless you really hit a performance problem.
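
To see why collisions are only a performance concern, here is a small sketch (the Collider class is made up) whose hash deliberately collides with hash(1); the equality check still keeps the two keys apart:

class Collider(object):
    def __hash__(self):
        return 1                 # deliberately the same as hash(1)
    def __eq__(self, other):
        return self is other     # identity-based equality, like the default

d = {1: 'int key', Collider(): 'collider key'}
print(d[1])                      # 'int key': colliding hashes are resolved by ==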

Briquette answered 4/7, 2012 at 7:33 Comment(7)
so, do you mean that if I use only custom types, there shouldn't be collisions? – Feeler
Right, the id is unique. The thing with other types is that they don't necessarily use id() but often a more reasonable hash value; for example, ints use just their value as their hash value. – Jovitta
So: {int: 123, MyObject(): 465, MyType: 890} should be safe, right? – Feeler
Also... let me say once again that despite what the doc says, in Python 2.7 id(custom_obj) != hash(custom_obj) – Feeler
They are "safe", although there might be hash collisions. It's just a performance concern. To your question: no, hash collisions may occur if your keys aren't only custom object instances. – Briquette
So now, last part of the question: do you know a way to build a custom hash method which doesn't collide with the default id-based hash of a custom type? I have a dict in which keys are either classes or instances of a custom type, e.g.: {int: 123, MyObjectWithCustomHash('bla'): 456, SomeCustomType: 890} – Feeler
@sebpiq: This shouldn't be a problem. All objects in Python are "first-class", and thus have a separate id, including classes, instances of classes, and custom types. – Diphase

The default hash for user-defined classes is derived from the object's id. This gives behaviour that is often useful: using an instance of a user-defined class as a dictionary key allows the associated value to be retrieved when exactly the same object is provided again to look up the value. e.g.:

>>> class Foo(object):
    def __init__(self, foo):
        self.foo = foo


>>> f = Foo(10)
>>> d = {f: 10}
>>> d[f]
10

This matches the default equality of user-defined classes:

>>> g = Foo(10)
>>> f == g
False
>>> d[g]

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    d[g]
KeyError: <__main__.Foo object at 0x0000000002D69390>

Note that even though f and g have the same values for their attributes, they are not equal and looking up g in d doesn't find the value stored under f. Furthermore, even if we change the value of f.foo, looking up f in d still finds the value:

>>> f.foo = 11
>>> d[f]
10

The assumption is that instances of some arbitrary new class should be treated as non-equivalent, unless the programmer specifically declares the conditions for two instances to be treated as equivalent by defining __eq__ and __hash__.

And this pretty much works; if I define a Car class, I probably consider two cars with identical attributes to be representing two different cars. If I have a dictionary mapping cars to registered owners, I don't want to find Alice when I look up Bob's car, even if Alice and Bob happen to own identical cars! OTOH, if I define a class to represent postal codes, I probably do want to consider two different objects with the same code to be interchangeable representations of "the same" thing, and in this case if I had a dictionary mapping postal codes to states, I would clearly want to be able to find the same state with two different objects representing the same post code.

I refer to this as the difference between "value types" and "object types". Value types represent some value, and it's the value I care about, not each individual object's identity. Two different ways of coming up with the same value are equally good, and the "contract" of code passing around value types usually just promises to give you an object with some value, without specifying which particular object it is. For object types OTOH, each individual instance has its own identity, even if it contains exactly the same data as another instance. The "contract" of code passing around object types usually promises to keep track of the exact individual objects.
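
A minimal sketch of such a "value type" (the PostalCode class and the data in it are made up for illustration): defining __eq__ and __hash__ in terms of the contained value makes two distinct objects carrying the same value interchangeable as dictionary keys:

>>> class PostalCode(object):
    def __init__(self, code):
        self.code = code
    def __eq__(self, other):
        return isinstance(other, PostalCode) and self.code == other.code
    def __hash__(self):
        return hash(self.code)    # hash by value, consistent with __eq__


>>> states = {PostalCode('12345'): 'SomeState'}
>>> states[PostalCode('12345')]   # a *different* object finds the entry
'SomeState'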

So why don't the built-in mutable classes use their id as their hash? It's because they're all containers, and we usually consider containers to be mostly like value types, with their value determined by the contained elements:

>>> [1, 2, 3] == [1, 2, 3]
True
>>> {f: 10} == {f: 10}
True

But mutable containers have a value that is transient. Some given list currently has the value [1, 2, 3], but it can be mutated into having the value [4, 5, 6]. If you could use lists as dictionary keys, then we'd have to make a ruling on whether lookup should use the list's (current) value, or its identity. Either way we can be (very) surprised when the value of an object currently being used as a dictionary key is changed by mutating it. Using objects as dictionary keys only works well when the object's value is its identity, or when an object's identity is irrelevant to its value. So the answer chosen by Python is to declare mutable containers unhashable.
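
That ruling is easy to see in the interpreter: two distinct tuples with the same value hash the same, while the equivalent list refuses to hash at all (traceback abridged):

>>> hash((1, 2, 3)) == hash(tuple([1, 2, 3]))
True
>>> hash([1, 2, 3])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'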


Now, more specific details in answer to your direct questions:

1) Since this default hash in CPython maps to the object's memory address (directly only in 2.6 and earlier, according to other answers/comments; later versions derive it from the address), in CPython no two objects using default hashing that are both live at the same time can possibly clash on their hash values, regardless of the classes involved (and if an object is being stored as a dictionary key, it is live). I would also expect that other Python implementations that don't use memory addresses as hashes should still have fine hash distributions among objects using the default hashing. So yes, you can rely on it.

2) So long as your custom hash doesn't return a result that is exactly the hash of some existing object, you should be relatively fine. My understanding is that Python's hash-based containers are relatively tolerant of sub-optimal hash functions, so long as they're not completely degenerate.
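
To illustrate that tolerance (the Degenerate class is made up): even a constant __hash__ keeps dictionaries correct, because the default identity-based __eq__ still tells the instances apart; every colliding lookup just degrades to a linear scan:

class Degenerate(object):
    def __hash__(self):
        return 0    # every instance collides with every other

d = {Degenerate(): i for i in range(3)}
print(len(d))       # 3: correctness is preserved despite total collision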

Caenogenesis answered 4/7, 2012 at 8:5 Comment(0)

>>> class C(object):
...     pass
... 
>>> c = C()
>>> hash(c) == id(c)
True

See the built-in id() function.

Shamikashamma answered 4/7, 2012 at 7:30 Comment(4)
I get False on Python 2.7 and 3.2, but True on Python 2.6. – Johny
Older versions of CPython just used the value of id() directly for the default hash(); newer versions use id()/16, because in CPython all ids are a multiple of 16 and you want variation in the low bits. This is purely an implementation detail: the default hash() is generated from id(), but exactly how changes between releases. In Python 3.3 there won't even be a fixed relationship between id() and hash(). – Cameron
@Cameron Why divide by 16? Why are all ids a multiple of 16, and why do you want the low bits to vary? – Broads
Python objects are aligned to the memory word size, so for a 64-bit interpreter that means each object is aligned to a multiple of 16. The id is simply the memory address, so all ids are a multiple of 16. This matters because when you get a hash collision, a new hash is calculated from the old one using the lowest 5 bits; if 4 of those are always 0, then there is a high chance that you get another collision. – Cameron

>>> class C(object):
...     pass
... 
>>> c = C()
>>> hash(c) == id(c)
False
>>> hash(c) == id(c)/16
True

Dividing by 16 gives True.

Lissa answered 27/8, 2015 at 9:17 Comment(1)
Duplicating an answer posted 3 years before you is hardly useful. – Aurora
