How are Python's Built In Dictionaries Implemented?

Does anyone know how the built in dictionary type for python is implemented? My understanding is that it is some sort of hash table, but I haven't been able to find any sort of definitive answer.

Pase answered 29/11, 2008 at 7:35 Comment(0)

Edit:

This answer is for versions of Python earlier than 3.6. For Python 3.6 and later, see russia-must-remove-putin's answer below.

Original:

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know, but the answer is comprehensive).

  • Python dictionaries are implemented as hash tables.

  • Hash tables must allow for hash collisions, i.e. even if two distinct keys have the same hash value, the table's implementation must have a strategy to insert and retrieve the key-value pairs unambiguously.

  • Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).

  • A Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).

  • Each slot in the table can store one and only one entry. This is important.

  • Each entry in the table is actually a combination of three values: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).

  • The figure below is a logical representation of a Python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

      # Logical model of Python Hash table
      -+-----------------+
      0| <hash|key|value>|
      -+-----------------+
      1|      ...        |
      -+-----------------+
      .|      ...        |
      -+-----------------+
      i|      ...        |
      -+-----------------+
      .|      ...        |
      -+-----------------+
      n|      ...        |
      -+-----------------+
    
  • When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)

  • When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDict_MINSIZE - 1, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.

  • If that slot is empty, the entry is added to the slot (by entry, I mean <hash|key|value>). But what if that slot is occupied? Most likely because another entry has the same hash (a hash collision!).

  • If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean the == comparison, not the is comparison) of the entry in the slot against the hash and key of the entry to be inserted (dictobject.c:337,344-345). If both match, it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either the hash or the key doesn't match, it starts probing.

  • Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.

  • The same thing happens for lookups: it starts with the initial slot i (where i depends on the hash of the key). If the hash and the key of the entry in that slot don't both match, it starts probing until it finds a slot with a match. If all slots are exhausted, it reports a failure. (A toy sketch of this insert/lookup loop appears after the list.)

  • BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)
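To make the bullets above concrete, here is a minimal pure-Python sketch of an open-addressing table. It is an illustration, not CPython's implementation: the name ToyDict is made up, it uses simple linear probing instead of CPython's perturbed probing, and it replaces the value when a key is already present.

    # Toy open-addressing hash table, loosely following the scheme described above.
    # Illustrative sketch only; NOT CPython's actual implementation.
    EMPTY = object()  # sentinel marking an unused slot

    class ToyDict:
        def __init__(self):
            self._size = 8                      # CPython dicts also start with 8 slots
            self._table = [EMPTY] * self._size  # each slot holds a (hash, key, value) entry
            self._used = 0

        def _probe(self, h):
            """Yield slot indices to try, starting at hash & mask."""
            mask = self._size - 1
            i = h & mask
            while True:
                yield i
                i = (i + 1) & mask              # linear probing, for simplicity

        def __setitem__(self, key, value):
            h = hash(key)
            for i in self._probe(h):
                slot = self._table[i]
                if slot is EMPTY:                         # free slot: insert here
                    self._table[i] = (h, key, value)
                    self._used += 1
                    if self._used * 3 >= self._size * 2:  # at least 2/3 full: grow
                        self._resize()
                    return
                if slot[0] == h and slot[1] == key:       # same hash AND equal key
                    self._table[i] = (h, key, value)      # key exists: replace the value
                    return

        def __getitem__(self, key):
            h = hash(key)
            for i in self._probe(h):
                slot = self._table[i]
                if slot is EMPTY:                         # hit an empty slot: key is absent
                    raise KeyError(key)
                if slot[0] == h and slot[1] == key:
                    return slot[2]

        def _resize(self):
            old_entries = [s for s in self._table if s is not EMPTY]
            self._size *= 2
            self._table = [EMPTY] * self._size
            self._used = 0
            for _, key, value in old_entries:
                self[key] = value                         # re-insert into the bigger table

    d = ToyDict()
    d['spam'] = 1
    d['eggs'] = 2
    print(d['spam'], d['eggs'])   # 1 2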

NOTE: I did the research on the Python dict implementation in response to my own question about how multiple entries in a dict can have the same hash value. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.

Arenaceous answered 26/1, 2012 at 17:52 Comment(6)
You said, when both hash and the key match, it (insert op) gives up and moves on. Doesn't insert overwrite existing entry in this case?Walking
Thank you @Praveen for great explanation. I think it would be even better if you also provide an example for insertion, lookup and deletion in dictionary.Chaschase
@PraveenGollakota, thanks for the answer... a friend of mine raised the same question in today's CS live class while discussing dictionaries when he saw the error talking of unhashable types being passed as key values... I luckily found your answer and passed this to himMussel
Are key and value in an entry pointers to PyObjects (i.e. PyObject *)?Harty
Thanks very much. But the hash of the object is translated to an index, so obviously the hash stored at that index should match the object's hash. If the key doesn't match, it means there is a hash collision, and then probing is required.Fulvous
Isn't a dict resize very expensive?Tractate

How are Python's Built In Dictionaries Implemented?

Here's the short course:

  • They are hash tables. (See below for the specifics of Python's implementation.)
  • A new layout and algorithm, as of Python 3.6, makes them
    • ordered by key insertion, and
    • take up less space,
    • at virtually no cost in performance.
  • Another optimization saves space when dicts share keys (in special cases).

The ordered aspect is unofficial as of Python 3.6 (to give other implementations a chance to keep up), but official in Python 3.7.

Python's Dictionaries are Hash Tables

For a long time, it worked exactly like this. Python would preallocate 8 empty rows and use the hash to determine where to stick the key-value pair. For example, if the hash for the key ended in 001, it would stick it in the 1 (i.e. 2nd) index (like the example below.)

   <hash>       <key>    <value>
     null        null    null
...010001    ffeb678c    633241c4 # addresses of the keys and values
     null        null    null
      ...         ...    ...

Each row takes up 24 bytes on a 64-bit architecture, 12 on a 32-bit one. (Note that the column headers are just labels for our purposes here - they don't actually exist in memory.)

If the hash ended the same as a preexisting key's hash, this is a collision, and then it would stick the key-value pair in a different location.

After 5 key-value pairs are stored, adding another would make the probability of hash collisions too large, so the dictionary is doubled in size. In a 64-bit process, before the resize we have 72 bytes empty, and after it we are wasting 240 bytes due to the 10 empty rows.
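Spelled out as arithmetic (using the 24-byte row size of a 64-bit build):

    ENTRY_SIZE = 24                # bytes per <hash, key, value> row on a 64-bit build

    # Before the resize: 8 slots allocated, 5 of them filled.
    print((8 - 5) * ENTRY_SIZE)    # 72 bytes sitting empty

    # The 6th insertion triggers a resize to 16 slots; 6 rows are now in use.
    print((16 - 6) * ENTRY_SIZE)   # 240 bytes sitting empty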

This takes a lot of space, but the lookup time is fairly constant. The key comparison algorithm is: compute the hash and go to the expected location; compare the stored key's identity with the lookup key's - if they're the same object, they're equal. If not, compare the hash values - if those differ, the keys are not equal. Only then do we finally compare the keys for equality, and if they are equal, return the value. That final equality comparison can be quite slow, but the earlier checks usually short-circuit it, making lookups very quick.
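In rough Python-flavoured pseudocode (a sketch of the logic, not CPython's actual C code), that short-circuiting comparison looks like this:

    def keys_match(stored_hash, stored_key, lookup_hash, lookup_key):
        # Cheapest check first: the very same object is trivially equal.
        if stored_key is lookup_key:
            return True
        # Different hashes mean the keys cannot be equal.
        if stored_hash != lookup_hash:
            return False
        # Only now pay for a potentially slow __eq__ comparison.
        return stored_key == lookup_key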

Collisions slow things down, and an attacker could theoretically use hash collisions to perform a denial of service attack, so we randomized the initialization of the hash function such that it computes different hashes for each new Python process.
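You can see the per-process randomization by asking separate interpreter processes for the hash of the same string. (This assumes PYTHONHASHSEED is not pinned in your environment; note that int hashes are not randomized.)

    import subprocess
    import sys

    cmd = [sys.executable, "-c", "print(hash('spam'))"]
    print([subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
           for _ in range(3)])     # typically three different values
    print(hash(1) == 1)            # True: small int hashes are not randomized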

The wasted space described above has led us to modify the implementation of dictionaries, with an exciting new feature that dictionaries are now ordered by insertion.

The New Compact Hash Tables

We start, instead, by preallocating a small sparse array whose slots hold indices into a separate, dense table of entries.

Since our first key-value pair goes in the second slot, we index like this:

[null, 0, null, null, null, null, null, null]

And our table just gets populated by insertion order:

   <hash>       <key>    <value>
...010001    ffeb678c    633241c4 
      ...         ...    ...

So when we do a lookup for a key, we use the hash to find the expected position in the index array (in this case, we go straight to index 1 of that array), read the entry index stored there (here, 0), go to that row of the entry table, check that the keys are equal (using the same algorithm described earlier), and if so, return the value.

We retain constant lookup time, with minor speed losses in some cases and gains in others, with the upsides that we save quite a lot of space over the pre-existing implementation and we retain insertion order. The only space wasted is the null entries in the index array.
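A bare-bones sketch of that two-part layout (hypothetical class name, simple linear probing instead of CPython's perturbed probing, and no resizing) might look like this:

    # Sketch of the compact layout: a small sparse index array plus a dense,
    # insertion-ordered entry list. Illustrative only; not CPython's C code.
    class CompactToyDict:
        def __init__(self):
            self._indices = [None] * 8   # sparse: slot -> position in self._entries
            self._entries = []           # dense: (hash, key, value) in insertion order

        def _slot_for(self, key, h):
            # Resizing is omitted for brevity, so don't feed this toy more
            # than a handful of keys.
            mask = len(self._indices) - 1
            i = h & mask
            while True:
                pos = self._indices[i]
                if pos is None:                  # empty index slot
                    return i, None
                e_hash, e_key, _ = self._entries[pos]
                if e_hash == h and e_key == key:
                    return i, pos                # key already present
                i = (i + 1) & mask               # simple linear probing

        def __setitem__(self, key, value):
            h = hash(key)
            i, pos = self._slot_for(key, h)
            if pos is None:
                self._indices[i] = len(self._entries)     # record where the entry will live
                self._entries.append((h, key, value))
            else:
                self._entries[pos] = (h, key, value)      # existing key: replace the value

        def __getitem__(self, key):
            h = hash(key)
            _, pos = self._slot_for(key, h)
            if pos is None:
                raise KeyError(key)
            return self._entries[pos][2]

        def __iter__(self):
            # Iteration walks the dense array, i.e. insertion order.
            return (key for _, key, _ in self._entries)

    d = CompactToyDict()
    d['b'] = 2
    d['a'] = 1
    print(list(d))    # ['b', 'a'] -- insertion order, not sorted order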

Raymond Hettinger introduced this on python-dev in December of 2012. It finally got into CPython in Python 3.6. Ordering by insertion was considered an implementation detail for 3.6 to allow other implementations of Python a chance to catch up.

Shared Keys

Another optimization to save space is an implementation that shares keys. Thus, instead of having redundant dictionaries that take up all of that space, we have dictionaries that reuse the shared keys and keys' hashes. You can think of it like this:

     hash         key    dict_0    dict_1    dict_2...
...010001    ffeb678c    633241c4  fffad420  ...
      ...         ...    ...       ...       ...

For a 64 bit machine, this could save up to 16 bytes per key per extra dictionary.
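One way to poke at this from Python (exact numbers vary by Python version and platform, so treat this as a rough demonstration) is to compare the reported size of two instances' attribute dicts, which can share a single key table, with a plain copy that owns its own keys:

    import sys

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p, q = Point(1, 2), Point(3, 4)
    # Instance __dict__s of the same class can share one key/hash table...
    print(sys.getsizeof(p.__dict__), sys.getsizeof(q.__dict__))
    # ...while a plain dict with the same contents carries its own key table.
    print(sys.getsizeof(dict(p.__dict__)))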

Shared Keys for Custom Objects & Alternatives

These shared-key dicts are intended to be used for custom objects' __dict__. To get this behavior, I believe you need to finish populating your __dict__ before you instantiate your next object (see PEP 412). This means you should assign all your attributes in the __init__ or __new__, else you might not get your space savings.

However, if you know all of your attributes at the time your __init__ is executed, you could also provide __slots__ for your object, and guarantee that __dict__ is not created at all (if not available in parents), or even allow __dict__ but guarantee that your foreseen attributes are stored in slots anyways. For more on __slots__, see my answer here.
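A minimal sketch of the __slots__ variant:

    class PointWithSlots:
        __slots__ = ('x', 'y')    # fixed attribute layout; no per-instance __dict__

        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = PointWithSlots(1, 2)
    print(p.x, p.y)               # 1 2
    try:
        p.z = 3                   # attributes outside __slots__ are rejected
    except AttributeError as exc:
        print(exc)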


Babism answered 12/6, 2017 at 21:54 Comment(9)
You said "we", and "to allow other implementations of Python a chance to catch up" - does this mean you "know things" and that that might become a permanent feature? Is there any downside to dicts being ordered by spec?Unpremeditated
The downside to being ordered is that if dicts are expected to be ordered they can't easily switch to a better/faster implementation that isn't ordered. It seems unlikely that will be the case though. I "know things" because I watch lots of talks and read lots of things written by core members and others with a better real-world reputation than me, so even if I don't have an immediately available source to cite, I usually know what I'm talking about. But I think you can get that point from one of Raymond Hettinger's talks.Babism
You explained somewhat vaguely how insertion works ("If the hash ended the same as a preexisting key's hash, ... then it would stick the key-value pair in a different location" -- any?), but you didn't explain how lookup and membership test work. It is not quite clear how the location is determined by the hash either, but i suppose that the size is always a power of 2, and you take the last few bits of the hash...Ceric
@Ceric The last link I provide gives you the well-annotated dict implementation - where you can find the function that does this, currently on line 969, called find_empty_slot: github.com/python/cpython/blob/master/Objects/dictobject.c#L969 - and starting on line 134 there's some prose that describes it.Babism
You said: After 5 key-values are stored the dictionary is doubled in size. In a 64 bit process, before the resize, we have 72 bytes empty, and after, we are wasting 240 bytes due to the 10 empty rows. Why not 264? 8-5+8=11, 11*24=264Sabec
Clear explanation. Thanks. Could you confirm that Python still uses linear probing for compact dicts in case of collision?Exscind
@Sabec No, he's correct. After doubling in size, the table consists of (8*2)*24=384 bytes, 6 rows have been populated (6*24=144), so 384-144=240 bytes are empty.Exscind
To say that dictionaries remember the insertion order is extremely misleading. While this might be true in terms of how they are implemented, it is irrelevant because this implementation is hidden behind an API and not accessible to the client (user). For example: If I insert 5 items into a dictionary, how can I query the insertion order? The answer is you can't. Therefore it is wrong to say that dictionaries remember their insertion order - they don't. At least not as far as the API is concerned.Sholapur
From the dict docs: > Dictionaries preserve insertion order. Note that updating a key does not affect the order. Keys added after deletion are inserted at the end. [...] Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6.Babism

Python dictionaries use open addressing (see the reference inside Beautiful Code).

NB! Open addressing, a.k.a. closed hashing, should, as noted on Wikipedia, not be confused with its opposite, open hashing!

Open addressing means that the dict uses array slots, and when an object's primary position is taken in the dict, the object's spot is sought at a different index in the same array, using a "perturbation" scheme in which the object's hash value plays a part.
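The probe sequence is described in the big comment block at the top of dictobject.c; roughly paraphrased in Python (the exact constants and ordering are an implementation detail, so treat this as a sketch, not a spec):

    PERTURB_SHIFT = 5

    def probe_sequence(h, mask):
        """Yield the slot indices that CPython-style perturbed probing would
        visit for an item with hash h, in a table where mask == size - 1."""
        perturb = h
        i = h & mask
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask   # recurrence from dictobject.c

    # First few slots probed for one hash value in an 8-slot table:
    seq = probe_sequence(hash('spam'), mask=7)
    print([next(seq) for _ in range(5)])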

Belgrade answered 8/6, 2010 at 11:0 Comment(1)
"not be confused with its opposite open hashing! (which we see in the accepted answer)." - I'm not sure which answer was accepted when you wrote that, or what that answer said at the time - but this parenthesised comment is not currently true of the accepted answer and would best be removed.Fastness

Python's dict now maintains two arrays. One is a sparse index array. This is what happens when the first element is inserted into a dict:

  • you insert a key and a value
  • the dict computes the hash of the key
  • the dict maps that hash to a slot in the sparse index array
  • in that slot, the number zero is stored (since this is the first entry)

The second array is a dense array of entries. This is what happens there:

  • at index zero, the entry is stored

Thus the second array is compact and memory efficient. For subsequent inserts, the insertion index into the second array is simply incremented; this way memory is saved and insertion order is maintained. Hash collisions can occur when storing an index in the first, sparse array. They are taken care of by pseudo-random probing, i.e. an algorithm that looks further ahead in the array for an empty slot in a predictable but pseudo-random way. The second array is always compact.
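The user-visible consequence of that dense second array is simply that iteration follows insertion order (guaranteed by the language since Python 3.7):

    d = {}
    d['first'] = 1
    d['second'] = 2
    d['third'] = 3
    print(list(d))    # ['first', 'second', 'third']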
Fulvous answered 9/8, 2023 at 17:25 Comment(0)
