How do Python dictionary hash lookups work?

How do Python dictionary lookup algorithms work internally?

mydi['foo'] 

If the dictionary has 1,000,000 terms, is a tree search performed? Should I expect performance to depend on the length of the key string, or on the size of the dictionary? Maybe stuffing everything into a dictionary is just as good as writing a tree-search index for strings of size 5 million?

Koski asked 7/7, 2011 at 2:45 Comment(2)
While I can see how Python dictionaries would work as described below, hashes in general are richer than this. One can imagine that this simple lookup will take a long time with a large dictionary. Perl hashes employ a system that essentially indexes entries by pooling hash elements by each character of the key. – Koski
see perl.com/pub/2002/10/01/hashes.html – Koski

Here's some pseudo-code closer to what actually happens. Imagine the dictionary has a data attribute containing the (key, value) pairs and a size attribute giving the number of cells allocated.

EMPTY = object()      # sentinel: this cell has never held a key
DELETED = object()    # sentinel: this cell held a key that was later removed
PERTURB_SHIFT = 5     # how many bits of the hash to consume per probe

def lookup(d, key):
    perturb = j = hash(key)
    while True:
        cell = d.data[j % d.size]
        if cell.key is EMPTY:
            raise KeyError(key)   # reached a never-used cell: the key is absent
        if cell.key is not DELETED and (cell.key is key or cell.key == key):
            return cell.value
        # collision: move to the next cell in the probe sequence
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT

The perturb value ensures that all bits of the hash code are eventually used when resolving hash collisions, but once it has been shifted down to 0 the (5*j)+1 recurrence on its own will eventually touch every cell in the table.

size is always larger than the number of cells actually in use (CPython keeps the table no more than about two-thirds full), so the probe sequence is guaranteed to eventually hit an empty cell when the key doesn't exist (and should normally hit one pretty quickly). There's also a DELETED sentinel for a key, to mark a cell that shouldn't terminate the search but isn't currently in use.
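
A quick way to see that the table is kept sparse is to watch a dict's memory footprint jump long before it looks "full". This is an illustrative, CPython-specific sketch; exact sizes vary by version and platform:

import sys

d = {}
last = sys.getsizeof(d)
for i in range(20):
    d[i] = None
    if sys.getsizeof(d) != last:       # the underlying table was resized
        last = sys.getsizeof(d)
        print(f"resized at {len(d)} entries -> {last} bytes")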

As for your question about the length of the key string: hashing a string looks at all of the characters in the string, but a string object also has a field used to store the computed hash. So if you use different strings every time to do the lookup, the string length may have a bearing; but if you have a fixed set of keys and re-use the same string objects, the hash won't be recalculated after the first time it is used. Python benefits from this because most name lookups involve dictionaries, and a single copy of each variable or attribute name is stored internally, so every time you access an attribute x.y there is a dictionary lookup but not a call to the hash function.
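
One way to see the cached hash at work is to time hash() on the same string object versus an equal but distinct one. This is only a rough sketch (CPython-specific; absolute timings will vary):

import timeit

key = "x" * 1_000_000
hash(key)   # the first call walks the whole string and caches the result on the object

# hashing the same object again just returns the stored value
print(timeit.timeit(lambda: hash(key), number=10_000))

# an equal but distinct string has no cached hash yet,
# so its first hash() call is proportional to its length
other = "x" * 1_000_000
print(timeit.timeit(lambda: hash(other), number=1))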

Typhoeus answered 7/7, 2011 at 7:39 Comment(1)
I'm giving you the checkmark as it's the most direct answer, not a link, even though everyone basically said the same thing. – Koski

As you mentioned in your title, dicts are hash tables. No tree searching is used. Looking up a key is a nearly constant time operation, regardless of the size of the dict.

You might find the answers to this question helpful: How are Python's Built In Dictionaries Implemented
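
As a rough way to convince yourself of the near-constant lookup time, here is a small timeit sketch (illustrative only; absolute numbers depend on the machine):

import timeit

small = {i: None for i in range(10)}
large = {i: None for i in range(1_000_000)}

# the lookup cost barely changes with the size of the dict
print(timeit.timeit(lambda: small[5], number=1_000_000))
print(timeit.timeit(lambda: large[987_654], number=1_000_000))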

Izard answered 7/7, 2011 at 2:47 Comment(5)
+1, but instead of saying "nearly constant", why not "amortized constant"? Is the worst case time constant? – Izmir
@Neil it's worst-case linear time, if you get a set of inputs that somehow all collide. However, even an adversary can't arrange that, because universal hashing defends against it. – Kling
"nearly constant" is English for "amortized constant"! :) – Izard
@bdares Universal hashes? I can only assume you're talking about perfect hashes, but they don't apply here. – Mayberry
Could you explain how looking up a key works and why it is (nearly) constant time? If a dict is a table of "hash, key, value" I do not see why one needs a "hash" column. Is the hash column sorted on the hash value? Why not sort the table on the "key" and search on the key? – Butterfish

Here's a good explanation: http://wiki.python.org/moin/DictionaryKeys

Pseudocode from above link:

def lookup(d, key):
    '''dictionary lookup is done in three steps:
       1. A hash value of the key is computed using a hash function.

       2. The hash value addresses a location in d.data which is
          supposed to be an array of "buckets" or "collision lists"
          which contain the (key,value) pairs.

       3. The collision list addressed by the hash value is searched
          sequentially until a pair is found with pair[0] == key. The
          return value of the lookup is then pair[1].
    '''
    h = hash(key)                    # step 1
    cl = d.data[h % len(d.data)]     # step 2: reduce the hash to a valid index
    for pair in cl:                  # step 3
        if key == pair[0]:
            return pair[1]
    raise KeyError("Key %s not found." % key)
Halona answered 7/7, 2011 at 2:48 Comment(2)
Seems like a lot of work, but it seems to be good enough for most applications. The keys are not really even sorted like you might want from a sorted index. Thanks, this is helpful. – Koski
Note that this Python code doesn't handle collisions the same way Python dicts do. Hash table implementations can differ in how they deal with collisions. – Izard

Hash lookups don't use trees. They use a hash table, and lookups take constant time. A hash table will take more space (on average, I believe, about twice as much) than a tree, but the lookup and insert times are a win.

To oversimplify: take an md5 of your key, mod it by the number of addresses you have, and that's where you store or look up the key. It doesn't matter how big the set is; it will always take the same amount of time as long as you don't have significant collisions, which a good hash function will avoid.
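
A toy version of that idea, using Python's built-in hash() instead of md5 and a made-up bucket count of 8 (purely illustrative; real dict internals differ):

table_size = 8
buckets = [[] for _ in range(table_size)]    # one "address" per slot

def put(key, value):
    buckets[hash(key) % table_size].append((key, value))

def get(key):
    # only the single bucket selected by the hash needs to be scanned
    for k, v in buckets[hash(key) % table_size]:
        if k == key:
            return v
    raise KeyError(key)

put('foo', 1)
put('bar', 2)
print(get('foo'))    # 1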

Kling answered 7/7, 2011 at 2:47 Comment(5)
I guess it was simpler this way for sane dictionary sizes. I guess I'll be building my own tree search after all... benchmarking against a hash lookup will probably make me look good if this is the case. – Koski
@Koski your real problem seems to be that you're trying to use in-memory data structure implementations for something that possibly doesn't fit comfortably into memory. I would suggest you use a DBMS. – Kling
@shigeta: why are you building your own tree search? You seem to be implying that your tree will go faster than a dict, but that is unlikely. Even with 5Mb strings, each string is only hashed once. – Izard
It's for very long hashes: once you have a few hundred million keys, size matters for a hash. Substrings in a genome, in this case. – Koski
@Koski you seem to have a misunderstanding of how a hash works. You could have an infinite number of keys, and the hash would still take constant time. Now, the key might be super-duper-long, and then you should consider that hashing takes time linear in the length of the key. But then again, so does a prefix tree lookup. – Kling

Answer 1: The internal workings are explained in this video

Answer 2: No, a tree search is not done if you have a million records in a dictionary.

Answer 3: On average, lookup time does not grow with the size of the dictionary; collisions are resolved in a handful of extra probes. Hashing the key does take time proportional to the length of the key string, although string keys cache their hash after it is first computed.

Answer 4: Consider the dictionary as an array (contiguous memory locations) in which many cells are deliberately left unused. Dictionaries therefore tend to use more memory than trees, but in exchange they usually give better run-time performance. Key collisions can sometimes degrade performance. You should read about Consistent Hashing.
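
For example, two distinct keys can share a hash value and the dictionary still keeps them apart by comparing the keys themselves (a CPython quirk: hash(-1) is reserved as an error marker in the C API, so -1 and -2 hash to the same value):

print(hash(-1), hash(-2))    # -2 -2: a genuine collision
d = {-1: 'a', -2: 'b'}
print(d[-1], d[-2])          # a b: both keys are still found correctly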

Demosthenes answered 7/7, 2011 at 6:7 Comment(0)
