More efficient way to look up dictionary values whose keys start with same prefix
Asked Answered
M

3

6

I have a dictionary whose keys come in sets that share the same prefix, like this:

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

Given a query string, I need to look up all the values associated with keys that start with that prefix, e.g. for query="key1" I need to get ["valA", "valB", "valC"]

My implementation below works but is too slow for a large number of queries since the dictionary d has about 30,000 keys and most of the keys are more than 20 characters long:

result = [d[s] for s in d.keys() if s.startswith(query)]

Is there a faster/more efficient way to implement this?

Millepore answered 5/8, 2015 at 19:33 Comment(4)
I don't think so - keys are hashed, which means that keys of key1 and key12 may have totally different representations in the lookup table.Iphagenia
@PadraicCunningham Unless it's Python3 then keys actually returns an iterator if I'm not wrong.. just like Python2's ikeysAlkalinize
If you want efficiency then go for a database engine (Mysql for example) these are designed to retrieve data as efficiently as possible.Grayback
let us know what approach is faster in your case -- you can post your own answer with the comparisonManoff
B
9

You can avoid producing the intermediate list generated by dict.keys() (in python 2.x):

result = [d[key] for key in d if key.startswith(query)]

But you most likely want to use a trie instead of a dictionary, so you can find all the values associated with a key with a common prefix (a trie is similar to a tree based on prefixes).

Here you can find some different implementation of tries.

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn". (source wikipedia)


Let's compare the timings for the different solutions:

# create a dictionary with 30k entries
d = {str(x):str(x) for x in xrange(1, 30001)}
query = '108'

# dict with keys()
%timeit [d[s] for s in d.keys() if s.startswith(query)]

    100 loops, best of 3: 8.87 ms per loop

# dict without keys()
%timeit [d[s] for s in d if s.startswith(query)]

    100 loops, best of 3: 7.83 ms per loop

# 11.72% improvement

# PyTrie (https://pypi.python.org/pypi/PyTrie/0.2)
import pytrie
pt = pytrie.Trie(d)

%timeit [pt[s] for s in pt.iterkeys(query)]

    1000 loops, best of 3: 320 µs per loop

# 96.36% improvement

# datrie (https://pypi.python.org/pypi/datrie/0.7)
import datrie
dt = datrie.Trie('0123456789')
for key, val in d.iteritems():
    dt[unicode(key)] = val

%timeit [dt[s] for s in dt.keys(unicode(query))]

    10000 loops, best of 3: 162 µs per loop

# 98.17% improvement
Bobwhite answered 5/8, 2015 at 19:36 Comment(3)
In the dict approach, I would stick with deferring the value access until the key is confirmed to be valid (i.e., [d[s] for s in d if s.startswith(...)]).Lukasz
@chepner: yes, I think it could be more efficient, especially if, as expected, there are not many matches.Bobwhite
Your comparison does not include the trie construction time or how much memory does it require. Could add the comparison with SuffixTree.SubstringDict(b'0123456789')? Here's comparison for a similar problemManoff
C
0

The sortedContainers lib has a SortedDict implementation, once you have sorted dict you can bisect_left to find where to start, bisect_right to find the last position then use irange to get the keys in the range:

from sortedcontainers import SortedDict
from operator import itemgetter
from itertools import takewhile


d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
  "key2":"valD", "key2-22":"valE","key3":"foo" }

key = "key2"
d = SortedDict(sorted(d.items(), key=itemgetter(0)))
start = d.bisect_left(key)
print([d[key] for key in takewhile(lambda x: x.startswith("key2"), d.irange(d.iloc[start]]))
['valD', 'valE']

Once you maintain a sorteddict using the sorteddict is a lot more efficint:

In [68]: l = ["key{}".format(randint(1,1000000)) for _ in range(100000)] 
In [69]: l.sort()    
In [70]: d = SortedDict(zip(l,range(100000)))

In [71]: timeit [d[s] for s in d.keys() if s.startswith("key2")]
10 loops, best of 3: 124 ms per loop

In [72]: timeit [d[s] for s in d if s.startswith("key2")]
10 loops, best of 3: 24.6 ms per loop

In [73]: %%timeit
key = "key2"
start = d.bisect_left(key)
l2 =[d[k] for k in takewhile(lambda x: x.startswith("key2"),d.irange(d.iloc[start]))]
   ....: 

100 loops, best of 3: 5.57 ms per loop
Chemo answered 5/8, 2015 at 20:12 Comment(0)
M
0

You could use a suffix tree:

#!/usr/bin/env python2
from SuffixTree import SubstringDict # $ pip install https://github.com/JDonner/SuffixTree/archive/master.zip

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

a = '\n' # anchor
prefixes = SubstringDict()
for key, value in d.items(): # populated the tree *once*
    prefixes[a + key] = value # assume there is no '\n' in key

for query in ["key1", "key2"]: # perform queries
    print query, prefixes[a + query]

Output

key1 ['valC', 'valA', 'valB']
key2 ['valE', 'valD']
Manoff answered 6/8, 2015 at 4:1 Comment(1)
The author of this particular suffix tree has declared his project dead, with the comment "there are plenty of other suffix tree implementations in Python". Trouble is, the ones I've tried so far don't expose a convenient dict-like object.Lateritious

© 2022 - 2024 — McMap. All rights reserved.