How to set initial size for a dictionary in Python?
I'm putting around 4 million different keys into a Python dictionary. Creating this dictionary takes about 15 minutes and consumes about 4GB of memory on my machine. After the dictionary is fully created, querying the dictionary is fast.

I suspect that dictionary creation is so resource-intensive because the dictionary is rehashed very often as it grows. Is it possible to create a dictionary in Python with an initial size or bucket count?

My dictionary points from a number to an object.

class MyObject:
    def __init__(self):
        # some fields...
        pass

d = {}
d[i] = MyObject()  # 4M times on different keys...
Rebarebah answered 19/8, 2009 at 9:6 Comment(5)
Very similar to #312275Fillmore
Can you let us know the source / format of your keys, so we can improve the answers?Hedley
At the end I learn that my performance problem was NOT coming from dict initialization. Using slots solved the problems, see: #1337291Rebarebah
You should use a database and pull data from the DB into a cache as neededCollazo
The memory consumption would depend on the objects themselves being stored (keys and values), not just the dict structure itself. Frequent rehashing wouldn't generally matter to the final memory consumption because when the dict resizes, the old table is garbage-collected. But if deletions occur, it's possible to end up with a dict using a much larger hash table than necessary.Harim
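As the asker's follow-up comment above notes, the real fix turned out to be __slots__, not dict pre-sizing. A minimal Python 3 sketch of the idea (the field names here are illustrative, not from the original code):

```python
class MyObject:
    # __slots__ replaces the per-instance __dict__ with fixed storage,
    # which sharply cuts memory when millions of instances exist.
    __slots__ = ("field_a", "field_b")  # hypothetical field names

    def __init__(self, a, b):
        self.field_a = a
        self.field_b = b

# Build a small dict of these objects; at 4M entries the memory
# savings over a regular class are substantial.
d = {i: MyObject(i, i * 2) for i in range(1000)}
print(d[1].field_b)  # 2
```

Note that slotted instances have no per-instance __dict__, which is exactly where the savings come from.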
44

With performance issues it's always best to measure. Here are some timings:

 d = {}
 for i in xrange(4000000):
     d[i] = None
 # 722ms

 d = dict(itertools.izip(xrange(4000000), itertools.repeat(None)))
 # 634ms

 dict.fromkeys(xrange(4000000))
 # 558ms

 s = set(xrange(4000000))
 dict.fromkeys(s)
 # Not including set construction 353ms

The last option doesn't do any resizing: it just copies the hashes from the set and increments reference counts. As you can see, the resizing isn't taking much time. It's probably your object creation that is slow.
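For reference, a scaled-down Python 3 version of the same measurements (`xrange` and `itertools.izip` above are Python 2; in Python 3, `range` and `zip` are already lazy):

```python
import itertools
import timeit

N = 400_000  # 10x smaller than the original 4M, just to keep the demo quick

def loop_fill():
    # Plain loop: the dict resizes repeatedly as it grows.
    d = {}
    for i in range(N):
        d[i] = None
    return d

def zip_repeat():
    # Build from an iterator of (key, None) pairs.
    return dict(zip(range(N), itertools.repeat(None)))

def from_keys():
    # dict.fromkeys: all values default to None.
    return dict.fromkeys(range(N))

for fn in (loop_fill, zip_repeat, from_keys):
    print(fn.__name__, timeit.timeit(fn, number=3))
```

All three produce an identical dict; only the construction path differs.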

Aureliaaurelian answered 19/8, 2009 at 10:5 Comment(1)
It does not matter how I initialize the dictionary; filling it with data always takes a lot of time. Looks like indeed all the time is spent on object creation. Thanks!Rebarebah
11

I tried:

a = dict.fromkeys(range(4000000))

It creates a dictionary with 4,000,000 entries in about 3 seconds. After that, setting values is really fast. So I guess dict.fromkeys is definitely the way to go.
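A scaled-down Python 3 sketch of this approach, plus one caveat about fromkeys worth knowing:

```python
a = dict.fromkeys(range(400_000))   # all values start as None
a[42] = "hello"                     # fast: assigning to an existing key
                                    # never triggers a resize

# Caveat: fromkeys shares ONE value object across every key, so a
# mutable default like [] is almost always a bug:
shared = dict.fromkeys(["a", "b"], [])
shared["a"].append(1)
print(shared["b"])  # [1] -- both keys reference the same list
```

Use None or another immutable default, and fill in real values afterwards.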

Hedley answered 19/8, 2009 at 9:24 Comment(3)
+1 for mentioning dict.fromkeys(). However, using range() to specify keys means that you end up with a dict of sequential keys. If that's what is required, why not just use a list? a = [None]*4000000Grandaunt
That was not a direct solution, just a demonstration that you can use fromkeys to pre-generate the dict in a very short time.Hedley
In line with the point @ShawnChin raises, what if you don't want numbers 1...4M as keys? Or in more general terms, what if you don't know your keys in advance, but you just know that they are in millions?Predecease
8

If you know C, you can take a look at dictobject.c and the Notes on Optimizing Dictionaries. There you'll notice the parameter PyDict_MINSIZE:

PyDict_MINSIZE. Currently set to 8.

This parameter is defined in dictobject.h, so you could change it when compiling Python, but that is probably a bad idea.

Tyra answered 19/8, 2009 at 9:22 Comment(0)
4

You can try to separate key hashing from content filling with the dict.fromkeys class method. It'll create a dict of a known size with all values defaulting to None or a value of your choice. After that you can iterate over it to fill in the values, which lets you time the actual hashing of all keys separately. Not sure if you'd be able to significantly increase the speed, though.
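A small Python 3 sketch of this two-phase approach (the key list here is made up for illustration):

```python
# Phase 1: build the dict at its final size; every key is hashed once here.
keys = [f"key{i}" for i in range(100_000)]   # stand-ins for the real 4M keys
d = dict.fromkeys(keys, 0)                   # immutable default is safe to share

# Phase 2: fill in the real values. Since every key already exists,
# these are pure overwrites: no insertions, so no rehashing occurs.
for k in keys:
    d[k] = len(k)

print(d["key0"], d["key99999"])  # 4 8
```

Timing the two phases separately shows whether key hashing or value construction dominates.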

Ewe answered 19/8, 2009 at 9:14 Comment(0)
2

If your data needs to be (or can be) stored on disk, perhaps you can store it in a BSDDB database, or use cPickle to load/store your dictionary.
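In modern Python 3, bsddb is gone and cPickle is simply pickle; the stdlib shelve module combines the two ideas, giving a dict-like object whose values are pickled to disk. A minimal sketch:

```python
import os
import shelve
import tempfile

# A throwaway path for the demo; use a real location in practice.
path = os.path.join(tempfile.mkdtemp(), "cache")

with shelve.open(path) as db:
    db["12345"] = {"name": "example"}   # shelve keys must be strings

with shelve.open(path) as db:           # reopen: the data survived on disk
    print(db["12345"]["name"])          # example
```

This trades lookup speed for memory: entries live on disk and are unpickled on access, so it only pays off when the dict doesn't fit comfortably in RAM.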

Grumble answered 19/8, 2009 at 9:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.