Python: sharing huge dictionaries using multiprocessing

4

9

I'm processing a very large amount of data, stored in a dictionary, using multiprocessing. All I'm doing is loading some signatures into a dictionary, building a shared dict object out of it (the 'proxy' object returned by Manager.dict()), and passing this proxy as an argument to the function that has to be executed in multiprocessing.

Just to clarify:

signatures = dict()
load_signatures(signatures)
[...]
manager = Manager()
signaturesProxy = manager.dict(signatures)
[...]
result = pool.map(myfunction, [signaturesProxy] * NUM_CORES)

Now, everything works perfectly if signatures has fewer than 2 million entries or so. However, I have to process a dictionary with 5.8M keys (pickling signatures in binary format generates a 4.8 GB file). In this case, the process dies during the creation of the proxy object:

Traceback (most recent call last):
  File "matrix.py", line 617, in <module>
    signaturesProxy = manager.dict(signatures)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 634, in temp
    token, exp = self._create(typeid, *args, **kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 534, in _create
    id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 173, in handle_request
    request = c.recv()
EOFError
---------------------------------------------------------------------------

I know the data structure is huge but I'm working on a machine equipped w/ 32GB of RAM, and running top I see that the process, after loading the signatures, occupies 7GB of RAM. It then starts building the proxy object and the RAM usage goes up to ~17GB of RAM but never gets close to 32. At this point, the RAM usage starts diminishing quickly and the process terminates with the above error. So I guess this is not due to an out-of-memory error...

Any idea or suggestion?

Thank you,

Davide

Try answered 26/12, 2010 at 17:15 Comment(0)

-2

If the dictionaries are read-only, you don't need proxy objects on most operating systems.

Just load the dictionaries before starting the workers, and put them somewhere they'll be reachable; the simplest place is a module-level global. They'll be readable from the workers.

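from multiprocessing import Pool

# Module-level global; forked workers see whatever it holds at fork time.
buf = ""

def f(x):
    # Read-only access to the buffer inherited from the parent.
    buf.find("x")
    return 0

if __name__ == '__main__':
    # Build the 1 GB string before the pool forks its workers.
    buf = "a" * 1024 * 1024 * 1024
    pool = Pool(processes=1)
    result = pool.apply_async(f, [10])
    # print() with one argument works on both Python 2.6 and Python 3.
    print(result.get(timeout=5))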

This only uses 1GB of memory combined, not 1GB for each process, because any modern OS will make a copy-on-write shadow of the data created before the fork. Just remember that changes to the data won't be seen by other workers, and memory will, of course, be allocated for any data you change.

It will use some memory: the page of each object containing the reference count will be modified, so it'll be allocated. Whether this matters depends on the data.

This will work on any OS that implements ordinary forking. It won't work on Windows; its (crippled) process model requires relaunching the entire process for each worker, so it's not very good at sharing data.
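The same idea applied to a dictionary might look like the sketch below. It assumes Python 3.4+ on a platform that supports the fork start method (so not Windows). SIGNATURES, lookup and run are hypothetical names, and load_signatures here is only a dummy stand-in for the real loader from the question.

```python
import multiprocessing as mp

# Hypothetical stand-in for the question's signatures dictionary.
SIGNATURES = {}

def load_signatures(target, n=1000):
    """Fill `target` with dummy entries (placeholder for the real loader)."""
    for i in range(n):
        target[i] = i * i

def lookup(key):
    # Reads the module-level dict inherited from the parent at fork time.
    return SIGNATURES.get(key)

def run(keys, workers=2):
    # Load BEFORE the pool is created so the forked workers inherit the data.
    load_signatures(SIGNATURES)
    ctx = mp.get_context("fork")  # assumption: fork is available on this OS
    with ctx.Pool(processes=workers) as pool:
        return pool.map(lookup, keys)

if __name__ == "__main__":
    print(run([1, 2, 3]))
```

Only the keys and the results travel between processes here; the dictionary itself is never pickled, which is the step that failed with Manager.dict().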

Tetrasyllable answered 26/12, 2010 at 18:31 Comment(5)

Does this work with Windows 7 (which is definitely a modern OS)? – Labyrinth 8/4, 2012 at 20:35

@Seun: I don't know; try testing it. I doubt its process model is any more modern than previous versions; Windows has always been in the dark ages about that. – Tetrasyllable 10/4, 2012 at 15:36

I don't think multiprocessing uses copy-on-write. In my experience, the data will be duplicated in every subprocess, even if it's read-only. This post seems to confirm that: https://mcmap.net/q/167543/-multiprocessing-sharing-a-large-read-only-object-between-processes/5475 – Talia 24/4, 2013 at 10:54

Downvoted your answer but upvoted your comment (which I agree with!). :) – Talia 24/4, 2013 at 10:55

@Talia The python docs (quoted in one of the answers to that question) disagree with you and agree with Glenn. See docs.python.org/dev/library/… "Explicitly pass resources to child processes" – Barleycorn 22/3, 2017 at 11:13

6

Why don't you try this with a database? Databases are not limited to addressable/physical RAM and are safe for multithreaded/multiprocess use.
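As a rough sketch of what that could look like, here is one way to do it with the standard-library sqlite3 module, assuming string keys and values. The table name signatures and the helpers build_db and lookup are hypothetical, not something from the question. Each worker would open its own connection instead of sharing one across processes.

```python
import os
import sqlite3
import tempfile

def build_db(path, items):
    """Write (key, value) pairs into a single table indexed by key."""
    conn = sqlite3.connect(path)
    with conn:  # commits the transaction on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS signatures "
            "(key TEXT PRIMARY KEY, value TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO signatures VALUES (?, ?)", items
        )
    conn.close()

def lookup(path, key):
    """Open a connection in the calling process and fetch one value."""
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT value FROM signatures WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()

if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "signatures.db")
    build_db(db_path, [("a", "sig-a"), ("b", "sig-b")])
    print(lookup(db_path, "a"))
```

In a real job you would probably keep one connection open per worker rather than reconnecting on every lookup; it is opened per call here only to keep the sketch short.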

Rhyne answered 26/12, 2010 at 17:27 Comment(0)

2

In the interest of saving time and not having to debug system-level issues, maybe you could split your 5.8 million record dictionary into three sets of ~2 million each, and run the job 3 times.
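A minimal sketch of the splitting step, assuming the dictionary fits in memory once and you just want roughly equal pieces. split_dict is a hypothetical helper name.

```python
from itertools import islice

def split_dict(data, parts):
    """Yield up to `parts` smaller dicts that together hold every item."""
    size = -(-len(data) // parts)  # ceiling division
    items = iter(data.items())
    for _ in range(parts):
        chunk = dict(islice(items, size))
        if chunk:  # skip empty trailing chunks
            yield chunk

if __name__ == "__main__":
    signatures = {i: str(i) for i in range(7)}
    for chunk in split_dict(signatures, 3):
        print(len(chunk))
```

Each chunk could then be processed in its own run, at the cost of merging the results afterwards.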

Duchy answered 26/12, 2010 at 17:24 Comment(2)

I could but it's not an optimal solution as, anyways, in the end I'd have to reconstruct the whole dictionary and use it for other operations – Try 26/12, 2010 at 17:48

Then it sounds like your task would be appropriate for Hadoop/MapReduce... Maybe you should check that out. – Duchy 26/12, 2010 at 18:18

0

I think the problem you were encountering was the dict (hash table) resizing itself as it grows. Initially, the dict has a set number of buckets available. I'm not sure about Python, but I know Perl starts with 8, and when the buckets fill up the hash is rebuilt with twice as many (i.e. 8, 16, 32, ...).

The bucket is a landing location for the hash algorithm. The 8 slots do not mean 8 entries; they mean 8 memory locations. When a new item is added, a hash is generated for its key, and the item is stored in the corresponding bucket.

This is where collisions come into play. The more items that are in a bucket, the slower the function will get, because items are appended sequentially due to dynamic sizing of the slot.

One problem that may occur is that your keys are very similar and produce the same hash result, meaning a majority of keys land in one slot. Pre-allocating the hash buckets will help eliminate this and actually improve processing time and key management, plus the hash no longer needs to do all that swapping.

However, I think you are still limited to the amount of free contiguous memory and will eventually need to go to the database solution.

Side note: I'm still new to Python. I know in Perl you can see hash stats by doing print %HASHNAME; it will show your distribution of bucket usage. That helps you identify collision counts, in case you need to pre-allocate buckets. Can this be done in Python as well?
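As far as I know, CPython has no public way to print bucket statistics or to pre-size a dict, but the resizing itself can be observed indirectly. The sketch below assumes CPython and uses sys.getsizeof to record the entry counts at which the dict's reported size changes. resize_points is a hypothetical helper name.

```python
import sys

def resize_points(n):
    """Return (entry_count, size_in_bytes) pairs where the dict's size changed."""
    d = {}
    last = sys.getsizeof(d)
    points = []
    for i in range(n):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            # The reported size only changes when the table is reallocated.
            points.append((len(d), size))
            last = size
    return points

if __name__ == "__main__":
    for count, size in resize_points(100000):
        print(count, size)
```

The exact counts and sizes depend on the interpreter version, so treat the output as illustrative only.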

Rich

Welldone answered 21/2, 2012 at 17:43 Comment(0)
