What can multiprocessing and dill do together?

I would like to use the multiprocessing library in Python. Sadly, multiprocessing uses pickle, which doesn't support functions with closures, lambdas, or functions defined in __main__. All three of these are important to me:

In [1]: import pickle

In [2]: pickle.dumps(lambda x: x)
PicklingError: Can't pickle <function <lambda> at 0x23c0e60>: it's not found as __main__.<lambda>

Fortunately, there is dill, a more robust pickler. Apparently dill performs some magic on import to make pickling work:

In [3]: import dill

In [4]: pickle.dumps(lambda x: x)
Out[4]: "cdill.dill\n_load_type\np0\n(S'FunctionType'\np1 ...

This is very encouraging, particularly because I don't have access to the multiprocessing source code. Sadly, I still can't get this very basic example to work:

import multiprocessing as mp
import dill

p = mp.Pool(4)
print p.map(lambda x: x**2, range(10))

Why is this? What am I missing? What exactly are the limitations of the multiprocessing + dill combination?

Temporary edit for J.F. Sebastian

mrockli@mrockli-notebook:~/workspace/toolz$ python testmp.py 
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/mrockli/Software/anaconda/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/home/mrockli/Software/anaconda/lib/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/mrockli/Software/anaconda/lib/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

^C
...lots of junk...

[DEBUG/MainProcess] cleaning up worker 3
[DEBUG/MainProcess] cleaning up worker 2
[DEBUG/MainProcess] cleaning up worker 1
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-5] child process calling self.run()
[INFO/PoolWorker-6] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-7] child process calling self.run()
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-8] child process calling self.run()
Verde answered 14/11, 2013 at 17:19 Comment(7)
Have you tried to guard the pool with if __name__ == "__main__":?Jamshedpur
@J.F.Sebastian yes, with no change. To be explicit I've placed that line both before and after p = mp.Pool(4) with no change in result.Verde
1. add the actual code (with the guard) 2. is there a traceback? 3. enable logging: mp.log_to_stderr().setLevel(logging.DEBUG)Jamshedpur
Try importing dill first.Dogma
@J.F.Sebastian see edit with tracebackVerde
@Dogma Tried this to no avail. In general, please assume that I've tried all simple transpositions of this code. This example is designed to be simple enough that commenters can run it on their own machines, which should give a tighter feedback loop.Verde
For people coming here wanting to pickle functions in __main__, and for whom the accepted answer would be overkill: one can just assign __name__ in the __main__ module to its importable name, and pickle will just work, i.e. if __name__ == "__main__": __name__ = "module_name_as_used_in_import"Horbal

multiprocessing makes some bad choices about pickling. Don't get me wrong: it makes some good choices that enable it to pickle certain types so they can be used in a pool's map function. However, since we have dill that can do the pickling, multiprocessing's own pickling becomes a bit limiting. Actually, if multiprocessing were to use pickle instead of cPickle, and also drop some of its own pickling overrides, then dill could take over and give a much fuller serialization for multiprocessing.
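To see the distinction directly: on import, dill hooks the pure-Python pickle module, but it cannot patch the C-implemented cPickle that multiprocessing uses. A sketch on Python 2 (the first result echoes the question above; the second error is the same one multiprocessing raises in the question's traceback):

>>> import dill, pickle, cPickle
>>> pickle.dumps(lambda x: x)   # pure-python pickler: dill's hooks apply
"cdill.dill\n_load_type\np0\n(S'FunctionType'\np1 ...
>>> cPickle.dumps(lambda x: x)  # C pickler: untouched by dill
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed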

Until that happens, there's a fork of multiprocessing called pathos (the release version is a bit stale, unfortunately) that removes the above limitations. Pathos also adds some nice features that multiprocessing doesn't have, like multi-args in the map function. Pathos is due for a release, after some mild updating, mostly conversion to Python 3.x.

Python 2.7.5 (default, Sep 30 2013, 20:15:49) 
[GCC 4.2.1 (Apple Inc. build 5566)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from pathos.multiprocessing import ProcessingPool    
>>> pool = ProcessingPool(nodes=4)
>>> result = pool.map(lambda x: x**2, range(10))
>>> result
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

and just to show off a little of what pathos.multiprocessing can do...

>>> def busy_add(x,y, delay=0.01):
...     for n in range(x):
...        x += n
...     for n in range(y):
...        y -= n
...     import time
...     time.sleep(delay)
...     return x + y
... 
>>> def busy_squared(x):
...     import time, random
...     time.sleep(2*random.random())
...     return x*x
... 
>>> def squared(x):
...     return x*x
... 
>>> def quad_factory(a=1, b=1, c=0):
...     def quad(x):
...         return a*x**2 + b*x + c
...     return quad
... 
>>> square_plus_one = quad_factory(2,0,1)
>>> 
>>> def test1(pool):
...     print pool
...     print "x: %s\n" % str(x)
...     print pool.map.__name__
...     start = time.time()
...     res = pool.map(squared, x)
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
...     print pool.imap.__name__
...     start = time.time()
...     res = pool.imap(squared, x)
...     print "time to queue:", time.time() - start
...     start = time.time()
...     res = list(res)
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
...     print pool.amap.__name__
...     start = time.time()
...     res = pool.amap(squared, x)
...     print "time to queue:", time.time() - start
...     start = time.time()
...     res = res.get()
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
... 
>>> def test2(pool, items=4, delay=0):
...     _x = range(-items/2,items/2,2)
...     _y = range(len(_x))
...     _d = [delay]*len(_x)
...     print map
...     res1 = map(busy_squared, _x)
...     res2 = map(busy_add, _x, _y, _d)
...     print pool.map
...     _res1 = pool.map(busy_squared, _x)
...     _res2 = pool.map(busy_add, _x, _y, _d)
...     assert _res1 == res1
...     assert _res2 == res2
...     print pool.imap
...     _res1 = pool.imap(busy_squared, _x)
...     _res2 = pool.imap(busy_add, _x, _y, _d)
...     assert list(_res1) == res1
...     assert list(_res2) == res2
...     print pool.amap
...     _res1 = pool.amap(busy_squared, _x)
...     _res2 = pool.amap(busy_add, _x, _y, _d)
...     assert _res1.get() == res1
...     assert _res2.get() == res2
...     print ""
... 
>>> def test3(pool): # test against a function that should fail in pickle
...     print pool
...     print "x: %s\n" % str(x)
...     print pool.map.__name__
...     start = time.time()
...     res = pool.map(square_plus_one, x)
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
... 
>>> def test4(pool, maxtries, delay):
...     print pool
...     m = pool.amap(busy_add, x, x)
...     tries = 0
...     while not m.ready():
...         time.sleep(delay)
...         tries += 1
...         print "TRY: %s" % tries
...         if tries >= maxtries:
...             print "TIMEOUT"
...             break
...     print m.get()
... 
>>> import time
>>> x = range(18)
>>> delay = 0.01
>>> items = 20
>>> maxtries = 20
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> pool = Pool(nodes=4)
>>> test1(pool)
<pool ProcessingPool(ncpus=4)>
x: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

map
time to results: 0.0553691387177
y: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289]

imap
time to queue: 7.91549682617e-05
time to results: 0.102381229401
y: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289]

amap
time to queue: 7.08103179932e-05
time to results: 0.0489699840546
y: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289]

>>> test2(pool, items, delay)
<built-in function map>
<bound method ProcessingPool.map of <pool ProcessingPool(ncpus=4)>>
<bound method ProcessingPool.imap of <pool ProcessingPool(ncpus=4)>>
<bound method ProcessingPool.amap of <pool ProcessingPool(ncpus=4)>>

>>> test3(pool)
<pool ProcessingPool(ncpus=4)>
x: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

map
time to results: 0.0523059368134
y: [1, 3, 9, 19, 33, 51, 73, 99, 129, 163, 201, 243, 289, 339, 393, 451, 513, 579]

>>> test4(pool, maxtries, delay)
<pool ProcessingPool(ncpus=4)>
TRY: 1
TRY: 2
TRY: 3
TRY: 4
TRY: 5
TRY: 6
TRY: 7
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]
Ungovernable answered 14/11, 2013 at 18:35 Comment(28)
So is the problem that multiprocessing uses cPickle rather than pickle causing dill to be unable to perform its usual magic?Verde
Yes, primarily. More recent versions of multiprocessing try to pickle certain types, but some of the picklings are not as (cough) thorough as those that dill does. So, my fork does two things (1) replace cPickle with pickle, and (2) delete a few of multiprocessing's attempts at pickling. I've also added some features to my fork like the multi-args map, but if multiprocessing did the above two things... then that would enable dill magic to happen in multiprocessing.Ungovernable
I haven't yet figured out how to make a cPickle "override"... Multiprocessing is the primary reason to do so, at least for the versions using cPickle. In python 3.x, there is no cPickle, so there should be less of a limitation that multiprocessing imposes on itself.Ungovernable
I'm glad that this application is in the mind of the dill developers. Multiprocessing is really crippled by the fragility of pickling. I suspect this issue stops many developers in their exploration of multiprocessing.Verde
This combination would be really useful for pymc. How stable/robust is this combination? Would you recommend using it in a library?Lifelike
dill plus pathos.multiprocessing has been used in combination for nearly 10 years in my research. It's been production stable for over five years. It's currently used on some of the largest compute clusters in the US (both DOE and corporate). It's pretty stable. As the author of both, my recommendation probably is meaningless. If you are concerned about the "alpha" designation the libraries carry, I'm removing that as I churn out releases this time. Your best option until I get a new release of pathos out is to build off of github.com/uqfoundation.Ungovernable
also, pip --pre or just plain old easy_install should get the latest of the dev releases for pathos. I'd suggest the github versions for now. I have a private svn, and the code that gets dumped to git is basically only the most stable branch.Ungovernable
To install the latest version from GitHub, I used pip install git+https://github.com/uqfoundation/pathosSelby
Strangely, this doesn't work in python 3 either, even though it uses pickle rather than cPickle. The same PicklingError exception is raised, so somehow mp manages to use the original _pickle module instead of dill despite my import dill as pickleRidge
@max: that would mean that pickle in 3.x is cPickle, where the name is now just pickle. I could see that. Sigh. Maybe I knew that and forgot. I will hopefully get the 3.x version of pathos.multiprocessing finished soon.Ungovernable
I've tried every option for installing pathos: easy_install, the build directions on the pypi page, and $ pip install git+github.com/uqfoundation/pathos ...but with all of them, when I try from pathos.multiprocessing import ProcessingPool I get ImportError: No module named multiprocessing @Mike, I see you're constantly refining the code, but how do we use it?Spitfire
As @JoshRosen comments above, pip install does work. If it doesn't work for you, fill out an issue on the pathos github page.Ungovernable
@max: I have updated pathos to build and install for python 3.x, and a new release is imminent (i.e. then pip install pathos will work as expected).Ungovernable
This code still raises PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failedAzores
@Yurii: I'm not getting any errors from the code. Are you on windows and without a C++ compiler? To figure out why you are seeing an error, I'll need more information. The error you are seeing is typical of a bad install of the multiprocess dependency.Ungovernable
@MikeMcKerns indeed I use Windows (not proud of it). What do u mean by "without a C++ compiler"? I didn't compile the code myself, but used pip to install pathos. I tried to install it both on top on Anaconda2 and on top of pure Python2, both ended in same exception. How do I fix "bad install"?Azores
I have not distributed wheels (but will start doing so for next releases), and only distribute source releases. So when pip installs, it has to build first. multiprocess needs a C++ compiler... and if it doesn't have one, it just uses multiprocessing. Most OSes come with a C++ compiler, but Windows doesn't, so on Windows you have to install one. Microsoft has them available... you just need to go install it. In future releases, I will release wheels, so you won't need a compiler then. See Windows instructions here: github.com/mmckerns/tuthpcUngovernable
Hm, I just installed latest pathos on Windows with pip install pathos. I can see in the log that all packages, including dill and multiprocess were installed using wheels. And yet I still get the error PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed when I try your example. Any idea what could I be doing wrong? Using Python 2.7Xenia
@nirvana-msu: I'm going to assume it's the same as my comment above. You need a C++ compiler for multiprocess on Windows. I didn't release a wheel for multiprocess, which means if you got one, it was auto-built from the python sources... and that means it didn't have the necessary C++ in it. I have to cut releases of all of my packages, and I will add a wheel in this next time around, as I said previously.Ungovernable
@MikeMcKerns I am having the same problem so I assume the wheel was not added yet. Precisely what dependencies are necessary? Your link specifies a whole slew of stuff, not just a C++ compiler.Deliadelian
@MikeMcKerns also some dependencies in your link are difficult to find. Google search on download "Python Tools 2.2 RC for Visual Studio 2015" yields three results, all related to pathos, none with a download link.Deliadelian
@MikeMcKerns and after installing everything on your page, I still have the same old pickling error when using pathosDeliadelian
@DaveKielpinski: The dependencies for pathos are dill, pox, ppft, and multiprocess. All of the above are pure python, with the exception of multiprocess... and multiprocess only needs a C compiler, with no other new dependencies. Yeah, I don't think I added a wheel in the latest release.Ungovernable
@MikeMcKerns Thanks. I installed all those, but either the build didn't work, or pathos can't handle my object.Deliadelian
@DaveKielpinski: Here's what you should do: (1) if you try import _multiprocess and it's not found, then you have a build problem; (2) if you can't pickle your object with dill.dumps, then you have a pickling problem.Ungovernable
@MikeMcKerns thanks. I checked this. I can import _multiprocess and I can also pickle my object (Microring) with dill.dumps. Unfortunately pathos still throws an error: multiprocess.pool.MaybeEncodingError: Error sending result: '[<data_analysis.pdk.Microring object at 0x000000000AE3AFD0>]'. Reason: 'PicklingError('args[0] from __newobj__ args has the wrong class',)'Deliadelian
@MikeMcKerns oh, and I can also unpickle my object with dill.loads.Deliadelian
@DaveKielpinski: then you should submit a ticket on the pathos GitHub page, or a new issue here at SO.Ungovernable

Overwrite the multiprocessing module's pickler class:

import dill, multiprocessing

# Expose dill's module-level dumps/loads as methods on dill's Pickler class,
# then swap that class in for the stdlib pickler multiprocessing uses.
dill.Pickler.dumps, dill.Pickler.loads = dill.dumps, dill.loads
multiprocessing.reduction.ForkingPickler = dill.Pickler
multiprocessing.reduction.dump = dill.dump
multiprocessing.queues._ForkingPickler = dill.Pickler  # no longer exists in newer Pythons; see comments
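With those overrides in place before any pool or queue is created (and, per the comments below, before importing concurrent.futures), the failing example from the question should run. A minimal sketch, assuming Python 3 with the fork start method:

import dill, multiprocessing

dill.Pickler.dumps, dill.Pickler.loads = dill.dumps, dill.loads
multiprocessing.reduction.ForkingPickler = dill.Pickler
multiprocessing.reduction.dump = dill.dump

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        print(pool.map(lambda x: x**2, range(10)))  # [0, 1, 4, 9, ..., 81]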
Psychoneurosis answered 20/9, 2021 at 11:20 Comment(5)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Gynaeco
This one worked for me, thanks. I just had to remove the multiprocessing.queues._ForkingPickler = dill.Pickler as it apparently no longer exists.Purchase
Thanks, this still works with Python 3.10 and it handles well the use case of concurrent.futures.Cohabit
@Cohabit Can you expand on this? How do you accomplish this with concurrent.futures?Cultural
@Cultural you need to put this code before importing concurrent.futuresPsychoneurosis
C
2

You may want to try the multiprocessing_on_dill library, which is a fork of multiprocessing that uses dill as its serialization backend.

For example, you can run:

>>> import multiprocessing_on_dill as multiprocessing
>>> with multiprocessing.Pool() as pool:
...     pool.map(lambda x: x**2, range(10))
... 
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Cavuoto answered 4/5, 2021 at 17:24 Comment(1)
Unmaintained, sadly. The last stable release of multiprocessing_on_dill was nearly a year ago. Yikes! It's pathos or nothing.Washerwoman

I know this thread is old; however, you don't necessarily have to use the pathos module, as Mike McKerns pointed out. I also find it quite annoying that multiprocessing uses pickle instead of dill, so you can do something like this:

import multiprocessing as mp
import dill

def helperFunction(f, inp, *args, **kwargs):
    import dill  # reimport, just in case dill is not available in the worker process
    f = dill.loads(f)  # rebuild the (potentially lambda) function from bytes
    return f(inp, *args, **kwargs)

def mapStuff(f, inputs, *args, **kwargs):
    f = dill.dumps(f)  # serialize the (potentially lambda) function to bytes
    with mp.Pool(6) as pool:  # 6-worker pool, closed automatically on exit
        futures = [pool.apply_async(helperFunction, [f, inp, *args], kwargs) for inp in inputs]
        return [fut.get() for fut in futures]

Then, you can use it like this:

mapStuff(lambda x: x**2, [2, 3]) # returns [4, 9]
mapStuff(lambda x, b: x**2 + b, [2, 3], 1) # returns [5, 10]
mapStuff(lambda x, b: x**2 + b, [2, 3], b=1) # also returns [5, 10]

def f(x):
    return x**2
mapStuff(f, [4, 5]) # returns [16, 25]

How it works, basically: you convert the lambda function to a bytes object, pass that through to the child process, and have it reconstruct the lambda function there. In the code, I have only used dill to serialize the function, but you can also serialize the arguments if you need to.
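If the arguments themselves would also choke the standard pickler (say, they contain lambdas), the same trick extends to them. A minimal sketch along the lines of the code above (helperFunction2 and mapStuff2 are hypothetical names, not part of the original code):

import multiprocessing as mp
import dill

def helperFunction2(f, inp):
    import dill  # reimport for the worker process, as above
    return dill.loads(f)(dill.loads(inp))  # rebuild both the function and its argument

def mapStuff2(f, inputs):
    f = dill.dumps(f)
    with mp.Pool(6) as pool:
        futures = [pool.apply_async(helperFunction2, [f, dill.dumps(inp)]) for inp in inputs]
        return [fut.get() for fut in futures]

mapStuff2(lambda g: g(3), [lambda x: x**2, lambda x: x + 1])  # returns [9, 4]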

Matthei answered 5/9, 2021 at 6:44 Comment(0)
