Python multiprocessing PicklingError: Can't pickle <type 'function'>

I am sorry that I can't reproduce the error with a simpler example, and my code is too complicated to post. If I run the program in the IPython shell instead of regular Python, things work out well.

I looked up some previous notes on this problem. They were all caused by using pool to call a function defined within a class function. But this is not the case for me.

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

I would appreciate any help.

Update: The function I pickle is defined at the top level of the module, though it calls a function that contains a nested function. That is, f() calls g(), which calls h(), which has a nested function i(), and I am calling pool.apply_async(f). f(), g(), and h() are all defined at the top level. I tried a simpler example with this pattern, though, and it works.

Ad answered 10/1, 2012 at 14:28 Comment(1)
The top-level / accepted answer is good, but it could mean you need to re-structure your code, which might be painful. I would recommend that anyone who has this issue also read the additional answers utilising dill and pathos. However, I had no luck with any of the solutions when working with vtkobjects :( Has anyone managed to run Python code in parallel processing vtkPolyData?Piapiacenza

Here is a list of what can be pickled (docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled). In particular, functions are only picklable if they are defined at the top level of a module.

This piece of code:

import multiprocessing as mp

class Foo():
    @staticmethod
    def work():
        pass

if __name__ == '__main__':   
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()

yields an error almost identical to the one you posted:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

The problem is that the pool methods all use an mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be picklable, and foo.work is not picklable since it is not defined at the top level of the module.
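
You can see the constraint directly, without multiprocessing at all. Here is a minimal sketch (the names are illustrative) showing that pickle stores a function by reference, so it must be able to look the function up by name at the top level of its module:

import pickle

def outer():
    def inner():
        pass
    return inner

pickle.dumps(outer)    # works: outer is found as a module attribute
pickle.dumps(outer())  # fails: inner cannot be looked up by name at the top level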

It can be fixed by defining a function at the top level, which calls foo.work():

def work(foo):
    foo.work()

pool.apply_async(work, args=(foo,))

Notice that foo is picklable, since Foo is defined at the top level and foo.__dict__ is picklable.

Selfforgetful answered 10/1, 2012 at 14:54 Comment(5)
Thanks for your reply. I updated my question. I don't think that's the cause, thoughAd
To get a PicklingError something must be put on the Queue which is not picklable. It could be the function or its arguments. To find out more about the problem, I suggest making a copy of your program and paring it down, making it simpler and simpler, each time re-running the program to see if the problem remains. When it becomes really simple, you'll either have discovered the problem yourself or will have something you can post here.Selfforgetful
Also: if you define a function at the top-level of a module, but it's decorated, then the reference will be to the output of the decorator, and you'll get this error anyway.Pokelogan
Only late by 5 years, but I've just run into this. It turns out that "top level" has to be taken more literally than usual: it seems to me that the function definition has to precede the initialization of the pool (i.e. the pool = Pool() line here). I didn't expect that, and this might be the reason why OP's problem persisted.Imparisyllabic
In particular, functions are only picklable if they are defined at the top-level of a module. It appears that the result of applying functools.partial to a top-level function is also picklable, even if it's defined inside another function.Bellwort

I'd use pathos.multiprocessing instead of multiprocessing. pathos.multiprocessing is a fork of multiprocessing that uses dill. dill can serialize almost anything in Python, so you are able to send a lot more around in parallel. The pathos fork also has the ability to work directly with multiple-argument functions, as you need for class methods.

>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(4)
>>> class Test(object):
...   def plus(self, x, y): 
...     return x+y
... 
>>> t = Test()
>>> x = [0, 1, 2, 3]
>>> y = [4, 5, 6, 7]
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]
>>> 
>>> class Foo(object):
...   @staticmethod
...   def work(self, x):
...     return x+1
... 
>>> f = Foo()
>>> p.apipe(f.work, f, 100)
<processing.pool.ApplyResult object at 0x10504f8d0>
>>> res = _
>>> res.get()
101

Get pathos (and if you like, dill) here: https://github.com/uqfoundation
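
As the comments below note, the same author's multiprocess package is a dill-backed fork of multiprocessing with the same API. A minimal sketch of the same idea, assuming multiprocess is installed:

from multiprocess import Pool  # drop-in for multiprocessing, but serializes with dill

class Test(object):
    def plus(self, x, y):
        return x + y

if __name__ == '__main__':
    p = Pool(4)
    t = Test()
    # the bound method t.plus travels to the worker via dill, so this works
    print(p.apply_async(t.plus, (2, 3)).get())  # 5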

Rimarimas answered 25/1, 2014 at 1:34 Comment(12)
worked a treat. For anyone else, I installed both libraries through: sudo pip install git+https://github.com/uqfoundation/dill.git@master and sudo pip install git+https://github.com/uqfoundation/pathos.git@masterReckless
@AlexanderMcFarlane I wouldn't install python packages with sudo (from external sources such as github especially). Instead, I would recommend to run: pip install --user git+...Piapiacenza
Using just pip install pathos does not work sadly and gives this message: Could not find a version that satisfies the requirement pp==1.5.7-pathos (from pathos)Nicaea
Yes. I have not released in a while as I have been splitting up the functionality into separate packages, and also converting to 2/3 compatible code. Much of the above has been modularized in multiprocess which is 2/3 compatible. See #27873593 and pypi.python.org/pypi/multiprocess.Rimarimas
pip install pathos now works, and pathos is python 3 compatible.Rimarimas
please notice that the error PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed may also happen with pathos/dill if for some reason it is forced to fall back to standard multiprocess/pickle (see github.com/uqfoundation/pathos/issues/67)Giulietta
I wonder if you know about attempts to get this to the CPython? did core devs reject it for some reason? (I mean dill and pathos)Eucalyptus
@MikeMcKerns before I saw this answer (to use pathos), as a test I went into my installed python3.7 multiprocessing library, found the line where it was importing pickle (in lib/python3.7/multiprocessing/reduction.py) and simply changed import pickle to import dill as pickle and that did the trick! Is there anything more than that different between pathos and multiprocessing?Bonnie
@DanielGoldfarb: multiprocess is a fork of multiprocessing where dill has replaced pickle in several places in the code... but essentially, that's it. pathos provides some additional API layers on multiprocess and also has additional backends. But, that's the gist of it.Rimarimas
@MikeMcKerns Thanks! Sounds good. So I can assume then that multiprocess implements all the same APIs as multiprocessing making for a nice plug and play replacement (just like dill vs pickle). Nice work! Much appreciated.Bonnie
How do I deal with the Pool not found error?Oldie
@ShivangiSingh: I assume you mean as in this post https://mcmap.net/q/83483/-pathos-pool-not-running/2379433?Rimarimas

When this problem comes up with multiprocessing, a simple solution is to switch from Pool to ThreadPool. This can be done with no change of code other than the import:

from multiprocessing.pool import ThreadPool as Pool

This works because ThreadPool uses threads, which share memory with the main process rather than creating new processes: this means that pickling is not required.

The downside to this method is that Python isn't the greatest language at handling threads: it uses the Global Interpreter Lock to stay thread safe, which can slow down some use cases here. However, if you're primarily interacting with other systems (running HTTP commands, talking to a database, writing to filesystems), then your code is likely not CPU-bound and won't take much of a hit. In fact, when writing HTTP/HTTPS benchmarks I've found that the threaded model used here has less overhead and fewer delays, as the overhead from creating new processes is much higher than the overhead for creating new threads, and the program was otherwise just waiting for HTTP responses.

So if you're processing a ton of stuff in Python userspace, this might not be the best method.
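
As an illustrative sketch, here is the accepted answer's failing example adapted to ThreadPool; it runs because the callable stays in-process and is never pickled:

from multiprocessing.pool import ThreadPool as Pool

class Foo():
    @staticmethod
    def work():
        return 42

if __name__ == '__main__':
    pool = Pool()
    res = pool.apply_async(Foo().work)
    print(res.get())  # 42: no pickling happened
    pool.close()
    pool.join()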

Trichromatism answered 17/11, 2019 at 3:34 Comment(6)
But then you're only using one CPU (at least with regular Python versions that use the GIL), which kind of defeats the purpose.Metonym
That really depends on what the purpose is. The Global Interpreter Lock does mean that only one instance at a time can run python code, but for actions that heavily block (file system access, downloading large or multiple files, running external code) the GIL ends up being a non-issue. In some cases the overhead from opening new processes (rather than threads) outweighs the GIL overhead.Trichromatism
That's true, thanks. Still you might want to include a caveat in the answer. These days when processing power increases mostly come in the form of more rather than more powerful CPU cores, switching from multicore to single-core execution is a rather significant side effect.Metonym
Good point- I've updated the answer with more details. I do want to point out though that switching to threaded multiprocessing does not make python only function on a single core.Trichromatism
... just to add that in my case (HTTP requests on an API) it worked like a charm (while pathos and other methods didn't)... and having to change only one import line has a really light impact on the code. Thanks a lot @RobertHafnerIndivisible
Thanks a lot for your answer! It works for me by just switching from Pool to ThreadPool.Oviparous

As others have said, multiprocessing can only transfer Python objects to worker processes that can be pickled. If you cannot reorganize your code as described by unutbu, you can use dill's extended pickling/unpickling capabilities for transferring data (especially code data) as I show below.

This solution requires only the installation of dill and no other libraries such as pathos:

import os
from multiprocessing import Pool

import dill


def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    return fun(*args)


def apply_async(pool, fun, args):
    payload = dill.dumps((fun, args))
    return pool.apply_async(run_dill_encoded, (payload,))


if __name__ == "__main__":

    pool = Pool(processes=5)

    # asyn execution of lambda
    jobs = []
    for i in range(10):
        job = apply_async(pool, lambda a, b: (a, b, a * b), (i, i + 1))
        jobs.append(job)

    for job in jobs:
        print job.get()
    print

    # async execution of static method

    class O(object):

        @staticmethod
        def calc():
            return os.getpid()

    jobs = []
    for i in range(10):
        job = apply_async(pool, O.calc, ())
        jobs.append(job)

    for job in jobs:
        print job.get()
Jacobean answered 10/7, 2014 at 9:56 Comment(12)
I'm the dill and pathos author… and while you are right, isn't it so much nicer and cleaner and more flexible to also use pathos as in my answer? Or maybe I'm a little biased…Rimarimas
I was not aware about the status of pathos at the time of writing and wanted to present a solution which is very near to the answer. Now that I've seen your solution I agree that this is the way to go.Jacobean
I read your solution and was like, Doh… I didn't even think of doing it like that. So that was kinda cool.Rimarimas
Thanks for posting, I used this approach for dilling/undilling arguments that could not be pickled: #27884074Recognize
@rocksportrocker. I am reading this example and cannot understand why there is an explicit for loop. I would normally see a parallel routine take a list and return a list without a loop.Myriam
@Myriam This is just an example where you can see that lambda functions can be passed to other processes. It is up to you to rewrite it to get away from apply_async in favor of map_async; the strategy of using dill for pickling is the same.Jacobean
@Jacobean Thank you very much! This is what I thought too. I am trying to read through docs.python.org/2/library/multiprocessing.html and understand how I convert your case to something simple as the first example on the above page, but with lambda.Myriam
@rocksportrocker, I tried to use your solution and dill on _tkinter.tkapp object, but still getting the same TypeError: can't pickle _tkinter.tkapp objects. Is there any way that I can get around with that?Legofmutton
Although dill has extended pickle capabilities compared to pickle, some objects cannot be pickled at all, e.g. file handles. One solution is to implement the __setstate__ and __getstate__ methods to pickle/unpickle the necessary data attributes. See also stackoverflow.com/questions/1939058 and docs.python.org/3/library/pickle.html#handling-stateful-objectsJacobean
@MikeMcKerns I actually prefer this answer because there does not seem to be an alternative to multiprocessing.Process in pathosHygienist
@xiay: There is pathos.helpers.mp.Process, which is identical to multiprocess.Process.Rimarimas
@MikeMcKerns thanks for letting me know! Would be nice to have it covered in the documentation though.Hygienist

I have found that I can also generate exactly that error output on a perfectly working piece of code by attempting to use the profiler on it.

Note that this was on Windows (where the forking is a bit less elegant).

I was running:

python -m profile -o output.pstats <script> 

I found that removing the profiling removed the error, and adding it back restored it. It was driving me batty because I knew the code used to work. I was checking to see if something had updated pool.py... then I had a sinking feeling, eliminated the profiling, and that was it.

Posting here for the archives in case anybody else runs into it.

Foretime answered 31/10, 2012 at 4:25 Comment(2)
WOW, thanks for mentioning! It drove me nuts for the last hour or so; I tried everything up to a very simple example - nothing seemed to work. But I also had the profiler running through my batchfile :(Ceciliacecilio
Oh, can't thank you enough. This does sound so silly though, as it is so unexpected. I think it should be mentioned in the docs. All I had was an import pdb statement, and a simple top-level function with just a pass was not picklable.Sapowith
Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

This error will also occur if the model object passed to the async job contains a built-in function.

So make sure the model objects you pass don't have built-in functions. (In our case, we were using the FieldTracker() function of django-model-utils inside the model to track a certain field.) Here is the link to the relevant GitHub issue.
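
If you can't remove the offending attribute, a common workaround (a sketch with hypothetical names; see the pickle docs on handling stateful objects) is to exclude it from the pickled state via __getstate__ and re-create it in __setstate__:

import pickle

class Tracked(object):
    def __init__(self):
        self.rows = [1, 2, 3]
        self.tracker = lambda field: field  # stand-in for an unpicklable callable

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('tracker', None)  # strip the attribute pickle chokes on
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.tracker = lambda field: field  # re-create it after unpickling

pickle.loads(pickle.dumps(Tracked()))  # round-trips once the callable is excluded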

Staffordshire answered 26/5, 2017 at 11:11 Comment(0)

This solution requires only the installation of dill and no other libraries such as pathos:

import dill


def apply_packed_function_for_map((dumped_function, item, args, kwargs),):
    """
    Unpack dumped function as target function and call it with arguments.

    :param (dumped_function, item, args, kwargs):
        a tuple of dumped function and its arguments
    :return:
        result of target function
    """
    target_function = dill.loads(dumped_function)
    res = target_function(item, *args, **kwargs)
    return res


def pack_function_for_map(target_function, items, *args, **kwargs):
    """
    Pack function and arguments to object that can be sent from one
    multiprocessing.Process to another. The main problem is:
        «multiprocessing.Pool.map*» or «apply*»
        cannot use class methods or closures.
    It solves this problem with «dill».
    It works with target function as argument, dumps it («with dill»)
    and returns dumped function with arguments of target function.
    For more performance we dump only target function itself
    and don't dump its arguments.
    How to use (pseudo-code):

        ~>>> import multiprocessing
        ~>>> images = [...]
        ~>>> pool = multiprocessing.Pool(100500)
        ~>>> features = pool.map(
        ~...     *pack_function_for_map(
        ~...         super(Extractor, self).extract_features,
        ~...         images,
        ~...         type='png',
        ~...         **options,
        ~...     )
        ~... )
        ~>>>

    :param target_function:
        function, that you want to execute like  target_function(item, *args, **kwargs).
    :param items:
        list of items for map
    :param args:
        positional arguments for target_function(item, *args, **kwargs)
    :param kwargs:
        named arguments for target_function(item, *args, **kwargs)
    :return: tuple(function_wrapper, dumped_items)
        It returns a tuple with
            * a function wrapper that unpacks and calls the target function;
            * a list of the packed target function and its arguments.
    """
    dumped_function = dill.dumps(target_function)
    dumped_items = [(dumped_function, item, args, kwargs) for item in items]
    return apply_packed_function_for_map, dumped_items

It also works for numpy arrays.
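
For concreteness, a minimal usage sketch (Python 2, to match the tuple-parameter syntax above; it assumes both functions live at the top level of the module):

import multiprocessing

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    # a lambda would fail with the stock pickler, but travels fine via dill
    results = pool.map(*pack_function_for_map(lambda item, k: item * k, [1, 2, 3], 10))
    print(results)  # [10, 20, 30]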

Preceptive answered 27/9, 2015 at 12:13 Comment(0)

A quick fix is to make the function global. Declaring r global inside the method registers it at the top level of the module, where pickle can find it by name. (Note this relies on the worker processes inheriting the parent's module state, i.e. fork-style process start.)

from multiprocessing import Pool


class Test:
    def __init__(self, x):
        self.x = x
    
    @staticmethod
    def test(x):
        return x**2

    def test_apply(self, list_):
        global r
        def r(x):
            return Test.test(x + self.x)

        with Pool() as p:
            l = p.map(r, list_)

        return l



if __name__ == '__main__':
    o = Test(2)
    print(o.test_apply(range(10)))
Danette answered 4/4, 2022 at 17:49 Comment(1)
Curious about the downsides here. It seems to be the easiest way to deal with avoiding pickling of pyqtSignals but what about memory management?Petulah

Building on @rocksportrocker's solution, it would make sense to dill the results when sending and receiving them as well:

import dill
import itertools
def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    res = fun(*args)
    res = dill.dumps(res)
    return res

def dill_map_async(pool, fun, args_list,
                   as_tuple=True,
                   **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)

    it = itertools.izip(
        itertools.cycle([fun]),
        args_list)
    it = itertools.imap(dill.dumps, it)
    return pool.map_async(run_dill_encoded, it, **kw)

if __name__ == '__main__':
    import multiprocessing as mp
    import sys,os
    p = mp.Pool(4)
    res = dill_map_async(p, lambda x:[sys.stdout.write('%s\n'%os.getpid()),x][-1],
                  [lambda x:x+1]*10,)
    res = res.get(timeout=100)
    res = map(dill.loads,res)
    print(res)
Saxony answered 24/7, 2019 at 20:0 Comment(0)

As @penky Suresh has suggested in this answer, don't use built-in keywords.

Apparently args is a built-in keyword when dealing with multiprocessing


from concurrent.futures import ProcessPoolExecutor, as_completed


class TTS:
    def __init__(self):
        pass

    def process_and_render_items(self):
        multiprocessing_args = [{"a": "b", "c": "d"}, {"e": "f", "g": "h"}]

        with ProcessPoolExecutor(max_workers=10) as executor:
          # Using args here is fine. 
            future_processes = {
              executor.submit(TTS.process_and_render_item, args)
                for args in multiprocessing_args
            }

            for future in as_completed(future_processes):
                try:
                    data = future.result()
                except Exception as exc:
                    print(f"Generated an exception: {exc}")
                else:
                   print(f"Generated data for comment process: {future}")
 

    # Don't use 'args' here. It seems to be a built-in keyword.
    # Changing 'args' to 'arg' worked for me.
    @staticmethod
    def process_and_render_item(arg):
        print(arg)
      # This will print {"a": "b", "c": "d"} for the first process
      # and {"e": "f", "g": "h"} for the second process.



PS: The tabs/spaces may be a bit off.

Chiastic answered 22/9, 2021 at 16:16 Comment(2)
This is a bad example. Code is incomplete. multiprocessing_args undefined, TTS undefined. It also has nothing to do with the question, which is related to pickling the function. You're also responding to a post that's 9 years old using python 2.7. If I could downvote this I would.Sumpter
@TLK3, you're right. I've modified the code and added comments. Hopefully it makes more sense now. I realize that I'm responding to an old post but people still look for newer answers in old posts.Chiastic
