Optimizing multiprocessing.Pool with expensive initialization

Here is a complete, simple working example:

import multiprocessing as mp
import time
import random


class Foo:
    def __init__(self):
        # some expensive set up function in the real code
        self.x = 2
        print('initializing')

    def run(self, y):
        time.sleep(random.random() / 10.)
        return self.x + y


def f(y):
    foo = Foo()
    return foo.run(y)


def main():
    pool = mp.Pool(4)
    for result in pool.map(f, range(10)):
        print(result)
    pool.close()
    pool.join()


if __name__ == '__main__':
    main()

How can I modify it so that Foo is initialized only once per worker, rather than once per task? Basically, I want __init__ called 4 times, not 10. I am using Python 3.5.

Geelong answered 5/8, 2016 at 18:37 Comment(4)
Would it be fine if the class was initialized just once, and then copied to each worker? - Bloodshed
@BrendanAbel I think so. Does that mean the object must be picklable? The object is never mutated after initialization, so I don't see why copying would be bad. - Geelong
Multiprocessing is not the same as multithreading. They have vastly different characteristics. - Bracteole
Sorry for the confusion in the question title. - Geelong

The intended way to deal with things like this is via the optional initializer and initargs arguments to the Pool() constructor. They exist precisely to give you a way to do stuff exactly once when a worker process is created. So, e.g., add:

def init():
    global foo
    foo = Foo()

and change the Pool creation to:

pool = mp.Pool(4, initializer=init)

If you needed to pass arguments to your per-process initialization function, then you'd also add an appropriate initargs=... argument.

Note: you should also remove the

foo = Foo()

line from f(), so that your function uses the global foo created by init().

Inoue answered 5/8, 2016 at 19:2 Comment(3)
Can you please explain the global keyword in this context? I saw the initializer in the docs but didn't know about "global", so I didn't see how to make it work. Thanks. - Geelong
See my edit just now: you also need to use the foo created by the initialization function. The initialization function runs once and ends, so any change it makes has to be visible at global scope so that functions called later (such as those invoked by map()) can benefit. - Inoue
No. Nothing is shared across processes. global has been in Python since day 1, many years before multiprocessing was even an idea for a module. global has nothing to do with processes (or threads). In this context, it simply tells init() to bind foo in the module's global scope instead of in init's local scope (the default). In multiprocessing, each process has its own, distinct module global namespace. - Inoue

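The most obvious approach is lazy loading: create the instance the first time a task runs in each worker.

_foo = None

def f(y):
    global _foo
    # Build the expensive object only on the first call in this process.
    if _foo is None:
        _foo = Foo()
    return _foo.run(y)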
Clownery answered 5/8, 2016 at 18:49 Comment(1)
Why do you have a global _foo that is never referenced? - Barone
