How to avoid double imports with the Python multiprocessing module?

I'm trying to instantiate objects using imported modules. To make these imports process-safe (since I'm on Windows), I'm putting the import statements inside the if __name__ == '__main__': block.

My files look somewhat like this:

main.py

# main.py

from multiprocessing import Process

# target func for new process
def init_child(foo_obj, bar_obj):
  pass

if __name__ == "__main__":
  # protect imports from child process
  from foo import getFoo
  from bar import getBar

  # get new objects
  foo_obj = getFoo()
  bar_obj = getBar()

  # start new process
  child_p = Process(target=init_child, args=(foo_obj, bar_obj))
  child_p.start()
  # Wait for process to join
  child_p.join()

foo.py

# foo.py
import os

print('Foo Loaded by PID: ' + str(os.getpid()))

class Foo:
  def __init__(self):
    pass

def getFoo():
  # returning new instance of class
  return Foo()

bar.py

# bar.py
import os

print('Bar Loaded by PID: ' + str(os.getpid()))

class Bar:
  def __init__(self):
    pass

def getBar():
  # not returning a new instance
  return 'bar'

output

Foo Loaded by PID: 58760
Bar Loaded by PID: 58760
Foo Loaded by PID: 29376

The output I get indicates that the foo module was loaded twice. I understand that the interpreter executes the main module again (since Windows does not support the fork system call), but what's odd is that it was imported inside the __main__ block.

This could become an issue when sharing objects, like Queues imported from a dedicated module. Any ideas what might cause this?

Thanks!

Grizel answered 8/2, 2018 at 7:44 Comment(0)

The reason this happens is that on Windows, multiprocessing uses the spawn method to start new processes. As you mentioned, unlike fork, this method starts a fresh interpreter and recreates everything the new process needs inside that process itself. Therefore, if you pass any objects as arguments when starting a process, all required data held by those objects is serialized using pickle and sent to the new process, where the objects are recreated from that data.
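
You can confirm which start method is in effect using the standard multiprocessing helpers; a minimal sketch (nothing here is specific to your code):

import multiprocessing as mp

if __name__ == "__main__":
    # On Windows (and macOS since Python 3.8) the default is "spawn";
    # on Linux it is "fork".
    print(mp.get_start_method())

    # "fork" is simply not available on Windows, so spawn's re-import
    # behaviour cannot be switched off there:
    # mp.set_start_method("spawn", force=True)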

How does this affect your code?

You are passing an instance of the class Foo as an argument to init_child via the Process class. What happens next is that the data inside the instance, along with the fully qualified name of the class, is pickled and sent to the new process. Keep in mind that the pickled data does not include any of the class's code or class attributes, just its name. Then, using the name of the class and the instance data, a new object is created, ready to be used.

As you may have noticed, this whole process only works if the class Foo has already been defined in the new process, and for that to happen, the module that defines Foo must be imported there as well. Multiprocessing handles all of this automatically, which is why you see the extra import in your case.
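
As an illustration, here is a minimal sketch (using only pickle and pickletools from the standard library, run next to foo.py) of what actually gets sent to the child:

import pickle
import pickletools

if __name__ == "__main__":
    from foo import getFoo

    # pickle stores classes "by reference": the payload contains the
    # strings 'foo' and 'Foo' plus the instance state, not the class's code.
    payload = pickle.dumps(getFoo())
    pickletools.dis(payload)   # look for the 'foo' 'Foo' STACK_GLOBAL entry

    # Unpickling in a fresh interpreter therefore has to import foo first,
    # which is exactly when the extra "Foo Loaded by PID: ..." line appears.
    print(type(pickle.loads(payload)))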

To make things clearer, run the code below, which also prints the PID of each worker process and creates 3 workers:

from multiprocessing import Process
import os


# target func for new process
def init_child(foo_obj, bar_obj):
    print(f"Worker Process loaded in PID: {os.getpid()}")
    pass


if __name__ == "__main__":
    # protect imports from child process
    from foo import getFoo
    from bar import getBar

    # get new objects
    foo_obj = getFoo()
    bar_obj = getBar()

    for _ in range(3):
        print()
        # start new process
        child_p = Process(target=init_child, args=(foo_obj, bar_obj))
        child_p.start()
        # Wait for process to join
        child_p.join()

Output

Foo Loaded by PID: 41320
Bar Loaded by PID: 41320

Foo Loaded by PID: 23864
Worker Process loaded in PID: 23864

Foo Loaded by PID: 90256
Worker Process loaded in PID: 90256

Foo Loaded by PID: 123352
Worker Process loaded in PID: 123352

As you can see, foo.py is imported again inside each worker process in order to recreate the instance.
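
If you want to see that for yourself, a small, hypothetical tweak to foo.py's module-level print also reports the process name: the first import shows MainProcess, while the re-import shows the worker's name (e.g. Process-1):

# foo.py -- hypothetical variant, only to show where the import happens
import os
import multiprocessing

print(f"Foo Loaded by PID: {os.getpid()} "
      f"(process: {multiprocessing.current_process().name})")


class Foo:
    def __init__(self):
        pass


def getFoo():
    # returning new instance of class
    return Foo()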

So, what's the solution?

You can use managers. These spawn a server process where the managed object is created and stored, allowing other processes to access its shared state without creating a duplicate. In your case, foo.py will still be imported one extra time to create the instance in the server process. However, after that is done, you can freely pass the instance (more precisely, a proxy to it) to other processes without worrying about the module foo.py being imported in those processes.

main.py

from multiprocessing import Process
from multiprocessing.managers import BaseManager
import os


# target func for new process
def init_child(foo_obj, bar_obj):
    print(f"Worker Process loaded in PID: {os.getpid()}")
    pass


if __name__ == "__main__":
    # protect imports from child process
    from foo import get_foo_with_manager, Foo
    from bar import getBar

    BaseManager.register('Foo', Foo)
    manager = BaseManager()
    manager.start()

    # get new objects
    foo_obj = get_foo_with_manager(manager)
    bar_obj = getBar()

    # start new process
    for _ in range(3):
        print()
        child_p = Process(target=init_child, args=(foo_obj, bar_obj))
        child_p.start()
        # Wait for process to join
        child_p.join()

foo.py

import os

print('Foo Loaded by PID: ' + str(os.getpid()))


class Foo:
    def __init__(self):
        pass


def get_foo_with_manager(manager):
    return manager.Foo()

Output

Foo Loaded by PID: 38332
Bar Loaded by PID: 38332
Foo Loaded by PID: 56632

Worker Process loaded in PID: 63404

Worker Process loaded in PID: 66400

Worker Process loaded in PID: 56044

As you can see, there was only one extra import of foo.py, needed to instantiate foo_obj in the server process created by the manager. After that, foo.py was never imported again.
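
If the workers also need to call methods on the shared instance, the default proxy already supports that. Below is a hypothetical, self-contained sketch (the bump method is not part of the original Foo): the AutoProxy created by BaseManager.register exposes public methods, so every call runs on the single instance living in the manager's server process.

from multiprocessing import Process
from multiprocessing.managers import BaseManager


class Foo:
    def __init__(self):
        self.count = 0

    def bump(self):
        self.count += 1
        return self.count


def init_child(foo_obj):
    # foo_obj is a proxy; bump() executes in the manager process
    print(f"counter is now {foo_obj.bump()}")


if __name__ == "__main__":
    BaseManager.register('Foo', Foo)
    manager = BaseManager()
    manager.start()

    foo_obj = manager.Foo()
    for _ in range(3):
        child_p = Process(target=init_child, args=(foo_obj,))
        child_p.start()
        child_p.join()

    # the parent sees the same shared state
    print(f"final count: {foo_obj.bump()}")  # prints 4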

Benchmark for managers

To see how proxies compare to unmanaged objects, I measured the speed of getting attribute values through a NamespaceProxy:

import time
from multiprocessing.managers import BaseManager, NamespaceProxy



class TimeNamespaceProxy(NamespaceProxy):

    def __getattr__(self, key):
        t = time.time()
        super().__getattr__(key)
        return time.time() - t


class A:
    def __init__(self):
        self.num = 3


def worker(proxy_a):
    times = []
    for _ in range(10000):
        times.append(proxy_a.num)

    print(
        f'Time taken for 10000 calls to get attribute with proxy is : {sum(times)}, with the avg being: {sum(times) / len(times)}')


if __name__ == "__main__":
    BaseManager.register('A', A, TimeNamespaceProxy, exposed=tuple(dir(TimeNamespaceProxy)))
    manager = BaseManager()
    manager.start()
    num = 10000

    a = A()
    proxy_a = manager.A()

    times = []
    t = time.perf_counter()
    for _ in range(num):
        a.num
    times.append(time.perf_counter() - t)

    print(f'Time taken for {num} calls to get attribute without proxy is : {sum(times)}, with the avg being: {sum(times) / num}')

    times = []
    for _ in range(10000):
        times.append(proxy_a.num)

    print(
        f'Time taken for {num} calls to get attribute with proxy is : {sum(times)}, with the avg being: {sum(times) / num}')

Output

Time taken for 10000 calls to get attribute without proxy is : 0.0011279000000000705, with the avg being: 1.1279000000000705e-07
Time taken for 10000 calls to get attribute with proxy is : 1.751499891281128, with the avg being: 0.0001751499891281128

As you can see, not using proxies is more than 1000 times faster. In absolute terms, however, using a proxy only takes about 0.2 ms to pickle/unpickle the function name __getattribute__ and its argument, and then pickle/unpickle the result. For most applications, simply sending the function name over to the manager process will not be a bottleneck. Things only get tricky if the return values/arguments to be sent over are complex and take longer to pickle, but these are common shenanigans when working with multiprocessing in general (I would recommend storing recurring arguments in the managed object itself, if possible, to reduce the overhead; see the sketch below).
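
To illustrate that last suggestion, here is a hypothetical sketch (Worker, set_config and handle are made-up names): the large, recurring argument is stored once in the managed object, so later calls only transfer small data:

from multiprocessing.managers import BaseManager


class Worker:
    def set_config(self, config):
        # the large argument crosses the pipe exactly once
        self.config = config

    def handle(self, item):
        # later calls only ship the small `item`; config already lives
        # in the manager process
        return (self.config["prefix"], item)


if __name__ == "__main__":
    BaseManager.register('Worker', Worker)
    manager = BaseManager()
    manager.start()

    w = manager.Worker()
    w.set_config({"prefix": "job", "payload": "x" * 1_000_000})
    print(w.handle(1))   # ('job', 1)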

In short, implementing process safety always comes with a performance tradeoff. If your use case requires particularly performant code, you might want to re-evaluate your choice of language, because the tradeoff is especially large with Python.

Onomastic answered 29/6, 2022 at 13:20 Comment(2)
I wonder if the cure you offer might be worse than the illness. When you have a managed object, what is now being passed to init_child is a proxy object, and when a method call is made on that proxy, it is carried out by sending a message via a socket or named pipe to a thread started by the manager process, which fields the request and performs the call on the actual object living in the manager's process. In short, there is execution overhead you would not have with unmanaged objects, resulting in much slower performance for every method call.Plantar
@Plantar The question was about not importing sensitive libraries in every child process, and using managers is the simplest way I know to achieve that. As for the actual overhead, I included a benchmark in the code above. While using proxies will definitely introduce overhead and make things much slower comparatively, whether that makes any noticeable difference depends largely on what you expect your code to do. I have included a disclaimer nonetheless ^^.Onomastic
