Why does importing a module in '__main__' not allow multiprocessing to use the module?

I've already solved my problem by moving the import up to the top-level declarations, but it left me wondering: why can't I use a module that was imported in '__main__' in functions that are the targets of multiprocessing?

For example:

import os
import multiprocessing as mp

def run(in_file, out_dir, out_q):
    arcpy.RasterToPolygon_conversion(in_file, out_dir, "NO_SIMPLIFY", "Value")
    status = "Done with " + os.path.basename(in_file)
    out_q.put(status, block=False)

if __name__ == '__main__':
    raw_input("Program may hang, press Enter to import ArcPy...")
    import arcpy

    q = mp.Queue()
    _file = "path/to/file"
    _dir = "path/to/dir"
    # There are actually lots of files in a loop to build
    # processes but I just do one for context here
    p = mp.Process(target=run, args=(_file, _dir, q))
    p.start()

# I do stuff with the Queue below to report status to the user

When you run this in IDLE it doesn't error at all... it just keeps doing a Queue check (which is good, so not the problem). The problem is that when you run it from the CMD terminal (either the OS one or Python's) it raises a NameError saying that arcpy is not defined!
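
For reference, here is a minimal sketch of the fix I mentioned above, moving the import up to the top declarations (the rest of the script is unchanged):

import os
import multiprocessing as mp
import arcpy  # moved to module level

def run(in_file, out_dir, out_q):
    arcpy.RasterToPolygon_conversion(in_file, out_dir, "NO_SIMPLIFY", "Value")
    out_q.put("Done with " + os.path.basename(in_file), block=False)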

Just a curious topic.

Serve answered 21/4, 2017 at 14:24 Comment(3)
Are you running on Linux or Windows?Lymanlymann
@Lymanlymann Windows, that's why I am using the if __name__ statement.Serve
On Windows, multiprocessing effectively imports the main script into each Python subprocess it spawns, so the if __name__ == '__main__' test will be False in those cases. In your script, that means that the module arcpy won't have been imported when run() is executed, because the process it's in executes in a completely separate memory space.Ephraim

The situation differs between unix-like systems and Windows. On the unixy systems, multiprocessing uses fork to create child processes that share a copy-on-write view of the parent's memory space. The child sees the imports from the parent, including anything the parent imported under if __name__ == "__main__":.

On Windows, there is no fork; a new process has to be executed. But simply rerunning the parent process doesn't work, because that would run the whole program again. Instead, multiprocessing runs its own Python program that imports the parent's main script and then pickles/unpickles a view of the parent's object space (the Process object, its target, and its arguments) that is, hopefully, sufficient for the child process.

That program is the __main__ for the child process, and the __main__ of the parent script doesn't run; the main script was just imported like any other module. The reason is simple: running the parent's __main__ would just run the full parent program again, which mp must avoid.
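
Incidentally, on Python 3.4+ you can reproduce the Windows behavior on a unixy system by forcing the spawn start method. A short sketch (not part of the question's setup):

import multiprocessing as mp

def worker():
    print("hello from the child")

if __name__ == "__main__":
    # "spawn" is the Windows default; forcing it on Linux makes the
    # child re-import this script instead of inheriting a fork.
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()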

Here is a test that shows what is going on: a main module called testmp.py, and a second module, test2.py, that is imported by the first.

testmp.py

import os
import multiprocessing as mp

print("importing test2")
import test2

def worker():
    print('worker pid: {}, module name: {}, file name: {}'.format(os.getpid(), 
        __name__, __file__))

if __name__ == "__main__":
    print('main pid: {}, module name: {}, file name: {}'.format(os.getpid(), 
        __name__, __file__))
    print("running process")
    proc = mp.Process(target=worker)
    proc.start()
    proc.join()

test2.py

import os

print('test2 pid: {}, module name: {}, file name: {}'.format(os.getpid(),
        __name__, __file__))

When run on Linux, test2 is imported once and the worker runs in the main module.

importing test2
test2 pid: 17840, module name: test2, file name: /media/td/USB20FD/tmp/test2.py
main pid: 17840, module name: __main__, file name: testmp.py
running process
worker pid: 17841, module name: __main__, file name: testmp.py

Under Windows, notice that "importing test2" is printed twice: testmp.py was run two times. But "main pid" was only printed once, so its __main__ wasn't run. That's because multiprocessing changed the module name to __mp_main__ during the import, as the output below shows (a small detection sketch follows the output).

E:\tmp>py testmp.py
importing test2
test2 pid: 7536, module name: test2, file name: E:\tmp\test2.py
main pid: 7536, module name: __main__, file name: testmp.py
running process
importing test2
test2 pid: 7544, module name: test2, file name: E:\tmp\test2.py
worker pid: 7544, module name: __mp_main__, file name: E:\tmp\testmp.py
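
If a script ever needs to detect that re-import, it can test for the module name shown above. A small sketch; note that __mp_main__ is an implementation detail of multiprocessing's spawn support, so don't rely on it for anything important:

import multiprocessing as mp

# Module-level code runs in the parent and in every spawned child.
if __name__ == "__mp_main__":
    print("re-imported by a multiprocessing child")
elif __name__ == "__main__":
    print("running as the parent script")
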
Lymanlymann answered 21/4, 2017 at 14:35 Comment(7)
My mp.Process() is not rerunning __main__ each time, is it? I just want each child process to run the code in def run().Serve
No, but it is re-importing your main module and calling it __mp_main__. That's why you hide stuff you don't want rerun under if __name__ == "__main__":. I've updated the answer with a demo.Lymanlymann
Child startup is significantly more expensive on Windows - a new copy of Python is executed and modules are imported.Lymanlymann
Excellent explanation! Unfortunately, arcpy can't be multithreaded, and the processes can only run in parallel with multiprocessing. It is very expensive; do you have any suggestions to make it less so?Serve
Start long-lived subprocesses early and keep them around a long time. Maybe a Pool and use apply when running a work item. Modules not needed by the parent can be imported in the worker process itself (see the sketch after these comments). Passing large datasets from parent to child is expensive too... have the child read the original files from disk if possible.Lymanlymann
Alternately, write a completely separate child process that communicates over some sort of RPC - maybe Python's xml-rpc or (more to my liking) ZeroMQ. Once again, keeping the payload between parent and child lean really helps.Lymanlymann
Thanks for that great explanation of the functions _fixup_main_from_name and _fixup_main_from_path from the module multiprocessing.spawn.Hydrotaxis
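
A minimal sketch of the Pool suggestion from the comments above. The arcpy call comes from the question; the file list and output path are hypothetical placeholders. The point being illustrated is paying the expensive import once per long-lived worker via an initializer:

import multiprocessing as mp

def init_worker():
    # Expensive import happens once per worker process, not once per task.
    global arcpy
    import arcpy

def run(in_file, out_dir):
    arcpy.RasterToPolygon_conversion(in_file, out_dir, "NO_SIMPLIFY", "Value")
    return "Done with " + in_file

if __name__ == "__main__":
    files = ["raster1.tif", "raster2.tif"]  # hypothetical inputs
    pool = mp.Pool(processes=2, initializer=init_worker)
    results = [pool.apply_async(run, (f, "out/dir")) for f in files]
    for r in results:
        print(r.get())
    pool.close()
    pool.join()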
