Using Python multiprocessing on an HPC cluster

I am running a Python script on a Windows HPC cluster. A function in the script uses starmap from the multiprocessing package to parallelize a certain computationally intensive process.

When I run the script on a single non-cluster machine, I obtain the expected speed boost. When I log into a node and run the script locally, I obtain the expected speed boost. However, when the job manager runs the script, the speed boost from multiprocessing is either completely mitigated or, sometimes, even 2x slower. We have noticed that memory paging occurs when the starmap function is called. We believe that this has something to do with the nature of Python's multiprocessing, i.e. the fact that a separate Python interpreter is kicked off for each core.

Since we had success running from the console from a single node, we tried to run the script with HPC_CREATECONSOLE=True, to no avail.

Is there some kind of setting within the job manager that we should use when running Python scripts that use multiprocessing? Is multiprocessing just not appropriate for an HPC cluster?

Unfortunately I wasn't able to find an answer in the community. However, through experimentation, I was able to better isolate the problem and find a workable solution.

The problem arises from the nature of Python's multiprocessing implementation. When a Pool object is created (i.e. the manager class that controls the processing cores for the parallel work), a new Python run-time is started for each core. There are multiple places in my code where the multiprocessing package is used and a Pool object instantiated... every function that requires it creates a Pool object as needed and then joins and terminates before exiting. Therefore, if I call the function 3 times in the code, 8 instances of Python are spun up and then closed 3 times. On a single machine, the overhead of this was not significant at all compared to the computational load of the functions... however on the HPC it was absurdly high.

I re-architected the code so that a Pool object is created at the very beginning of the calling of process and then passed to each function as needed. It is closed, joined, and terminated at the end of the overall process.

We found that the bulk of the time was spent in the creation of the Pool object on each node. This was an improvement though because it was only being created once! We then realized that the underlying problem was that multiple nodes were trying to access Python at the same time in the same place from over the network (it was only installed on the head node). We installed Python and the application on all nodes, and the problem was completely fixed.

This solution was the result of trial and error... unfortunately our knowledge of cluster computing is pretty low at this point. I share this answer in the hopes that it will be critiqued so that we can obtain even more insight. Thank you for your time.

Recommended topics

Hot tags