Python multiprocessing within MPI

I have a python script that I've written using the multiprocessing module, for faster execution. The calculation is embarrassingly parallel, so the efficiency scales with the number of processors. Now, I'd like to use this within an MPI program, which manages an MCMC calculation across multiple computers. That MPI program calls system() to invoke the python script. However, I'm finding that when it is called this way, the efficiency gain from python multiprocessing vanishes.

How can I get my python script to retain the speed gains from multiprocessing when called from MPI?

Here is a simple example, which is analogous to the much more complicated codes I want to use but displays the same general behavior. I wrote an executable python script called junk.py:

#!/usr/bin/python
import multiprocessing
import numpy as np

nproc = 3
nlen = 100000


def f(x):
    # CPU-bound busy work: enough arithmetic that each worker
    # should saturate one core for several seconds
    print x
    v = np.arange(nlen)
    result = 0.
    for i, y in enumerate(v):
        result += (x + v[i:]).sum()
    return result


def foo():
    # farm f out to a pool of nproc worker processes
    pool = multiprocessing.Pool(processes=nproc)
    xlist = range(2, 2 + nproc)
    print xlist
    result = pool.map(f, xlist)
    print result

if __name__ == '__main__':
    foo()

When I run this from the shell by itself, using top I can see three python processes, each taking 100% of the CPU on my 16-core machine.

node094:mpi[ 206 ] /usr/bin/time junk.py
[2, 3, 4]
2
3
4
[333343333400000.0, 333348333450000.0, 333353333500000.0]
62.68user 0.04system 0:21.11elapsed 297%CPU (0avgtext+0avgdata 16516maxresident)k
0inputs+0outputs (0major+11092minor)pagefaults 0swaps

However, if I invoke it with mpirun, each python process takes only 33% of the CPU, and overall it takes about three times as long to run. Calling with -np 2 or more launches more processes, but doesn't speed up the computation at all.

node094:mpi[ 208 ] /usr/bin/time mpirun -np 1 junk.py
[2, 3, 4]
2
3
4
[333343333400000.0, 333348333450000.0, 333353333500000.0]
61.63user 0.07system 1:01.91elapsed 99%CPU (0avgtext+0avgdata 16520maxresident)k
0inputs+8outputs (0major+13715minor)pagefaults 0swaps

(Additional notes: This is mpirun 1.8.1 and python 2.7.3 on Debian wheezy. I have heard system() is not always allowed within MPI programs, but it's been working for me for the last five years on this computer. For example, I have called pthread-based parallel code from system() within an MPI program, and it used 100% of the CPU for each thread, as desired. Also, in case you were going to suggest running the python script in serial and just calling it on more nodes... the MCMC calculation involves a fixed number of chains which need to move in a synchronized way, so the computation unfortunately can't be reorganized that way.)

Flitch asked 10/9, 2014 at 18:13

OpenMPI's mpirun, v1.7 and later, defaults to binding processes to cores - that is, when it launches the python junk.py process, it binds it to the core that it will run on. That's fine, and the right default behaviour for most MPI use cases. But here each MPI task is then forking more processes (through the multiprocessing package), and those forked processes inherit the binding state of their parent - so they're all bound to the same core, fighting amongst themselves. (The "P" column in top will show you they're all on the same processor.)

If you mpirun -np 2, you'll find two sets of three processes, each set bound to a different core, with the processes in each set contending amongst themselves.
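
If you want to see the inherited binding from inside the script itself, a small diagnostic along these lines (a rough sketch, Linux-only, written in the same Python 2 style as junk.py) reads Cpus_allowed_list from /proc/self/status for the parent and each pool worker; run it plain and then under mpirun and compare what each process reports:

#!/usr/bin/python
# Sketch: report each process's allowed CPUs by reading /proc/self/status
# (Linux-only). Under a core-binding mpirun, the parent and every pool
# worker inherit the same single-core mask; with --bind-to none they
# show the full CPU set instead.
import multiprocessing
import os


def report_affinity(tag):
    for line in open('/proc/self/status'):
        if line.startswith('Cpus_allowed_list'):
            print(tag + ' pid=' + str(os.getpid()) + ' ' + line.strip())


def worker(i):
    report_affinity('worker ' + str(i))
    return i

if __name__ == '__main__':
    report_affinity('parent')
    pool = multiprocessing.Pool(processes=3)
    pool.map(worker, range(3))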

With OpenMPI, you can avoid this by turning off binding,

mpirun -np 1 --bind-to none junk.py

or choosing some other binding which makes sense given the final geometry of your run. MPICH has similar options with hydra.
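
For instance (a sketch only; the flag spellings below are the ones from OpenMPI 1.8's mpirun and MPICH's hydra mpiexec as I remember them, so check them against your installed versions), binding each rank to a whole socket instead of a single core leaves its multiprocessing workers room to spread over that socket's cores:

mpirun -np 1 --bind-to socket junk.py

and the rough hydra equivalent of turning binding off would be

mpiexec -np 1 -bind-to none junk.py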

Note that fork()ing subprocesses under MPI isn't always safe or supported, particularly on clusters with InfiniBand interconnects, but OpenMPI's mpirun/mpiexec will warn you if it isn't safe.

Damiendamietta answered 10/9, 2014 at 18:38. Comments (4):
so the answer is to turn off multiprocessing in the script? – Slavocracy
Sorry, the answer to which part? Launching with --bind-to none will avoid the CPU problem. As to the issue with fork, OpenMPI's mpirun/mpiexec will let you know if you're running in situations where fork-ing isn't safe, and you can tackle that issue when/if it arises. – Damiendamietta
I've confirmed this solution works, both with the simple example and my real code. Thanks very much! I'm guessing this behavior of mpirun changed along with the binding options in OpenMPI version 1.7, otherwise I can't see why some of the codes I ran earlier would have worked. – Flitch
In clusters with InfiniBand interconnects you can make fork() safe by adding export RDMAV_FORK_SAFE=1 before you call mpirun. If said cluster is using huge pages as well, you may also need export RDMAV_HUGEPAGES_SAFE=1. – Mooneyham
