Calling an MPI binary in serial as a subprocess of an MPI application

I have a large parallel (MPI-based) simulation application which produces large amounts of data. To evaluate this data I use a Python script.

What I now need to do is to run this application a large number of times (>1000) and calculate statistical properties from the resulting data.

My approach so far has been to have a Python script run in parallel (using mpi4py on, e.g., 48 nodes) that calls the simulation code via subprocess.check_call. I need this call to run my MPI simulation application in serial; the simulation does not also need to run in parallel in this case. The Python script can then analyze the data in parallel and, after finishing, start a new simulation run, until a large number of runs has been accumulated.

Goals are

  • not saving the whole data set from 2000 runs
  • keeping intermediate data in memory

Stub MWE:

file multi_call_master.py:

from mpi4py import MPI
import subprocess

print "Master hello"

call_string = 'python multi_call_slave.py'

comm = MPI.COMM_WORLD

rank = comm.Get_rank()
size = comm.Get_size()

print "rank %d of size %d in master calling: %s" % (rank, size, call_string)

std_outfile = "./sm_test.out"
nr_samples = 1
for samples in range(0, nr_samples):
    with open(std_outfile, 'w') as out:
        subprocess.check_call(call_string, shell=True, stdout=out)
#       analyze_data()
#       communicate_results()

file multi_call_slave.py (this would be the C simulation code):

from mpi4py import MPI

print "Slave hello"

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print "rank %d of size %d in slave" % (rank, size)

This will not work. Resulting output in stdout:

Master hello
rank 1 of size 2 in master calling: python multi_call_slave_so.py
Master hello
rank 0 of size 2 in master calling: python multi_call_slave_so.py
[cli_0]: write_line error; fd=7 buf=:cmd=finalize
:
system msg for write_line failure : Broken pipe
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(311).....: MPI_Finalize failed
MPI_Finalize(229).....: 
MPID_Finalize(150)....: 
MPIDI_PG_Finalize(126): PMI_Finalize failed, error -1
[cli_1]: write_line error; fd=8 buf=:cmd=finalize
:
system msg for write_line failure : Broken pipe
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(311).....: MPI_Finalize failed
MPI_Finalize(229).....: 
MPID_Finalize(150)....: 
MPIDI_PG_Finalize(126): PMI_Finalize failed, error -1

Resulting output in sm_test.out:

Slave hello
rank 0 of size 2 in slave

The reason is that the subprocess assumes it is being run as a parallel application, whereas I intend to run it as a serial one. As a very "hacky" workaround I did the following:

  • Compile all needed MPI-aware libraries with a specific MPI distribution, e.g. Intel MPI
  • Compile the simulation code with a different MPI library, e.g. Open MPI

If I then start my parallel Python script using Intel MPI, the underlying simulation is not aware of the surrounding parallel environment, since it uses a different library.

This worked fine for a while, but unfortunately it is not very portable and is difficult to maintain on different clusters, for various reasons.

I could

  • put the subprocess calling loop into a shell script using srun
    • would mandate buffering results on disk
  • use some kind of MPI_Comm_spawn technique in Python
    • not meant to be used like that
    • difficult to find out if the subprocess has finished
    • probably requires changes to the C code
  • somehow trick the subprocess into not forwarding MPI information
    • tried manipulating the environment variables, to no avail
    • also not meant to be used like that
    • using mpirun -n 1 or srun for the subprocess call does not help

Is there any elegant, official way of doing this? I am really out of ideas and appreciate any input!

Ollie answered 13/1, 2014 at 11:24 Comment(0)

No, there is neither an elegant nor an official way to do this. The only officially supported way to execute other programs from within an MPI application is the use of MPI_Comm_spawn. Spawning child MPI processes via simple OS mechanisms like the one provided by subprocess is dangerous and could even have catastrophic consequences in certain cases.

While MPI_Comm_spawn does not provide a mechanism to find out when the child process has exited, you could kind of simulate it with an intercomm barrier. You will still face problems, since MPI_Comm_spawn does not allow the standard I/O to be redirected arbitrarily; instead it gets redirected to mpiexec/mpirun.
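
For illustration, a minimal sketch of the intercomm-barrier idea in mpi4py (the file name, maxprocs value, and overall structure are assumptions, not taken from the question's code): the child enters a barrier on the parent intercommunicator just before it finalizes, so the parent's matching barrier returns once the child has reached that point.

from mpi4py import MPI

# parent side: spawn one serial child and wait for it via the intercommunicator
child = MPI.COMM_SELF.Spawn('python', args=['multi_call_slave.py'], maxprocs=1)
child.Barrier()       # returns once the child has entered its barrier
child.Disconnect()

# child side (at the end of multi_call_slave.py): signal completion to the parent
parent = MPI.Comm.Get_parent()
parent.Barrier()
parent.Disconnect()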

What you could do is write a wrapper script that removes all possible pathways that the MPI library might use to pass session information around. For Open MPI that would be any environment variable that starts with OMPI_. For Intel MPI that would be variables that start with I_MPI_. And so on. Some libraries might use files, shared memory blocks, or other OS mechanisms, and you'll have to take care of those too. Once every possible mechanism to communicate MPI session information has been eradicated, you can simply start the executable and it should form a singleton MPI job (that is, behave as if run with mpiexec -n 1), as sketched below.
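
A minimal sketch of such a wrapper in Python (the function name run_serial and the exact prefix list are illustrative assumptions; OMPI_/PMIX_ cover Open MPI, and PMI_/I_MPI_/MPICH_ cover MPICH-derived libraries such as Intel MPI):

import os
import subprocess

def run_serial(cmd, stdout=None):
    # Drop every variable an MPI launcher might use to pass session
    # information, so the child starts as a singleton MPI job instead
    # of attaching itself to the surrounding parallel session.
    env = {k: v for k, v in os.environ.items()
           if not k.startswith(('OMPI_', 'PMIX_', 'PMI_', 'I_MPI_', 'MPICH_'))}
    subprocess.check_call(cmd, shell=True, stdout=stdout, env=env)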

Reagan answered 13/1, 2014 at 22:13 Comment(4)
Thank you very much for your insights! I went the MPI_Comm_spawn route yesterday and hit the I/O problem, which I will solve using an intermediate file-piping bash script. For finish detection I will be forced to modify the C code, as an intermediate bash script does not work: it would involve multiple MPI_Init calls or multiple MPI_Comm_spawns... This is all really messy and I do not understand why this is not properly defined in the MPI standard.Ollie
The main goal of MPI is to be highly portable, not versatile. That's why the standard purposely refrains from incorporating OS-specific things like extended process control. Simple example: fork() is not available on Blue Gene. Consequently, system() is not supported either.Reagan
That makes sense, thank you. Shouldn't keep them from offering a blocking MPI_Comm_spawn version with some kind of piping support, though.Ollie
Such support was initially proposed for MPI-2 and subsequently voted against in the final stage of the process back in 1996: "ammendment to remove all independent functions, signal, and monitor from dynamic chapter in MPI-2. This removes 4.3.4 (Starting Independent Processes), 4.3.5 (Starting multiple independent processes), 4.3.6 (Nonblocking requests) part 2, 4.5.2 (Signaling Processes), and 4.5.3 (Notification of change in state of a process). - 14 yes / 7 no / 4 abstain"Reagan
