I am using Intel MPI and have encountered some confusing behavior when using mpirun
in conjunction with slurm.
If I run (on a login node)
mpirun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"
then I get the expected 0 and 1 printed out.
If, however, I salloc --time=30 --nodes=1 and run the same mpirun from the interactive compute node, I get two 0s printed out instead of the expected 0 and 1.
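A slightly more verbose check than the one-liner above would make that symptom easier to interpret. This is just a sketch (the file name check_world.py is my own choice), launched the same way with mpirun -n 2 python check_world.py:

# check_world.py: print rank, communicator size and host for each process
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank", comm.Get_rank(), "of", comm.Get_size(), "on", MPI.Get_processor_name())

If each of the two processes reports rank 0 of 1, they are being started as independent MPI singletons rather than being wired into a single MPI_COMM_WORLD, which is what the two 0s seem to suggest.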
Then, if I change -n 2 to -n 3 (still on the compute node), I get a long error from slurm saying srun: error: PMK_KVS_Barrier task count inconsistent (2 != 1) (plus a load of other output), and I am not sure how to explain this either.
Now, based on this OpenMPI page, it seems this kind of operation should be supported, at least for OpenMPI:
Specifically, you can launch Open MPI's mpirun in an interactive SLURM allocation (via the salloc command) or you can submit a script to SLURM (via the sbatch command), or you can "directly" launch MPI executables via srun.
Maybe the Intel MPI implementation I was using just doesn't have the same support and is not designed to be used directly in a slurm environment(?), but I am still wondering: what is it about the way mpirun and slurm (salloc) interact that produces this behavior? Why does it print two 0s in the first case, and what are the inconsistent task counts it complains about in the second case?
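Not as an answer, but as a data point: a small probe of the environment inside the allocation shows what the launcher has to work with, since salloc exports a number of SLURM_* variables and, as far as I understand, Intel MPI's Hydra launcher keys off some of them as well as any PMI_* / I_MPI_* settings. A sketch (env_probe.py is a name I made up):

# env_probe.py: list Slurm-, PMI- and Intel-MPI-related environment variables
import os

for key, value in sorted(os.environ.items()):
    if key.startswith(("SLURM_", "PMI_", "I_MPI_")):
        print(f"{key}={value}")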
Using --nodes=2 in the salloc and running the mpirun produces a BAD TERMINATION error from Intel MPI, using mpiexec instead of mpirun produces srun: error: PMK_KVS_Barrier duplicate request from task 0, and the list probably goes on. Am I just not understanding how mpirun / slurm should be used? – Masculine

Perhaps you are using the mpirun from one implementation with the library of another one. – Bibliology
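Regarding the last comment: one way to check for that kind of mismatch, as far as I know, is to ask mpi4py which MPI it was built against and which MPI library it actually loads at run time, and compare that with whatever version information the mpirun being used reports. A sketch (which_mpi.py is just an example name):

# which_mpi.py: report which MPI library mpi4py is built against and loads
import mpi4py
from mpi4py import MPI

# Build-time configuration recorded by mpi4py (compilers, library paths, ...)
print(mpi4py.get_config())

# Runtime identification string of the MPI library actually loaded
print(MPI.Get_library_version())

If that banner names a different MPI than the mpirun being used, the launcher and the library would not be speaking the same process-management protocol, which could explain both the singleton ranks and the PMI-level srun errors.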