Submit job with python code (mpi4py) on HPC cluster

I am working a python code with MPI (mpi4py) and I want to implement my code across many nodes (each node has 16 processors) in a queue in a HPC cluster.

My code is structured as below:

from mpi4py import MPI

comm = MPI.COMM_WORLD 
size = comm.Get_size()
rank = comm.Get_rank()

count = 0
for i in range(1, size):
    if rank == i:
        for j in range(5):
            res = some_function(some_argument)
            comm.send(res, dest=0, tag=count)

I am able to run this code perfectly fine on the head node of the cluster using the command

$mpirun -np 48 python codename.py

Here "code" is the name of the python script and in the given example, I am choosing 48 processors. On the head node, for my specific task, the job takes about 1 second to finish (and it successfully gives the desired output).

However, when I run try to submit this same exact code as a job on one of the queues of the HPC cluster, it keeps running for a very long time (many hours) (doesn't finish) and I have to manually kill the job after a day or so. Also, it doesn't give the expected output.

Here is the pbs file that I am using,

#!/bin/sh

#PBS -l nodes=3:ppn=16 
#PBS -N phy
#PBS -m abe
#PBS -l walltime=23:00:00
#PBS -j eo
#PBS -q queue_name

cd $PBS_O_WORKDIR
echo 'This job started on: ' `date`

module load python27-extras
mpirun -np 48 python codename.py

I use the command qsub jobname.pbs to submit the job.

I am confused as to why the code should run perfectly fine on the head node, but run into this problem when I submit this job to run the code across many processors in a queue. I am presuming that I may need to change the pbs script. I will be really thankful if someone can suggest what I should do to run such a MPI script as a job on a queue in a HPC cluster.

Recommended topics

Hot tags