How to submit/run multiple parallel jobs with Slurm/Sbatch?

I am trying to submit a large number of jobs (several hundred) to a Slurm server and was hoping to avoid having to write a new shell script for each job. The job runs a Python script that takes two input variables passed from the shell script, and those variables are the only thing that changes between jobs. An example of a short shell script that works for a single job is:

#!/bin/bash

#SBATCH -n 1
#SBATCH -t 01:00:00

srun python retrieve.py --start=0 --end=10

What I want is to submit a large number of jobs with the same Python script, changing only the 'start' and 'end' variables between jobs. I read something about increasing the number of cores requested ('-n') and putting an & after each srun command, but I have not been able to get that to work so far.

If anyone knows a quick way to do this, I would appreciate the help a lot!

Titlark asked 2/4, 2021 at 12:44 Comment(1)
Like this? for ((i=0; i<=100; i+=10)); do srun python retrieve.py --start="$i" --end="$((i+10))" & done – Polytypic

To build up from your current solution, you can move to using two CPUs rather than one with:

#!/bin/bash

#SBATCH -n 2
#SBATCH -t 01:00:00

srun -n1 --exclusive python retrieve.py --start=0 --end=10 &
srun -n1 --exclusive python retrieve.py --start=10 --end=20 &
wait

(you might need to adapt the --end based on whether the bounds are inclusive or exclusive)

The above script requests 2 CPUs and creates two tasks running the Python script with different arguments. The --exclusive part is necessary for Slurm versions prior to 20.11 (if I remember correctly). It has nothing to do with the eponymous option of sbatch, which requests whole nodes.

The ampersand (&) allows both tasks to run in parallel and the wait command is there to make sure the script does not terminate before the tasks, otherwise Slurm will just kill them.

You can generalise this with a Bash for-loop or with GNU Parallel, as shown in the sketch below.
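For instance, a minimal for-loop sketch: here the 10 allocated CPUs and the range step of 10 are assumptions, not values taken from the question, so adapt them to your case.

#!/bin/bash

#SBATCH -n 10
#SBATCH -t 01:00:00

# one srun step per (start, end) pair; each step uses one of the 10 allocated CPUs
for ((i=0; i<100; i+=10)); do
    srun -n1 --exclusive python retrieve.py --start="$i" --end="$((i+10))" &
done
wait  # keep the batch script alive until all steps have finished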

This, however, will not submit multiple jobs; it will submit one job with multiple tasks.

If you want to submit multiple jobs, you will need a job array.

#!/bin/bash

#SBATCH -n 1
#SBATCH -t 01:00:00
#SBATCH --array=0-10:10

srun python retrieve.py --start=${SLURM_ARRAY_TASK_ID} --end=$((SLURM_ARRAY_TASK_ID+10))

This will submit two independent jobs that will perform the same work as the job described before.
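To scale this to several hundred independent jobs, you only need to widen the array range. A sketch follows; the 0-2990:10 range (300 jobs of 10 items each) and the optional %50 throttle, which limits how many array jobs run at once, are illustrative values, not taken from the question.

#!/bin/bash

#SBATCH -n 1
#SBATCH -t 01:00:00
#SBATCH --array=0-2990:10%50

# each array task processes the ten items starting at its own task ID
srun python retrieve.py --start=${SLURM_ARRAY_TASK_ID} --end=$((SLURM_ARRAY_TASK_ID+10))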

Chesney answered 7/4, 2021 at 12:20 Comment(0)
