I am trying to understand what the difference is between SLURM's srun and sbatch commands. I will be happy with a general explanation rather than specific answers to the following questions, but here are some specific points of confusion that can serve as a starting point and give an idea of what I am looking for.
According to the documentation, srun is for submitting jobs, and sbatch is for submitting jobs for later execution, but the practical difference is unclear to me and their behavior seems to be the same. For example, I have a cluster with 2 nodes, each with 2 CPUs. If I execute srun testjob.sh & five times in a row, the fifth job is nicely queued up until a CPU becomes available, and the same happens when I execute sbatch testjob.sh.
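(For reference, testjob.sh is nothing special; it is roughly the placeholder below, and I submit it as shown. The sleep length is arbitrary.)

    #!/bin/bash
    # testjob.sh -- placeholder job: report where we ran, then hold a CPU for a while
    echo "running on $(hostname)"
    sleep 60

    # submit five copies; on 2 nodes x 2 CPUs the fifth one waits either way
    for i in 1 2 3 4 5; do srun testjob.sh & done
    # ...or, with apparently the same queueing behavior:
    sbatch testjob.sh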
To make the question more concrete, I think a good place to start might be: What are some things that I can do with one that I cannot do with the other, and why?
Many of the arguments to the two commands are the same. The ones that seem most relevant are --ntasks, --nodes, --cpus-per-task, and --ntasks-per-node. How are these related to each other, and how do they differ for srun vs. sbatch?
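For concreteness, this is the kind of script I have been experimenting with; the numbers are arbitrary, and I am not even sure the combination is sensible, which is part of what I am asking:

    #!/bin/bash
    #SBATCH --nodes=2              # nodes to allocate
    #SBATCH --ntasks=4             # total tasks across the whole allocation
    #SBATCH --ntasks-per-node=2    # at most this many tasks per node
    #SBATCH --cpus-per-task=1      # CPUs handed to each task
    srun testjob.sh

The same flags can apparently also be given on the command line, e.g. srun --ntasks=4 testjob.sh, which is part of why I find the two commands hard to tell apart.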
One particular difference is that srun causes an error if testjob.sh does not have executable permission (i.e., chmod +x testjob.sh), whereas sbatch will happily run it anyway. What is happening "under the hood" that causes this to be the case?
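Spelled out, the experiment is simply this (I am omitting the error text itself; the point is only which command complains):

    chmod -x testjob.sh    # remove the executable bit
    srun testjob.sh        # errors out for me
    sbatch testjob.sh      # submits and runs fine anyway
    chmod +x testjob.sh    # with the bit set, both commands work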
The documentation also mentions that srun is commonly used inside sbatch scripts. This leads to the question: how do they interact with each other, and what is the "canonical" use case for each of them? Specifically, would I ever use srun by itself?
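For reference, this is what I understand "srun inside an sbatch script" to look like; a hypothetical submission script (step1.sh and step2.sh are made-up names) where, presumably, each srun line is one job step:

    #!/bin/bash
    #SBATCH --ntasks=4
    srun step1.sh    # first job step?
    srun step2.sh    # second job step, run after the first finishes?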
Or would I only ever call srun inside the submission script? Perhaps I'm confused about the meaning of a "job step." For example, if I have a script called runjob.sh that contains nothing but #!/bin/bash followed by srun myjob.sh, is there a practical difference between calling (a) sbatch runjob.sh vs. (b) sbatch myjob.sh vs. (c) srun myjob.sh vs. (d) srun runjob.sh? (Clearly the last one is silly, but I'm curious.) – Polygamy
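(Spelled out, the runjob.sh wrapper from that comment is just the following, with myjob.sh assumed to be any executable script:)

    #!/bin/bash
    # runjob.sh -- does nothing except launch myjob.sh as a job step
    srun myjob.sh

    # the four variants being compared
    sbatch runjob.sh   # (a)
    sbatch myjob.sh    # (b)
    srun myjob.sh      # (c)
    srun runjob.sh     # (d)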