Solving SLURM "sbatch: error: Batch job submission failed: Requested node configuration is not available" error
We have 4-GPU nodes, each with two 36-core CPUs and 200 GB of RAM, available on our local cluster. When I try to submit a job with the following configuration:

#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00

I'm getting the following error:

sbatch: error: Batch job submission failed: Requested node configuration is not available

What might be the reason for this error? The nodes have exactly the kind of hardware that I need...

Katzen answered 21/3, 2019 at 23:13 Comment(0)

The CPUs most likely have 36 hardware threads, not 36 cores, and Slurm is probably configured to allocate cores rather than threads.

Check the output of scontrol show nodes to see what the nodes really offer.

Galinagalindo answered 29/3, 2019 at 13:22 Comment(1)
Thanks! I found the problem. My output was: NodeName=node-16 Arch=x86_64 CoresPerSocket=18, ..., CPUTot=72 – Katzen
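The numbers in that output can be sanity-checked with a little shell arithmetic. Sockets=2 and ThreadsPerCore=2 are assumed values here, chosen to be consistent with CoresPerSocket=18 and CPUTot=72:

```shell
# Values as reported by `scontrol show nodes`; Sockets and ThreadsPerCore
# are assumptions consistent with CPUTot=72 on this hardware.
sockets=2
cores_per_socket=18
threads_per_core=2

# What Slurm hands out per node when it allocates by core:
allocatable_cores=$((sockets * cores_per_socket))
# What the node reports as CPUTot when every hyperthread counts:
hardware_threads=$((allocatable_cores * threads_per_core))

echo "cores=$allocatable_cores threads=$hardware_threads"
# cores=36 threads=72
# --ntasks=40 exceeds 36 allocatable cores, hence the submission error.
```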

You're requesting 40 tasks on nodes that have only 36 physical cores. The default Slurm configuration binds tasks to cores, so reducing the task count to 36 or fewer should work. (Or increase --nodes to 2, if your application can run across nodes.)
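A minimal sketch of the adjusted submission header under that assumption; only --ntasks changes from the question, everything else is kept as posted:

```shell
#SBATCH --nodes=1
#SBATCH --ntasks=36          # at most one task per physical core (36 per node)
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00
```

Note that 36 tasks at 1500 MB each request 54 GB in total, which fits comfortably within the 200 GB available per node.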

Armagnac answered 22/3, 2019 at 6:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.