Slurm: How to restart failed worker job
If one is running an array job on a Slurm cluster, how can one restart a failed worker job?

In a Sun Grid Engine queue, one can add #$ -r y to the job file to indicate that the job should be restarted if it fails. What is the Slurm equivalent of this flag?

Neuropathy answered 2/6, 2018 at 22:34
You can use --requeue:

#SBATCH --requeue                   ### Allow the job to be requeued (e.g. after node failure or preemption)

--requeue

Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.

See more here: https://slurm.schedmd.com/sbatch.html#lbAE
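
Note that the description above lists node failure, preemption, and administrator action, but not a non-zero exit code from your own script. If you also want a task to retry when the worker itself fails, one possible approach is to have the batch script requeue itself with scontrol requeue. The following is only a sketch under assumptions: ./worker.sh, the array range, and MAX_RETRIES are hypothetical placeholders, and it relies on Slurm setting SLURM_RESTART_COUNT after a requeue.

```shell
#!/bin/bash
#SBATCH --requeue            # allow this job to be requeued
#SBATCH --array=1-10         # hypothetical array range

MAX_RETRIES=3                # hypothetical retry limit (not a Slurm setting)

# Decide whether a failed task should be requeued.
# $1 = exit code of the work, $2 = number of restarts so far
should_requeue() {
    [ "$1" -ne 0 ] && [ "$2" -lt "$MAX_RETRIES" ]
}

# Only do the work when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    ./worker.sh "$SLURM_ARRAY_TASK_ID"   # hypothetical worker command
    status=$?

    # SLURM_RESTART_COUNT is unset on the first run, so default to 0.
    if should_requeue "$status" "${SLURM_RESTART_COUNT:-0}"; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit "$status"
fi
```

Because a requeued job starts the batch script from the beginning, the worker should be safe to rerun. The retry limit keeps a task that always fails from being requeued forever.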

Chrysa answered 3/6, 2018 at 22:32