End batch job before kill via walltime


I am running a batch job with SLURM. The process I start in the jobfile is iterative. After each iteration, the program can be stopped gracefully by creating a file called stop. I would like this stop file to be created automatically one hour before the job is killed by the walltime limit.

Ticklish answered 7/11, 2014 at 13:18 Comment(1)
Actually, you want a custom termination script. In PBS Pro this is achieved via the $action terminate configuration parameter, which takes a timeout that can be set to any value, e.g. 1 hour. That is, if the walltime is exceeded, the $action terminate script is invoked, and any remaining processes are killed and cleaned up in the normal way once the timeout expires. - Mcginnis

You can have Slurm signal your job a configurable amount of time before the time limit is reached, using the --signal option.

From the sbatch man page:

--signal=[B:]<sig_num>[@<sig_time>] When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by SLURM, the signal may be sent up to 60 seconds earlier than specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have integer value between zero and 65535. By default, no signal is sent before the job’s end time. If a sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell, none of the other processes will be signaled. By default all job steps will be signalled, but not the batch shell itself.

If you can modify your program to stop when it catches that signal, rather than when it finds a file, then this is the best option.
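
As a rough illustration of that first option, here is a minimal sketch of an iterative worker written in bash that records SIGUSR1 and stops after the current iteration. The names run_iterations, do_one_iteration and stop_requested are hypothetical, and the sleep stands in for the real work; a program in another language would install an equivalent signal handler.

```shell
#!/bin/bash
# Sketch only: function and variable names are illustrative,
# not taken from the original program.

stop_requested=0
# Only record the request; the loop decides when it is safe to stop.
trap 'stop_requested=1' USR1

# Placeholder for one iteration of real work.
do_one_iteration() {
    sleep 1
}

# Run up to $1 iterations, stopping early once USR1 has been received.
run_iterations() {
    local max_iterations=$1
    local i
    for ((i = 1; i <= max_iterations; i++)); do
        do_one_iteration
        if ((stop_requested)); then
            echo "stopping after iteration $i"
            return 0
        fi
    done
    echo "completed all $max_iterations iterations"
}
```

Ending the script with a call such as run_iterations 100 would start the loop. Going by the man page excerpt above, without B: the signal is sent to job steps rather than to the batch shell, so this assumes the worker is started as a job step (for example with srun).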

If you can't, add something like

trap  "touch ./stop"  SIGUSR1

in your submission script. With --signal=B:SIGUSR1@3600 this will make the script catch the SIGUSR1 signal and create the stop file one hour before the end of the allocation.
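
Putting the pieces together, a complete jobfile might look like the sketch below. It assumes a 24-hour limit and a placeholder executable ./my_program; the helper name run_with_stop_file is hypothetical. The program is started in the background because bash runs a trap handler only after the current foreground command returns, whereas the wait builtin is interrupted as soon as the signal arrives.

```shell
#!/bin/bash
#SBATCH --time=24:00:00        # assumed walltime; adjust to your job
#SBATCH --signal=B:USR1@3600   # signal the batch shell one hour before the limit

# Run a command in the background and create a stop file when USR1 arrives.
# Usage: run_with_stop_file STOP_FILE COMMAND [ARGS...]
run_with_stop_file() {
    stop_file=$1
    shift
    trap 'touch "$stop_file"' USR1   # handler runs in the batch shell
    "$@" &                           # background, so the trap can fire promptly
    child_pid=$!
    # wait returns early when the trapped signal arrives, so keep waiting
    # until the program has actually exited. "|| true" keeps the loop alive
    # if the script is run with "set -e".
    while kill -0 "$child_pid" 2>/dev/null; do
        wait "$child_pid" || true
    done
}

# Example call (./my_program is a placeholder for the real executable):
# run_with_stop_file ./stop ./my_program
```

The loop around wait matters here: without it the jobfile would end right after the handler ran, and Slurm would terminate the program before it had a chance to notice the stop file.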

Note that only recent versions of Slurm have the B: option in --signal. If your version does not have it, you'll need to set up a watchdog. See examples here.
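
The linked examples are not reproduced here, so one possible watchdog is sketched below: a background timer in the jobfile that creates the stop file after a fixed delay. The 24-hour limit, the helper name start_watchdog and ./my_program are assumptions for illustration. The delay counts from the moment the job script starts running, so it has to be derived by hand from the requested walltime.

```shell
#!/bin/bash
#SBATCH --time=24:00:00    # assumed walltime of 24 hours

# Start a background timer that creates STOP_FILE after DELAY seconds.
# Usage: start_watchdog DELAY STOP_FILE
# The PID of the timer is left in $watchdog_pid.
start_watchdog() {
    ( sleep "$1" && touch "$2" ) &
    watchdog_pid=$!
}

# Example for the 24-hour limit above: fire one hour before the end.
# start_watchdog $(( 24 * 3600 - 3600 )) ./stop
# ./my_program                       # placeholder for the real executable
# kill "$watchdog_pid" 2>/dev/null   # program finished early; cancel the timer
```

Unlike the --signal approach, this does not react to the actual end time of the allocation, so the delay must be updated whenever the requested time limit changes.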

Gaultheria answered 8/11, 2014 at 21:13 Comment(0)

To add to damienfrancois' answer: if the batch script starts another process in the foreground, the signal will not be propagated to it. The process should instead be launched in the background and then waited on, i.e.:

#!/bin/bash
#SBATCH --signal=B:USR1@600

# Single quotes defer the expansion of ${PID} until the trap actually runs
trap 'echo "Signal USR1 received!"; kill -s SIGUSR1 "${PID}"; wait "${PID}"' USR1
my_script &      # launch my_script as a background job
PID=$!           # get the PID of the background job
wait "${PID}"    # wait for the background job to finish

This will launch my_script in the background and forward the SIGUSR1 signal to it when Slurm sends it 10 minutes before the job ends, so that the script can catch it, save a checkpoint and exit gracefully.

Stichometry answered 21/6, 2021 at 7:9 Comment(0)
