Is it possible to run SLURM jobs in the background using SRUN instead of SBATCH?

I was trying to run SLURM jobs with srun in the background. Unfortunately, because I have to run things through Docker right now, it's a bit annoying to use sbatch, so I am trying to find out if I can avoid it altogether.

From my observations, whenever I run srun, say:

srun docker image my_job_script.py

and then close the window where I was running the command (to avoid seeing all the print statements) and open another terminal window to check whether the command is still running, it seems that my running script gets cancelled for some reason. Since it wasn't submitted through sbatch, it doesn't write an error log file (as far as I know), so I have no idea why it was cancelled.

I also tried:

srun docker image my_job_script.py &

to give control back to me in the terminal. Unfortunately, if I do that, it still keeps printing to my terminal screen, which I am trying to avoid.

Essentially, I log into a remote computer through ssh and then run an srun command, but it seems that if I terminate my ssh connection, the srun command is automatically killed. Is there a way to stop this?

Ideally, I would like to submit the script and not have it be cancelled for any reason unless I cancel it through scancel, and it should not print to my screen. So my ideal solution would:

  1. keep running my srun script even if I log out of the ssh session
  2. keep running my srun script even if I close the window from which I sent the command
  3. keep running my srun script, let me leave the srun session, and not print to my screen (i.e. essentially run in the background)

That would be my ideal solution.


For the curious crowd that wants to know the issue with sbatch: what I would like to be able to do (the ideal solution) is:

sbatch docker image my_job_script.py

However, as people will know, this does not work because sbatch receives the command docker, which isn't a "batch" script. A simple solution (that doesn't really work for my case) would be to wrap the docker command in a batch script:

#!/bin/sh
docker image my_job_script.py

Unfortunately, I am actually using my batch script to encode a lot of information (sort of like a config file) about the task I am running, so changing it to wrap a command might affect the jobs I run, because their underlying file would be changing. That is avoided by sending the job directly to sbatch, since sbatch essentially creates a copy of the batch script (as noted in this question: Changing the bash script sent to sbatch in slurm during run a bad idea?). So the real solution to my problem would be to have my batch script contain all the information that my script requires, and then somehow call docker from Python while passing it all that information. Unfortunately, some of that information consists of function pointers and objects, so it's not even clear to me how I would pass such things to a docker command run from Python.


Or maybe being able to pass the docker command directly to sbatch, instead of going through a batch script, would also solve the problem.

Litmus answered 10/2, 2017 at 18:39
What about using & and redirecting the output with -o? I am not sure, but if srun docker image my_job_script.py & works for you except for the output, how about: srun -o output.txt docker image my_job_script.py &? You could also redirect stderr with -e. (Kelson)
@SergioIserte that seems to have worked so far... now the only caveat is that the SLURM setup I have kills my jobs every 6 hours. So if it tries to run the job again after 6 hours, I wonder whether it will just call my original command or not. I'm just wondering because it might be best for the argument to -o to be an absolute path (or some other unexpected caveat might come up). (Litmus)

The output can be redirected with the -o option for stdout and -e for stderr.

So, the job can be launched in the background with the output redirected:

$ srun -o file.out -e file.err docker image my_job_script.py &
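
If the other part of the question is also a concern, namely the backgrounded srun being killed when the ssh session ends, one option (untested here, and behaviour can depend on the site's setup) is to additionally shield it from the hangup signal with nohup:

# nohup makes the backgrounded srun ignore the SIGHUP sent when the terminal closes
nohup srun -o file.out -e file.err docker image my_job_script.py &

Running disown after backgrounding the job is an alternative with a similar effect.
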
Kelson answered 11/2, 2017 at 21:8
Hilarious. Is this how sbatch is implemented, or what is the difference? I know sbatch makes a copy of the batch script. Does this approach make a copy of my my_job_script.py too? (Litmus)

Another approach is to use a terminal multiplexer like tmux or screen.

For example, create a new tmux session by typing tmux. In that session, run srun with your script. You can then detach the tmux session, which returns you to your main shell so you can go about your other business, or you can log off entirely. When you want to check in on your script, just reattach to the tmux session. See the tmux documentation (man tmux) for how to detach and reattach on your OS.
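
In case concrete commands help, here is a minimal sketch of that workflow (the session name my_srun_job is just a placeholder):

tmux new -s my_srun_job                                      # start a named tmux session
srun -o file.out -e file.err docker image my_job_script.py   # run the job inside it
# press Ctrl-b then d to detach; the session (and srun) keeps running
tmux attach -t my_srun_job                                   # reattach later to check on the job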

Any output redirection using -o or -e will still work with this technique, and you can run multiple srun commands concurrently in different tmux sessions or windows. I've found this approach useful for running concurrent pipelines (genomics, in my case).

Karns answered 20/7, 2018 at 18:58

I was wondering this too, because the differences between sbatch and srun are not very clearly explained or motivated. I looked at the code and found the following:

sbatch

sbatch pretty much just sends a shell script to the controller, tells it to run it, and then exits. It does not need to keep running while the job is in progress. It does have a --wait option to stay running until the job is finished, but all that does is poll the controller every 2 seconds to ask whether the job is done.
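
For illustration, with job.sh standing in for whatever batch script you submit:

sbatch job.sh          # prints something like "Submitted batch job 12345" and returns immediately
sbatch --wait job.sh   # blocks, polling the controller, until the job finishes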

sbatch itself can't launch a job's tasks across multiple nodes - that code simply isn't in sbatch.c (the batch script runs on the first allocated node, and you typically use srun inside it to start tasks on the rest). sbatch is not implemented in terms of srun; it's a totally different thing.

Also, its argument must be a shell script. That's a bit of a weird limitation, but it does have a --wrap option so that it can automatically wrap a real command in a shell script for you. Good luck getting all the escaping right with that!
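
Applied to the question's command, that would look something like this (not tested with this particular docker invocation):

# --wrap generates a trivial batch script around the quoted command for you
sbatch -o file.out -e file.err --wrap="docker image my_job_script.py"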

srun

srun is more like an MPI runner. It directly starts tasks on lots of nodes (one task per node by default, though you can override that with --ntasks). It's intended for MPI, so all of the tasks run simultaneously. It won't start any of them until all the nodes have a free slot.
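
For example, with hostname standing in for a real program:

# launches 4 tasks once the requested slots are free and prints the node each task ran on
srun --ntasks=4 hostname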

It must keep running while the job is in progress. You can send it to the background with &, but this is still different from sbatch. If you need to start a million srun processes, you're going to have a problem. A million sbatch submissions should (in theory) work fine.

There is no way to have srun exit and leave the job running, like there is with sbatch. srun itself acts as a coordinator for all of the nodes in the job, updating the job status and so on, so it needs to keep running for the whole thing.

Cerussite answered 23/1, 2023 at 13:46
