Wait for a set of qsub jobs to complete

I have a batch script which starts off a couple of qsub jobs, and I want to trap when they are all completed.

I don't want to use the -sync option, because I want them to be running simultaneously. Each job has a different set of command line parameters.

I want my script to wait until all the jobs have completed, and then do something. I don't want to poll with sleep, e.g. checking every 30 s whether certain files have been generated, because that is a drain on resources.

I believe Torque may have some options, but I am running SGE.

Any ideas on how I could implement this please?

Thanks. P.S. I did find another thread Link

which had a response:

You can use wait to stop execution until all your jobs are done. You can even collect all the exit statuses and other running statistics (time it took, count of jobs done at the time, whatever) if you cycle around waiting for specific ids.

but I am not sure how to use it without polling on some value. Can bash trap be used, and if so, how would I use it with qsub?
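
For reference, a minimal sketch of what the quoted advice looks like with plain background processes (task_a and task_b are made-up script names); it does not carry over directly to qsub, which returns as soon as the job is queued:

./task_a & pid_a=$!
./task_b & pid_b=$!
wait $pid_a; status_a=$?   # blocks until task_a finishes, then captures its exit status
wait $pid_b; status_b=$?
echo "task_a exited $status_a, task_b exited $status_b"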

Heavyhanded answered 17/7, 2012 at 14:52 Comment(1)
You are correct that there is a way to do this in TORQUE. I don't know if SGE has an option to do this. Hereditary

Launch your qsub jobs, using the -N option to give them arbitrary names (job1, job2, etc):

qsub -N job1 -cwd ./job1_script
qsub -N job2 -cwd ./job2_script
qsub -N job3 -cwd ./job3_script

Launch your script and tell it to wait until the jobs named job1, job2 and job3 are finished before it starts:

qsub -hold_jid job1,job2,job3 -cwd ./results_script
Lanugo answered 31/5, 2013 at 15:47 Comment(3)
This seems to not work if the list of jobs is too long (I have 40 jobs, the command ends up being 940 chars...) Algid
Hrm.. no, that's not the problem. It's that PBS Pro uses a different format. You need to use -W depend=afterok:<job_id>[:<job_id>:...] Algid
Is there a way to pass arguments to results_script? Dol

If all of the jobs share a common naming pattern, you can supply that pattern to -hold_jid when you submit the dependent job. https://linux.die.net/man/1/sge_types shows which patterns you can use. Example:

-hold_jid "job_name_pattern*"
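
For instance, a sketch in which the job names (sim_part1 and so on) are illustrative:

qsub -N sim_part1 -cwd ./part1_script
qsub -N sim_part2 -cwd ./part2_script
# holds until every job whose name matches the pattern has finished
qsub -hold_jid "sim_part*" -cwd ./results_script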
Dorcia answered 13/12, 2017 at 19:40 Comment(0)

Another alternative (from here) is as follows:

FIRST=$(qsub job1.pbs)
echo $FIRST
SECOND=$(qsub -W depend=afterany:$FIRST job2.pbs)
echo $SECOND
THIRD=$(qsub -W depend=afterany:$SECOND job3.pbs)
echo $THIRD

The insight is that qsub returns the job id, which is normally printed to standard output. Instead, capture it in a variable ($FIRST, $SECOND, $THIRD) and use the -W depend=afterany:[JOBIDs] flag when you enqueue your jobs to control the dependency structure of when they are dequeued.
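
For the original question (a set of independent jobs followed by one final step), the same idea gives a sketch like the following, again in PBS/Torque syntax; collect.pbs is an illustrative name:

J1=$(qsub job1.pbs)
J2=$(qsub job2.pbs)
J3=$(qsub job3.pbs)
# runs only after all three jobs have terminated
qsub -W depend=afterany:$J1:$J2:$J3 collect.pbs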

Evolutionary answered 10/12, 2015 at 18:22 Comment(0)
qsub -hold_jid job1,job2,job3 -cwd ./myscript
Dropsy answered 3/9, 2012 at 20:27 Comment(1)
To improve the quality of your post please include why/how your post solves the problem. Rhombus

This works in bash, but the ideas should be portable. Use -terse to facilitate building up a string with job ids to wait on; then submit a dummy job that uses -hold_jid to wait on the previous jobs and -sync y so that qsub doesn't return until it (and thus all prereqs) has finished:

# example where each of three jobs just sleeps for some time:
job_ids=$(qsub -terse -b y sleep 10)
job_ids=${job_ids},$(qsub -terse -b y sleep 20)
job_ids=${job_ids},$(qsub -terse -b y sleep 30)
qsub -hold_jid ${job_ids} -sync y -b y echo "DONE"
  • -terse option makes the output of qsub just be the job id
  • -hold_jid option (as mentioned in other answers) makes a job wait on specified job ids
  • -sync y option (referenced by the OP) asks qsub not to return until the submitted job is finished
  • -b y specifies that the command is not a path to a script file (for instance, I'm using sleep 30 as the command)

See the man page for more details.

Landrum answered 26/9, 2017 at 21:7 Comment(0)
#!/depot/Python-2.4.2/bin/python

import os
import subprocess
import shlex

def trackJobs(jobs, waittime=4):
    # Poll qstat until every job id in the list has left the queue.
    while len(jobs) != 0:
        for jobid in jobs:
            # 'qstat -j <jobid>' prints to stderr once the job no longer exists
            x = subprocess.Popen(['qstat', '-j', jobid], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            std_out, std_err = x.communicate()
            if std_err:
                jobs.remove(jobid)
                break  # the list was modified, so restart the iteration
        os.system("sleep " + str(waittime))
    return

This is simple code with which you can track the completion status of qsub jobs. The function accepts a list of job ids (for example ['84210770', '84210774', '84210776', '84210777', '84210778']).
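
A minimal usage sketch, assuming qstat is on the PATH and those ids belong to submitted jobs:

trackJobs(['84210770', '84210774', '84210776', '84210777', '84210778'])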

Hermilahermina answered 3/4, 2020 at 15:30 Comment(1)
The shlex import does not look necessary. Also your shebang looks environment specific; #!/usr/bin/env python is more portable. Rhynchocephalian

If you have 150 files to process and want to run only 15 at a time, with the others held in the queue, you can set up something like this.

# split my list of files into chunks of 10 files each
awk 'NR%10==1 {x="F"++i;}{ print >  "list_part"x".txt" }'  list.txt

qsub all the jobs in such a way that the first job of each list_part*.txt holds the second one, the second holds the third, and so on:

for list in $(ls list_part*.txt); do
    PREV_JOB=$(qsub start.sh)  # a dummy script start.sh, used just to start the chain
    for file in $(cat $list); do
        NEXT_JOB=$(qsub -v file=$file -W depend=afterany:$PREV_JOB myscript.sh)
        PREV_JOB=$NEXT_JOB
    done
done

This is useful if myscript.sh contains a procedure that moves or downloads many files, or otherwise creates heavy traffic on the cluster LAN.

Wardell answered 17/3, 2016 at 13:46 Comment(0)

You can start a job array with qsub -N jobname -t 1-"$numofjobs" -tc 20; then it has only one job id and runs 20 tasks at a time. You give it a name, and simply hold until that array is done using qsub -hold_jid jid or qsub -hold_jid jobname.
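
A sketch of what that might look like (numofjobs, task_script, and results_script are illustrative):

numofjobs=100
# array of 100 tasks, at most 20 running at once
qsub -N myarray -t 1-"$numofjobs" -tc 20 ./task_script
# holds until every task in the array has finished
qsub -hold_jid myarray -cwd ./results_script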

Halfslip answered 5/6, 2020 at 9:2 Comment(0)

I needed more flexibility, so I built a Python module for this and other purposes here. You can run the module directly as a script (python qsub.py) for a demo.

Usage:

$ git clone https://github.com/stevekm/util.git
$ cd util
$ python
Python 2.7.3 (default, Mar 29 2013, 16:50:34)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import qsub
>>> job = qsub.submit(command = 'echo foo; sleep 60', print_verbose = True)
qsub command is:

qsub -j y -N "python" -o :"/home/util/" -e :"/home/util/" <<E0F
set -x
echo foo; sleep 60
set +x
E0F

>>> qsub.monitor_jobs(jobs = [job], print_verbose = True)
Monitoring jobs for completion. Number of jobs in queue: 1
Number of jobs in queue: 0
No jobs remaining in the job queue
([Job(id = 4112505, name = python, log_dir = None)], [])

Designed with Python 2.7 and SGE, since that's what our system runs. The only non-standard Python libraries required are the included tools.py and log.py modules, plus sh.py (also included).

Obviously not as helpful if you wish to stay purely in bash, but if you need to wait on qsub jobs then I would imagine your workflow is edging towards a complexity that would benefit from using Python instead.

Grazing answered 16/11, 2017 at 20:54 Comment(0)
