I am trying to adapt some bash scripts so that they run on a (PBS) cluster.
The individual tasks are performed by several scripts that are started by a main script.
So far, this main script starts multiple scripts in the background (by appending &), making them run in parallel on one multi-core machine.
I want to replace these calls with qsub calls to distribute the load across the cluster nodes.
However, some jobs depend on others to be finished before they can start.
So far, this was achieved by wait statements in the main script.
But what is the best way to do this using the grid engine?
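For reference, the pattern the main script uses today can be sketched roughly as follows. This is a minimal illustration, not the real script: run_task and the stage names are hypothetical stand-ins for the actual worker scripts.

```bash
# Hypothetical stand-in for one worker script; it only records that it ran.
run_task() {
    echo "$1" >> "$LOG"
}

LOG=$(mktemp)

# Stage 1: start all tasks in the background, then block until all have exited.
for i in 1 2 3; do
    run_task "stage1-$i" &
done
wait

# Stage 2 starts only after wait has returned, i.e. after all of stage 1 is done.
for i in 1 2 3; do
    run_task "stage2-$i" &
done
wait
```

The wait between the two loops is the barrier that needs an equivalent once the calls become qsub submissions.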
I already found this question as well as the -W after:jobid[:jobid...] documentation in the qsub man page, but I hope there is a better way.
We are talking about several thousand jobs to run in parallel first, and another set of the same size to run simultaneously after the last one of these has finished.
This would mean I would have to queue a lot of jobs depending on a lot of jobs.
I could bring this down by using a dummy job in between that does nothing but depend on the first group of jobs, and on which the second group could depend. This would decrease the number of dependencies from millions to thousands, but still: it feels wrong, and I am not even sure such a long command line would be accepted by the shell.
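To make the dummy-job idea concrete, here is a rough sketch of what I have in mind. It assumes that qsub prints the new job id on stdout and that dependencies are given as -W depend=afterok:id1:id2:...; the helper names (submit_stage, join_job_ids) and all script names are made up for illustration.

```bash
# Read one job id per line on stdin and print them joined by ':'.
join_job_ids() {
    paste -s -d ':' -
}

# Submit every given script and print the resulting job ids, one per line.
# Assumption: qsub prints the id of the new job on stdout.
submit_stage() {
    for script in "$@"; do
        qsub "$script"
    done
}

# Hypothetical usage (stage1_*.sh, barrier.sh and stage2_*.sh are placeholders):
#   deps=$(submit_stage stage1_*.sh | join_job_ids)
#   barrier=$(qsub -W depend=afterok:"$deps" barrier.sh)
#   for s in stage2_*.sh; do qsub -W depend=afterok:"$barrier" "$s"; done
```

With several thousand first-stage jobs, the value of deps becomes one very long argument, which is exactly the part I am unsure about.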
- Isn't there a way to wait for all my jobs to finish (something like qwait -u <user>)?
- Or for all jobs that were submitted from this script (something like qwait [-p <PID>])?
Of course it would be possible to write something like this using qstat and sleep in a while loop, but I guess this use case is important enough to have a built-in solution and I was just unable to find it.
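The polling workaround I mean would look roughly like this. It is only a sketch and rests on the assumption that qstat -u <user> prints nothing at all once that user has no jobs left; wait_for_user_jobs is a made-up name.

```bash
# Block until qstat lists no more jobs for the given user.
# Assumption: `qstat -u <user>` produces empty output when no jobs remain.
wait_for_user_jobs() {
    wfuj_user=$1
    wfuj_interval=${2:-30}   # seconds between polls, default 30
    while [ -n "$(qstat -u "$wfuj_user" 2>/dev/null)" ]; do
        sleep "$wfuj_interval"
    done
}

# Hypothetical usage between the two stages:
#   wait_for_user_jobs "$USER" 60
```

Besides being a busy loop, this would also wait for unrelated jobs of the same user, and it would return too early if qstat ever failed and printed nothing.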
What would you recommend / use in such a situation?
Addendum I:
Since it was requested in a comment:
$ qsub --version
version: 2.4.8
Maybe also helpful to determine the exact pbs system:
$ qsub --help
usage: qsub [-a date_time] [-A account_string] [-b secs]
[-c [ none | { enabled | periodic | shutdown |
depth=<int> | dir=<path> | interval=<minutes>}... ]
[-C directive_prefix] [-d path] [-D path]
[-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]
[-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]
[-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path
[-W otherattributes=value...] [-v variable_list] [-V] [-x] [-X] [-z] [script]
Since the comments so far point to job arrays, I searched the qsub man page, with the following results:
[...]
DESCRIPTION
[...]
In addition to the above, the following environment variables will be available to the batch job.
[...]
PBS_ARRAYID
each member of a job array is assigned a unique identifier (see -t)
[...]
OPTIONS
[...]
-t array_request
Specifies the task ids of a job array. Single task arrays are allowed.
The array_request argument is an integer id or a range of integers. Multiple ids or id ranges can be combined in a comma-delimited list. Examples: -t 1-100 or -t 1,10,50-100
[...]
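As far as I understand the excerpt, a single job script submitted with -t could then pick its own piece of work from PBS_ARRAYID. A minimal sketch, assuming a hypothetical input naming scheme input_<index>.dat and a placeholder script name task.sh:

```bash
# Map the array index that -t assigns to each member onto one unit of work.
# Assumption: inputs are named input_<index>.dat (hypothetical naming scheme).
input_for_task() {
    echo "input_$1.dat"
}

# PBS_ARRAYID is set by the batch system according to the man page; default
# to 1 so the sketch can also be run outside of a job for testing.
task_id=${PBS_ARRAYID:-1}
echo "task $task_id would process $(input_for_task "$task_id")"

# Hypothetical submission of 1000 members with a single call:
#   qsub -t 1-1000 ./task.sh
```

That would replace thousands of individual qsub calls per stage with one, but it does not by itself answer how the second stage waits for the first.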
Addendum II:
I have tried the torque solution given by Dmitri Chubarov but it does not work as described.
Without the job array, it works as expected:
testuser@headnode ~ $ qsub -W depend=afterok:`qsub ./test1.sh` ./test2 && qstat
2553.testserver.domain
Job id Name User Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2552.testserver Test1 testuser 0 Q testqueue
2553.testserver Test2 testuser 0 H testqueue
testuser@headnode ~ $ qstat
Job id Name User Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2552.testserver Test1 testuser 0 R testqueue
2553.testserver Test2 testuser 0 H testqueue
testuser@headnode ~ $ qstat
Job id Name User Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2553.testserver Test2 testuser 0 R testqueue
However, using job arrays the second job won't start:
testuser@headnode ~ $ qsub -W depend=afterok:`qsub -t 1-2 ./test1.sh` ./test2 && qstat
2555.testserver.domain
Job id Name User Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2554-1.testserver Test1-1 testuser 0 Q testqueue
2554-2.testserver Test1-1 testuser 0 Q testqueue
2555.testserver Test2 testuser 0 H testqueue
testuser@headnode ~ $ qstat
Job id Name User Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2554-1.testserver Test1-1 testuser 0 R testqueue
2554-2.testserver Test1-2 testuser 0 R testqueue
2555.testserver Test2 testuser 0 H testqueue
testuser@headnode ~ $ qstat
Job id Name User Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2555.testserver Test2 testuser 0 H testqueue
I guess this is due to the lack of an array indication in the job id that is returned by the first qsub:
testuser@headnode ~ $ qsub -t 1-2 ./test1.sh
2556.testserver.domain
As you can see, there is no ...[] indicating that this is a job array.
Also, in the qstat output there are no ...[]s but ...-1 and ...-2 indicating the array.
So the remaining question is how to format -W depend=afterok:... to make a job depend on a specified job array.
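One thing I could still try is to list every array member explicitly, using the ...-1, ...-2 form that qstat shows. The following is an untested guess, not a known solution: whether this version accepts such member ids in a depend list is purely an assumption, and array_member_ids is a made-up helper.

```bash
# Print a colon-separated list of member ids for an array job id and an
# index range, following the <number>-<index>.<server> form seen in qstat.
# Assumption: the job id contains a dot, as in 2554.testserver.domain.
array_member_ids() {
    ami_num=${1%%.*}      # part before the first dot, e.g. 2554
    ami_host=${1#*.}      # part after the first dot, e.g. testserver.domain
    ami_i=$2
    ami_out=
    while [ "$ami_i" -le "$3" ]; do
        ami_out="${ami_out:+$ami_out:}$ami_num-$ami_i.$ami_host"
        ami_i=$((ami_i + 1))
    done
    echo "$ami_out"
}

# Hypothetical usage:
#   first=$(qsub -t 1-2 ./test1.sh)
#   qsub -W depend=afterok:"$(array_member_ids "$first" 1 2)" ./test2
```

Even if this works, it brings back the very long dependency list for arrays with thousands of members, so a way to name the whole array at once would still be preferable.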
--version reports nothing but a version number, and the man page doesn't seem to contain any details either. If there is no way to find out on my own, I could email the administrator of the cluster as a last resort. – Surpass
--version and --help. Hopefully that does some good. – Surpass