Does a PBS batch system move multiple serial jobs across nodes?

Asked 28/3, 2011 at 0:10 Answered 7/4, 2012 at 19:53

If I need to run many serial programs "in parallel" (because the problem is simple but time consuming - I need to read in many different data sets for the same program), the solution is simple if I only use one node. All I do is keep submitting serial jobs with an ampersand after each command, e.g. in the job script:

./program1 &
./program2 &
./program3 &
./program4

which will naturally run each serial program on a different processor. This works well on a login server or standalone workstation, and of course for a batch job asking for only one node.

But what if I need to run 110 different instances of the same program to read 110 different data sets? If I submit to multiple nodes (say 14) with a script which submits 110 ./program# commands, will the batch system run each job on a different processor across the different nodes, or will it try to run them all on the same, 8 core node?

I have tried to use a simple MPI code to read different data, but various errors result, with about 100 out of the 110 processes succeeding, and the others crashing. I have also considered job arrays, but I'm not sure if my system supports it.

I have tested the serial program extensively on individual data sets - there are no runtime errors, and I do not exceed the available memory on each node.

Aggy answered 28/3, 2011 at 0:10 Comment(0)

No, PBS won't automatically distribute the jobs among nodes for you. But this is a common thing to want to do, and you have a few options.

Easiest and in some ways most advantagous for you is to bunch the tasks into 1-node sized chunks, and submit those bundles as individual jobs. This will get your jobs started faster; a 1-node job will normally get scheduled faster than a (say) 14 node job, just because there's more one-node sized holes in the schedule than 14. This works particularly well if all the jobs take roughly the same amount of time, because then doing the division is pretty simple.
If you do want to do it all in one job (say, to simplify the bookkeeping), you may or may not have access to the pbsdsh command; there's a good discussion of it here. This lets you run a single script on all the processors in your job. You then write a script which queries $PBS_VNODENUM to find out which of the nnodes*ppn jobs it is, and runs the appropriate task.
If not pbsdsh, Gnu parallel is another tool which can enormously simplify these tasks. It's like xargs, if you're familiar with that, but will run commands in parallel, including on multiple nodes. So you'd submit your (say) 14-node job and have the first node run a gnu parallel script. The nice thing is that this will do scheduling for you even if the jobs are not all of the same length. The advice we give to users on our system for using gnu parallel for these sorts of things is here. Note that if gnu parallel isn't installed on your system, and for some reason your sysadmins won't do it, you can set it up in your home directory, it's not a complicated build.

Brakeman answered 28/3, 2011 at 0:43 Comment(1)

Thanks very much, I'm now implementing some of your suggestions. – Aggy 28/3, 2011 at 17:4

You should consider job arrays.

Briefly, you insert #PBS -t 0-109 in your shell script (where the range 0-109 can be any integer range you want, but you stated you had 110 datasets) and torque will:

run 110 instances of your script, allocating each with the resources you specify (in the script with #PBS tags or as arguments when you submit).
assign a unique integer from 0 to 109 to the environment variable PBS_ARRAYID for each job.

Assuming you have access to environment variables within the code, you can just tell each job to run on data set number PBS_ARRAYID.

Inhaler answered 7/4, 2012 at 19:53 Comment(0)

Recommended topics

Hot tags