Excluding nodes from a qsub command under SGE

3

21

I have more than 200 jobs that I need to submit to an SGE cluster. I'll be submitting them to two queues. One of the queues has a machine that I don't want to submit jobs to. How can I exclude that machine? The only thing I found that might be helpful is (assuming three valid nodes are available to q1 and all the nodes available to q2 are valid):

qsub -q q1.q@n1,q1.q@n2,q1.q@n3,q2.q
Walkon answered 13/12, 2012 at 13:48 Comment(0)
-5

There is a nice workaround for this.

Create a simple bash script:

#!/bin/bash
sleep 6000 # replace 6000 with any period long enough for you to submit all your real jobs

Submit this job to the node you wish to exclude, over and over, until the copies fully occupy it.
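For instance, a minimal sketch (q1.q and n4 stand in for your queue and the unwanted host, and sleeper.sh is whatever you named the script above; submit one copy per free slot on that host):

qsub -q q1.q@n4 sleeper.sh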

Voilà, your node is excluded.

Walkon answered 13/12, 2012 at 15:23 Comment(4)
This is a hack but the only solution that works for me (+1). I tried a dozen qsub variants but they either make no difference or result in an error....Bindery
This is terrible advice on bigger, shared clusters.Fungiform
This, combined with the -q argument to pick the exact machine to exclude is a reasonable hack IMO.Clavicorn
this is truly evil!Polysyndeton
29

Assuming the node you don't want to run on is called n4, adding the following to your script should work.

#$ -l h=!n4

If you add the -l option on the qsub command line rather than embedding it in the submitted script, most shells will require the exclamation mark to be quoted.
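For example (n4 and n5 are just example hostnames, and myjob.sh stands in for your submit script; the single quotes stop the shell from interpreting the exclamation mark):

qsub -l h='!n4' myjob.sh
qsub -l h='!(n4|n5)' myjob.sh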

Lowestoft answered 17/10, 2013 at 7:57 Comment(7)
I get "qsub: submit error (Unknown resource type Resource_List.h)"Quentinquercetin
Thanks. How can you do this to two hostnames? #$ -l h=!n4 h!=n5 or #$ -l h!=n4,n5 don't workGearhart
h=!n4&!n5 or h=!(n4|n5) should do it.Lowestoft
-l h='!n4' for me.Ionone
@dranxo, have you solved this issue? I'm hitting the same problem too.Posthorse
@dranxo, Same problem here. Is there a solution (apart from hacky solution below)?Pietrek
The Resource_List.h sounds more like something Torque (or another PBS variant) would spit out, not gridengine. I suggest checking the man page (man qsub) to see which batch scheduler you are using and then posting a similar question mentioning that rather than sge/gridengine. There is one for pbs/torque here #9263827Lowestoft
3

The best way I've found for this is to set up a custom resource on the nodes that you want to allow execution on, then require that resource when you submit the job.

In qmon, go to the "complex" configuration and add a new attribute. Set the name to something like "my_test" and the shortcut to something like "mt", the type to BOOL, the relation to ==, requestable to Yes, consumable to No, and "Add" it. Commit your changes to the complex configuration.
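If you'd rather do this step from the command line too, the same change can be made by dumping the complex list, appending a row, and loading it back (a hedged sketch; the eight-column layout below matches GE 6.x, so check your own qconf -sc output first):

qconf -sc > /tmp/complexes                                      # dump the current complex definitions
echo "my_test  mt  BOOL  ==  YES  NO  0  0" >> /tmp/complexes   # name shortcut type relop requestable consumable default urgency
qconf -Mc /tmp/complexes                                        # load the modified definitions back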

The next step is probably easier to do from the command line, but you can do it in qmon as well. You need to add the new complex to each host that you're going to allow your job to run on. In qmon, you can go to the host configuration, select execution host, and open each host in turn, clicking on the consumables/fixed attributes tab and adding the new complex that you just configured above with "True" as the value. From the command line, you can get a list of your execution hosts with "qconf -sel". This list is suitable for passing to a loop and grepping out the host(s) you don't want included. Do something like this:

qconf -sel | grep -v host_to_exclude | while read host; do
    EDITOR="ed" qconf -me "$host" <<EOL
/complex_values/s/$/,my_test=True/
w
q
EOL
done

This lets you edit the host configuration programmatically (something qconf doesn't normally allow, since it wants to start up your editor for you). It does this by setting the editor to "ed" (you'll have to make sure you have the ed editor installed... try running it by hand first... type "q" to get out). ed takes its list of editing commands on stdin, so we give it three commands. The first edits the line with complex_values on it to append the my_test value. The second writes out the temporary file and the third quits ed.
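To check that the change took, you can dump a host back out and look for the new value on its complex_values line (n1 here is just a placeholder hostname):

qconf -se n1 | grep complex_values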

Once you've done this, submit your jobs with a resource request (-l) that requires your new complex:

qsub -q whatever -l my_test=True my_prog.sh

The -l option requests a resource, and my_test=True says the job can only run on hosts that have the complex my_test with a value of True. Since the complex isn't consumable, each host can still run as many jobs as it likes (up to its slot limit), but the scheduler will avoid any hosts that don't have the my_test complex set to True.
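If you don't want to type the request every time, the same resource request can also live inside the job script as an #$ directive, in the same style as the h=!n4 answer above:

#$ -l my_test=True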

Spence answered 13/12, 2012 at 17:45 Comment(0)