SGE: Jobs stuck in qw state

I'm trying to submit jobs to SGE. This has worked the same way for me in the past, but now all jobs are stuck in the qw state.

"qstat -g c" output:

> CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL
> all.q           0.38      0    160   1920   
> gpu6.q          -NA-      0      0      4    
> par6.q          0.38    750    135   1800      
> seq6.q          0.41    103    170    416   
> smp3.q          1.01      0      0     96  
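
My understanding is that "qstat -g c" only summarizes free slots: an individual queue instance can still be disabled or in an error state, which keeps jobs in qw even though slots look available. A minimal check that should not need root, assuming the standard SGE 6.x qstat options:

qstat -f -q seq6.q              # list every instance of the queue with its state column (d, E, au, ...)
qstat -f -explain E -q seq6.q   # if an instance is in state E, print the reason for the error state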

"qstat" output looks like always.

Googling only turned up hints for people with root access, which I don't have. Any suggestions?

Thanks.

Edit: Jobs were submitted via "qsub -q seq6.q scriptname", or alternatively to smp3.q or par6.q.
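
For completeness, a dry-run verification of the same submission should report whether a suitable queue can be found at all, assuming the -w validation option behaves here as documented for SGE 6.x:

qsub -w v -q seq6.q scriptname   # "-w v": verify only, the job is not actually submitted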

"qstat -j jobid" gives nothing special as far as I can see:

job_number:                 2821318
exec_file:                  job_scripts/2821318
submission_time:            Wed Mar  4 12:07:15 2015
owner:                      username
uid:                        31519
group:                      dch
gid:                        1150
sge_o_home:                 /home/hudson/pg/username
sge_o_log_name:             username
sge_o_path:                 /gpfs/hamilton6/apps/intel_comp_2014/composer_xe_2013_sp1.2.144/bin/intel64:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin:/usr/local/Cluster-Apps/sge/6.1u6/bin/lx24-amd64:/panfs/panasas1.hpc.dur.ac.uk/apps/nag/fll6a21dpl/scripts
sge_o_shell:                /bin/tcsh
sge_o_workdir:              /panfs/panasas1.hpc.dur.ac.uk/username/path
sge_o_host:                 hamilton1
account:                    sge
mail_list:                  username@hamilton1
notify:                     FALSE
job_name:                   scriptname
jobshare:                   0
hard_queue_list:            seq6.q
env_list:                   
script_file:                scriptname
scheduling info:            (Collecting of scheduler job information is turned off)
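
The last line shows that collection of scheduler job information is switched off cluster-wide, so "qstat -j" cannot tell me why the job stays pending. As far as I know a verification report can still be requested for an already queued job without root access; this is a best-effort sketch, and the "poke" variant only exists in newer Grid Engine releases:

qalter -w v 2821318     # re-verify the pending job; prints which queues were rejected and why
# qalter -w p 2821318   # newer Grid Engine only: "poke" also takes current cluster load into account
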
Lavatory answered 3/3, 2015 at 13:14 Comment(3)
Any insight when calling "qstat -j <jobid>"? – Dermatoglyphics
Agreed with Finch_Powers. Also, please edit the post with the qsub command and options used. It is difficult to solve this given so little information. – Actino
The only thing I can think of is that your priority is being downgraded to the point of waiting, which makes no sense since slots are available. I would speak to your sysadmin to help you out. – Actino

I had the same issue today. We run Univa Grid Engine for a customer. On the master host I configured some complexes for jobs that request a lot of memory (h_stack=64M, memory_free=4G, virtual_free=4G). After this change, jobs hang in the waiting queue. The previous configuration (3G) had worked for many years on all our execution hosts. I will test the new 4G configuration in the next few days. All servers have enough memory! Ingo
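
A rough way to double-check whether such requests can be satisfied anywhere (memory_free and virtual_free are the site-specific complexes mentioned above, so adjust the names as needed):

qconf -sc                           # list the configured complexes with their types and defaults
qhost -F memory_free,virtual_free   # show the current per-host values of the requested complexes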

Barram answered 11/3, 2015 at 8:16 Comment(0)
