[StarCluster] Problem with SGE: jobs crash after running for sometime
Santosh Kumar Divvala
santosh.divvala at gmail.com
Fri Nov 9 17:48:45 EST 2012
Hello,
I have recently started using StarCluster to schedule my jobs on EC2.
I am using the following command to run my jobs:
qsub -N MultiMatlab -pe orte 8 -e /home/ubuntu/outputs/ -o /home/ubuntu/outputs/ -j y <job to run>
where <job to run> is a compiled MATLAB binary (generated using
http://www.mathworks.com/help/toolbox/compiler/mcc.html and run using
http://www.mathworks.com/products/compiler/mcr/index.html) that
internally uses the MATLAB 'parfor' construct
(http://www.mathworks.com/help/distcomp/parfor.html).
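For scripted submissions, the same qsub invocation can be assembled as an argument list. This is only a minimal sketch: the function name build_qsub_command and the job path /home/ubuntu/run_job.sh are hypothetical placeholders, and the flags are simply the ones from the command above.

```python
import shlex


def build_qsub_command(job_path, name="MultiMatlab", pe="orte", slots=8,
                       output_dir="/home/ubuntu/outputs/"):
    """Return the qsub argument list with the flags quoted in this message."""
    return [
        "qsub",
        "-N", name,              # job name
        "-pe", pe, str(slots),   # parallel environment and slot count
        "-e", output_dir,        # stderr location
        "-o", output_dir,        # stdout location
        "-j", "y",               # merge stderr into stdout
        job_path,
    ]


# Hypothetical job path, used only for illustration.
cmd = build_qsub_command("/home/ubuntu/run_job.sh")
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be passed to subprocess.run directly, which avoids shell quoting problems if a path ever contains spaces.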
Although I am able to schedule my jobs successfully, many of them
crash after running for some time on the nodes. (I have made sure
that I am not exceeding the memory/CPU resources on each
node.)
I have included the qstat output from before and after job 3 on node002
crashed. (In this case, I started four identical jobs, one on each of
the four nodes.)
The output of "qstat -explain a" (also included below) reports the
following error: no value for "np_load_avg" because execd is in
unknown state.
I have tried modifying the queue configuration using "qconf -mq"
(setting np_load_avg=11.75 instead of the default 1.75), but to no
avail. I have included the qconf output below.
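For context on the threshold, Grid Engine defines np_load_avg as the load average divided by the number of processors on the host. The sketch below works through that arithmetic for the load values in the qstat output. It assumes 8 processors per node, inferred only from the 8 slots per queue instance, so the real core count may differ. Under that assumption, the reported loads of about 21 to 38 correspond to roughly 2.6 to 4.7 per processor, which is above the default 1.75 and below the modified 11.75.

```python
def np_load_avg(load_avg, num_proc):
    """Normalized load: load average divided by processor count."""
    return load_avg / num_proc


def exceeds_threshold(load_avg, num_proc, threshold):
    """True if the normalized load reaches the load_thresholds value.

    Assumption: the comparison is ">=", as in a typical complex definition
    for np_load_avg; check "qconf -sc" on the actual cluster.
    """
    return np_load_avg(load_avg, num_proc) >= threshold


# load_avg values copied from the first "qstat -f" output in this message.
reported_loads = {"node001": 35.28, "node002": 31.54,
                  "node003": 37.81, "node004": 21.15}
NUM_PROC = 8  # assumption: one processor per configured slot

for host, load in reported_loads.items():
    normalized = np_load_avg(load, NUM_PROC)
    print(f"{host}: np_load_avg={normalized:.2f} "
          f"over_1.75={exceeds_threshold(load, NUM_PROC, 1.75)} "
          f"over_11.75={exceeds_threshold(load, NUM_PROC, 11.75)}")
```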
Are there any suggestions for fixing this issue?
Please let me know.
[21:55:49Fri Nov 09~]qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.28    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          31.54    linux-x64
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          37.81    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          21.15    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
[21:55:51Fri Nov 09~]qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          38.70    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          20.34    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
[21:56:46Fri Nov 09~]qstat -explain a
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
        error: no value for "np_load_avg" because execd is in unknown state
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          38.70    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          20.34    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
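In the states column, "a" marks a load alarm and "u" marks a queue instance whose execd cannot be contacted. The following is a rough sketch for picking such hosts out of captured "qstat -f" text, for example when polling from a monitoring script. The function names and the regular expression are my own illustration rather than part of Grid Engine, and the pattern accepts both the "queue@host" form and the "queue at host" form that some mail archives produce.

```python
import re

# Matches a queue-instance line such as:
#   all.q@node002  BIP  0/8/8  -NA-  linux-x64  au
QUEUE_LINE = re.compile(
    r"^(?P<queue>\S+?)(?:@| at )(?P<host>\S+)\s+"
    r"(?P<qtype>\S+)\s+(?P<slots>\d+/\d+/\d+)\s+"
    r"(?P<load>\S+)\s+(?P<arch>\S+)(?:\s+(?P<states>\S+))?\s*$"
)


def parse_queue_instances(qstat_text):
    """Return one dict per queue-instance line; job and error lines are skipped."""
    instances = []
    for line in qstat_text.splitlines():
        match = QUEUE_LINE.match(line.strip())
        if match:
            instances.append(match.groupdict())
    return instances


def unreachable_hosts(qstat_text):
    """Hosts whose states column contains 'u' (execd in unknown state)."""
    return [q["host"] for q in parse_queue_instances(qstat_text)
            if q["states"] and "u" in q["states"]]


# Excerpt of the second "qstat -f" output from this message.
SAMPLE = """\
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node001 BIP 0/8/8 35.07 linux-x64
2 0.55500 MultiMatla ubuntu r 11/09/2012 21:39:31 8
---------------------------------------------------------------------------------
all.q@node002 BIP 0/8/8 -NA- linux-x64 au
3 0.55500 MultiMatla ubuntu r 11/09/2012 21:40:46 8
"""

print(unreachable_hosts(SAMPLE))
```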
root@master:~# qconf -mq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=11.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make orte
rerun FALSE
slots 1,[node001=8],[node002=8],[node003=8],[node004=8]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
thanks,
~Santosh