[StarCluster] Problem with SGE: jobs crash after running for some time

Santosh Kumar Divvala santosh.divvala at gmail.com
Fri Nov 9 17:48:45 EST 2012


Hello,

I have recently started using StarCluster for scheduling my jobs on EC2.

I am using the following command to run my jobs:
qsub -N MultiMatlab -pe orte 8 -e /home/ubuntu/outputs/ -o /home/ubuntu/outputs/ -j y <job to run>

where <job to run> is a compiled MATLAB binary (generated using
http://www.mathworks.com/help/toolbox/compiler/mcc.html and run using
http://www.mathworks.com/products/compiler/mcr/index.html) that
internally uses the MATLAB 'parfor' construct
(http://www.mathworks.com/help/distcomp/parfor.html).

Although I am able to schedule my jobs successfully, many of them crash
after running for some time on the nodes. (I have made sure that I am
not exceeding the memory/CPU resources on each node.)
I have included the qstat output from before and after job 3 on node002
crashed. (In this case, I started four identical jobs, one on each of
the four nodes.)
The output of "qstat -explain a" (also included below) reports the
error as "error: no value for 'np_load_avg' because execd is in
unknown state".
I tried modifying the queue configuration with "qconf -mq", setting
np_load_avg=11.75 instead of the default 1.75, but to no avail. The
qconf output is included below.

Are there any suggestions for fixing this issue? I would appreciate
any pointers.

[21:55:49Fri Nov 09~]qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.28    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          31.54    linux-x64
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          37.81    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          21.15    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8


[21:55:51Fri Nov 09~]qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          38.70    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          20.34    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8


[21:56:46Fri Nov 09~]qstat -explain a
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
 error: no value for "np_load_avg" because execd is in unknown state
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          38.70    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          20.34    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
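
To spot this condition without reading the listing by eye, here is a minimal Python sketch that scans "qstat -f" output for queue instances whose state flags include "u" (unknown, meaning the master cannot reach that node's execd). The function name, the regular expression, and the embedded sample are illustrative assumptions, not part of SGE or StarCluster. Queue instances are written in the queue@host form that qstat prints.

```python
import re

# Matches a queue-instance line of `qstat -f`, for example:
#   all.q@node002   BIP   0/8/8   -NA-   linux-x64   au
# Job lines start with whitespace, so they never match.
QUEUE_LINE = re.compile(
    r"^(?P<queue>\S+@\S+)\s+(?P<qtype>\S+)\s+(?P<slots>\d+/\d+/\d+)\s+"
    r"(?P<load>\S+)\s+(?P<arch>\S+)(?:\s+(?P<states>\S+))?\s*$"
)


def unreachable_queues(qstat_text):
    """Return queue instances whose state flags include 'u' (unknown)."""
    hits = []
    for line in qstat_text.splitlines():
        match = QUEUE_LINE.match(line)
        if match and "u" in (match.group("states") or ""):
            hits.append(match.group("queue"))
    return hits


# Excerpt of the second listing above (hypothetical embedding for the demo).
SAMPLE = """\
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
"""

if __name__ == "__main__":
    print(unreachable_queues(SAMPLE))
```

In practice the text would come from running "qstat -f" on the master, for example through subprocess. The sketch only reports which queue instances are affected, not why execd stopped responding.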

root@master:~# qconf -mq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=11.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make orte
rerun                 FALSE
slots                 1,[node001=8],[node002=8],[node003=8],[node004=8]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
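
For reference, np_load_avg is the load average divided by the number of processors on the host. The rough Python sketch below normalizes the load_avg values from the first "qstat -f" listing and compares them with the two load_thresholds values mentioned above. It assumes each node has 8 cores, which this post does not state (only that each node has 8 slots). The names are made up for illustration.

```python
# load_avg values copied from the first `qstat -f` listing in this post.
LOAD_AVG = {
    "node001": 35.28,
    "node002": 31.54,
    "node003": 37.81,
    "node004": 21.15,
}

# ASSUMPTION: the core count equals the 8 slots configured per node.
CORES_PER_NODE = 8


def np_load_avg(load_avg, cores):
    """Load average normalized by the number of processors."""
    return load_avg / cores


def over_threshold(threshold):
    """Sorted list of nodes whose normalized load exceeds the threshold."""
    return sorted(
        node
        for node, load in LOAD_AVG.items()
        if np_load_avg(load, CORES_PER_NODE) > threshold
    )


if __name__ == "__main__":
    for node in sorted(LOAD_AVG):
        print(node, round(np_load_avg(LOAD_AVG[node], CORES_PER_NODE), 2))
    print("above default 1.75:", over_threshold(1.75))
    print("above raised 11.75:", over_threshold(11.75))
```

Under that assumption, every node is above the default threshold of 1.75 and none is above 11.75. This only concerns the load alarm ("a") state. It says nothing about why execd on node002 went into the unknown ("u") state.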

thanks,
~Santosh

