[StarCluster] StarCluster LoadBalancer

Sergio Mafra sergiohmafra at gmail.com
Sat Feb 9 10:47:43 EST 2013


Hi fellows,

I have a cluster of 5 nodes (cc1.4xlarge) running two jobs, each one with
40 nodes.
I'm trying to use loadbalancer to kill the cluster after the jobs are done.
One strange thing: the jobs are running in the queue, as you can see here:

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/8/16         0.42     linux-x64
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/1          -NA-     linux-x64     auo
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/1          -NA-     linux-x64     auo
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/1          -NA-     linux-x64     auo
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/1          -NA-     linux-x64     auo
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8

But if I issue this command in StarCluster:

$ starcluster loadbalance newave -n 1

this is what I get:

>>> Loading full job history
Execution hosts: 5
Queued jobs: 0
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2013-02-09 15:32:21
>>> Not adding nodes: already at or above maximum (5)
>>> Looking for nodes to remove...
>>> No nodes can be removed at this time
>>> Sleeping...(looping again in 60 secs)

It seems that LoadBalancer didn't get the right Avg job duration and could
wrongly kill the cluster, even though there are jobs running.
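
For anyone who wants to cross-check the numbers above, here is a minimal
Python sketch (not part of StarCluster) that sums the used slots and lists
the queue instances with a non-empty states column from `qstat -f` style
text. The sample text, the regular expression and the function name
`summarize` are my own assumptions for illustration; it expects the
`queue@host` form that qstat prints and each queue instance on a single
line.

```python
import re

# Two queue instances copied from the listing above (sample input only).
QSTAT_F = """\
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/8/16         0.42     linux-x64
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/1          -NA-     linux-x64     auo
      2 0.55500 serra85    sgeadmin     r     02/09/2013 11:52:17     8
"""

# Matches a queue-instance line: name, qtype, resv/used/tot., load_avg,
# arch and an optional states column. Job lines and separators do not match.
QUEUE_LINE = re.compile(
    r"^(\S+@\S+)\s+\S+\s+(\d+)/(\d+)/(\d+)\s+(\S+)\s+(\S+)\s*(\S*)$"
)


def summarize(text):
    """Return (total used slots, {queue instance: states}) for the given text."""
    used = 0
    flagged = {}
    for line in text.splitlines():
        match = QUEUE_LINE.match(line)
        if match is None:
            continue
        queue, _resv, used_slots, _total, _load, _arch, states = match.groups()
        used += int(used_slots)
        if states:
            flagged[queue] = states
    return used, flagged


used_slots, flagged = summarize(QSTAT_F)
print("used slots:", used_slots)
print("queue instances with states set:", flagged)
```

If the used-slot count is greater than zero while the balancer still
reports an average job duration of 0 secs, the two views of the cluster
disagree, which is the situation described above.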

All the best,

Sergio