<div dir="ltr">Hi fellows,<div><br></div><div style>I have a cluster of 5 nodes (cc1.4xlarge) running two jobs, each one with 40 nodes.</div><div style>I'm trying to use the load balancer to kill the cluster after the jobs are done.</div>
<div style>One strange thing is that, although the jobs are running in the queue, as you can see here:</div><div style><br></div><div style><div>queuename qtype resv/used/tot. load_avg arch states</div>
<div>---------------------------------------------------------------------------------</div><div>all.q@master BIP 0/8/16 0.42 linux-x64</div><div> 2 0.55500 serra85 sgeadmin r 02/09/2013 11:52:17 8</div>
<div>---------------------------------------------------------------------------------</div><div>all.q@node001 BIP 0/8/1 -NA- linux-x64 auo</div><div> 2 0.55500 serra85 sgeadmin r 02/09/2013 11:52:17 8</div>
<div>---------------------------------------------------------------------------------</div><div>all.q@node002 BIP 0/8/1 -NA- linux-x64 auo</div><div> 2 0.55500 serra85 sgeadmin r 02/09/2013 11:52:17 8</div>
<div>---------------------------------------------------------------------------------</div><div>all.q@node003 BIP 0/8/1 -NA- linux-x64 auo</div><div> 2 0.55500 serra85 sgeadmin r 02/09/2013 11:52:17 8</div>
<div>---------------------------------------------------------------------------------</div><div>all.q@node004 BIP 0/8/1 -NA- linux-x64 auo</div><div> 2 0.55500 serra85 sgeadmin r 02/09/2013 11:52:17 8</div>
<div><br></div><div style>If I issue this command in StarCluster: $ starcluster loadbalance newave -n 1</div><div><br></div><div style>this is what I get:</div></div><div style><br></div><div style><div>>>> Loading full job history</div>
<div>Execution hosts: 5</div><div>Queued jobs: 0</div><div>Avg job duration: 0 secs</div><div>Avg job wait time: 0 secs</div><div>Last cluster modification time: 2013-02-09 15:32:21</div><div>>>> Not adding nodes: already at or above maximum (5)</div>
<div>>>> Looking for nodes to remove...</div><div>>>> No nodes can be removed at this time</div><div>>>> Sleeping...(looping again in 60 secs)</div><div><br></div><div style>It seems that the load balancer didn't get the right Avg job duration and could wrongly kill the cluster even though there are jobs running.</div>
<div style><br></div><div style>All the best,</div><div style><br></div><div style>Sergio</div></div></div>