<div dir="ltr">Hi all - I'm running the latest version of StarCluster from GitHub and using the loadbalance feature. I have 10 jobs in the queue, with 1 running.<div><br></div><div style>StarCluster just tried adding a node and failed as follows:</div>
<div style><br></div><div style><div>>>> Loading full job history</div><div>Execution hosts: 1</div><div>Queued jobs: 10</div><div>Oldest queued job: 2013-07-02 13:33:29</div><div>Avg job duration: 179 secs</div>
<div>Avg job wait time: 119 secs</div><div>Last cluster modification time: 2013-07-02 13:36:42</div><div>>>> A job has been waiting for 923 sec, longer than max 900</div><div>*** WARNING - Adding 1 nodes at 2013-07-02 13:48:52.123504</div>
<div>>>> Launching node(s): node001</div><div>SpotInstanceRequest:sir-0c581634</div><div>>>> Waiting for node(s) to come up... (updating every 30s)</div><div>>>> Waiting for all nodes to be in a 'running' state...</div>
<div>1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% </div><div>>>> Waiting for SSH to come up on all nodes...</div><div>1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% </div>
<div>>>> Waiting for cluster to come up took 0.021 mins</div><div>!!! ERROR - Failed to add new host</div><div>Traceback (most recent call last):</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/balancers/sge/__init__.py", line 666, in _eval_add_node</div>
<div> self._cluster.add_nodes(need_to_add)</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 888, in add_nodes</div><div> node = self.get_node_by_alias(alias)</div>
<div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 732, in get_node_by_alias</div><div> raise exception.InstanceDoesNotExist(alias, label='node')</div>
<div>InstanceDoesNotExist: node 'node001' does not exist</div><div>>>> Sleeping...(looping again in 60 secs)</div><div><br></div><div><br></div><div style>It looks like the node never came up:</div><div style>
<br></div><div style><div>[ec2-user@ip-10-28-206-211 ~]$ starcluster listclusters</div><div>StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.9999)</div><div>Software Tools for Academics and Researchers (STAR)</div>
<div>Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div><div><br></div><div>-------------------------------------------</div><div>ngscluster (security group: @sc-ngscluster)</div>
<div>-------------------------------------------</div><div>Launch time: 2013-07-02 13:05:52</div><div>Uptime: 0 days, 00:45:07</div><div>Zone: us-east-1a</div><div>Keypair: aws_starcluster_keypair</div><div>Spot requests: 1 open<br>
</div><div>Cluster nodes:</div><div> master running i-65a6f305 <a href="http://ec2-50-19-10-231.compute-1.amazonaws.com">ec2-50-19-10-231.compute-1.amazonaws.com</a></div><div>Total nodes: 1</div><div><br></div><div><br>
</div><div style>I thought this might be a spot pricing problem, but my max price is higher than the average price in the spot price history. Now when I try to rerun loadbalance, I get this error:</div><div style><br></div><div style><div>[ec2-user@ip-10-28-206-211 ~]$ starcluster loadbalance -m 20 ngscluster</div>
<div>StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.9999)</div><div>Software Tools for Academics and Researchers (STAR)</div><div>Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div>
<div><br></div><div>!!! ERROR - cluster ngscluster is not running</div><div><br></div><div style>However, listclusters says it is running (and, surprisingly, node001 is there too):</div><div style><br></div><div style><div>[ec2-user@ip-10-28-206-211 ~]$ starcluster listclusters</div>
<div>StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.9999)</div><div>Software Tools for Academics and Researchers (STAR)</div><div>Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div>
<div><br></div><div>-------------------------------------------</div><div>ngscluster (security group: @sc-ngscluster)</div><div>-------------------------------------------</div><div>Launch time: 2013-07-02 13:05:52</div><div>
Uptime: 0 days, 00:50:33</div><div>Zone: us-east-1a</div><div>Keypair: aws_starcluster_keypair</div><div>EBS volumes:</div><div> vol-b46254c9 on master:/dev/sdz (status: attached)</div><div>Spot requests: 1 active</div>
<div>Cluster nodes:</div><div> master running i-65a6f305 <a href="http://ec2-50-19-10-231.compute-1.amazonaws.com">ec2-50-19-10-231.compute-1.amazonaws.com</a></div><div> node001 running i-c5886faa <a href="http://ec2-107-21-176-10.compute-1.amazonaws.com">ec2-107-21-176-10.compute-1.amazonaws.com</a> (spot sir-0c581634)</div>
<div>Total nodes: 2</div><div><br></div><div style>qhost on the cluster doesn't see node001, so I tried to remove the node with removenode.</div><div style><br></div><div style><div>[ec2-user@ip-10-28-206-211 ~]$ starcluster removenode ngscluster node001 </div>
<div>StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.9999)</div><div>Software Tools for Academics and Researchers (STAR)</div><div>Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div>
<div><br></div><div>>>> Running plugin setupuserenv.SetupUserEnvironment</div><div>>>> Running plugin starcluster.plugins.users.CreateUsers</div><div>>>> Running plugin starcluster.plugins.sge.SGEPlugin</div>
<div>>>> Removing node001 from SGE</div><div>!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':</div><div>!!! ERROR - remote command 'source /etc/profile && qconf -dconf node001'</div>
<div>!!! ERROR - failed with status 1:</div><div>!!! ERROR - can't resolve hostname "node001"</div><div>!!! ERROR - can't delete configuration "node001" from list:</div><div>!!! ERROR - configuration does not exist</div>
<div><br></div><div><br></div><div style>How do I get starcluster back in a working state? I *just* started this cluster...</div><div style><br></div><div><br></div></div></div></div></div></div></div>