[StarCluster] loadbalance

Tue Jul 2 10:00:41 EDT 2013

I just reran 'starcluster loadbalance', and it is detecting the cluster
now... odd.
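
For anyone reading the log quoted below: the "waiting for 923 sec, longer than max 900" line can be reproduced from the timestamps in that same log. This is only a rough sketch of the rule as the log describes it, not StarCluster's actual balancer code; the function names and the constant are made up for illustration.

```python
from datetime import datetime

# Threshold taken from the log line "longer than max 900".
# The name MAX_WAIT_SECS is hypothetical, not a StarCluster setting.
MAX_WAIT_SECS = 900

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_waiting(oldest_queued, now):
    """Return how long the oldest queued job has waited, in whole seconds."""
    start = datetime.strptime(oldest_queued, TIME_FORMAT)
    end = datetime.strptime(now, TIME_FORMAT)
    return int((end - start).total_seconds())


def should_add_node(oldest_queued, now, max_wait=MAX_WAIT_SECS):
    """Hypothetical rule: add a node once the oldest job exceeds max_wait."""
    return seconds_waiting(oldest_queued, now) > max_wait


# Timestamps from the log: oldest queued job vs. the time the node was added.
waited = seconds_waiting("2013-07-02 13:33:29", "2013-07-02 13:48:52")
print(waited, should_add_node("2013-07-02 13:33:29", "2013-07-02 13:48:52"))
```

So the decision to add a node matches the log; the failure came afterwards, when the new node was looked up by alias.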

On Tue, Jul 2, 2013 at 9:59 AM, Ryan Golhar <ngsbioinformatics at gmail.com> wrote:

> Hi all - I'm running the latest version of starcluster from github and
> using the loadbalance feature.  I have 10 jobs in the queue, with 1 running.
>
> starcluster just tried adding a node and failed as follows:
>
> >>> Loading full job history
> Execution hosts: 1
> Queued jobs: 10
> Oldest queued job: 2013-07-02 13:33:29
> Avg job duration: 179 secs
> Avg job wait time: 119 secs
> Last cluster modification time: 2013-07-02 13:36:42
> >>> A job has been waiting for 923 sec, longer than max 900
> *** WARNING - Adding 1 nodes at 2013-07-02 13:48:52.123504
> >>> Launching node(s): node001
> SpotInstanceRequest:sir-0c581634
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for SSH to come up on all nodes...
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for cluster to come up took 0.021 mins
> !!! ERROR - Failed to add new host
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/balancers/sge/__init__.py", line 666, in _eval_add_node
>     self._cluster.add_nodes(need_to_add)
>   File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 888, in add_nodes
>     node = self.get_node_by_alias(alias)
>   File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 732, in get_node_by_alias
>     raise exception.InstanceDoesNotExist(alias, label='node')
> InstanceDoesNotExist: node 'node001' does not exist
> >>> Sleeping...(looping again in 60 secs)
>
>
> It looks like the node never came up:
>
> [ec2-user at ip-10-28-206-211 ~]$ starcluster listclusters
> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
>
> -------------------------------------------
> ngscluster (security group: @sc-ngscluster)
> -------------------------------------------
> Launch time: 2013-07-02 13:05:52
> Uptime: 0 days, 00:45:07
> Zone: us-east-1a
> Keypair: aws_starcluster_keypair
> Spot requests: 1 open
> Cluster nodes:
>      master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
> Total nodes: 1
>
>
> I thought this might be a spot history pricing problem, but my max price
> is higher than the avg price.  Now when I try to rerun loadbalance, I get
> the error:
>
> [ec2-user at ip-10-28-206-211 ~]$ starcluster loadbalance -m 20 ngscluster
> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
>
> !!! ERROR - cluster ngscluster is not running
>
> However, listclusters says it's running (and, surprisingly, node001 is there
> too):
>
> [ec2-user at ip-10-28-206-211 ~]$ starcluster listclusters
> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
>
> -------------------------------------------
> ngscluster (security group: @sc-ngscluster)
> -------------------------------------------
> Launch time: 2013-07-02 13:05:52
> Uptime: 0 days, 00:50:33
> Zone: us-east-1a
> Keypair: aws_starcluster_keypair
> EBS volumes:
>     vol-b46254c9 on master:/dev/sdz (status: attached)
> Spot requests: 1 active
> Cluster nodes:
>      master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
>     node001 running i-c5886faa ec2-107-21-176-10.compute-1.amazonaws.com(spot sir-0c581634)
> Total nodes: 2
>
> qhost on the cluster doesn't see node001, so I tried to remove the node
> with removenode.
>
> [ec2-user at ip-10-28-206-211 ~]$ starcluster removenode ngscluster node001
> StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
>
> >>> Running plugin setupuserenv.SetupUserEnvironment
> >>> Running plugin starcluster.plugins.users.CreateUsers
> >>> Running plugin starcluster.plugins.sge.SGEPlugin
> >>> Removing node001 from SGE
> !!! ERROR - Error occured while running plugin
> 'starcluster.plugins.sge.SGEPlugin':
> !!! ERROR - remote command 'source /etc/profile && qconf -dconf node001'
> !!! ERROR - failed with status 1:
> !!! ERROR - can't resolve hostname "node001"
> !!! ERROR - can't delete configuration "node001" from list:
> !!! ERROR - configuration does not exist
>
>
> How do I get starcluster back in a working state?  I *just* started this
> cluster...
>
>
>