[StarCluster] loadbalance

Ryan Golhar ngsbioinformatics at gmail.com
Tue Jul 2 09:59:22 EDT 2013

Hi all - I'm running the latest version of StarCluster from GitHub and
using the loadbalance feature.  I have 10 jobs in the queue, with 1 running.

starcluster just tried adding a node and failed as follows:

>>> Loading full job history
Execution hosts: 1
Queued jobs: 10
Oldest queued job: 2013-07-02 13:33:29
Avg job duration: 179 secs
Avg job wait time: 119 secs
Last cluster modification time: 2013-07-02 13:36:42
>>> A job has been waiting for 923 sec, longer than max 900
*** WARNING - Adding 1 nodes at 2013-07-02 13:48:52.123504
>>> Launching node(s): node001
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>> Waiting for SSH to come up on all nodes...
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>> Waiting for cluster to come up took 0.021 mins
!!! ERROR - Failed to add new host
Traceback (most recent call last):
line 666, in _eval_add_node
line 888, in add_nodes
    node = self.get_node_by_alias(alias)
line 732, in get_node_by_alias
    raise exception.InstanceDoesNotExist(alias, label='node')
InstanceDoesNotExist: node 'node001' does not exist
>>> Sleeping...(looping again in 60 secs)
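
As a sanity check, the 923-second wait in the log matches the timestamps it prints: the gap between the oldest queued job and the moment the node was added. Below is a minimal sketch of that arithmetic. It is not StarCluster's actual code; the names MAX_WAIT_SECS and wait_seconds are hypothetical, and the 900-second limit is simply the "max" value shown in the log.

```python
from datetime import datetime

# Threshold taken from the log line "longer than max 900" (assumed to be seconds).
MAX_WAIT_SECS = 900


def wait_seconds(oldest_queued, now):
    """Return whole seconds the oldest queued job has been waiting."""
    return int((now - oldest_queued).total_seconds())


# Timestamps copied from the log output above.
oldest = datetime(2013, 7, 2, 13, 33, 29)        # "Oldest queued job"
now = datetime(2013, 7, 2, 13, 48, 52, 123504)   # "Adding 1 nodes at ..."

waited = wait_seconds(oldest, now)
print(waited, waited > MAX_WAIT_SECS)
```

So the decision to add a node looks consistent with the queue state; the failure happens afterwards, when the new node is looked up by alias.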

It looks like the node never came up:

[ec2-user@ip-10-28-206-211 ~]$ starcluster listclusters
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

ngscluster (security group: @sc-ngscluster)
Launch time: 2013-07-02 13:05:52
Uptime: 0 days, 00:45:07
Zone: us-east-1a
Keypair: aws_starcluster_keypair
Spot requests: 1 open
Cluster nodes:
     master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
Total nodes: 1

I thought this might be a spot history pricing problem, but my max price is
higher than the avg price.  Now when I try to rerun loadbalance, I get the
following error:

[ec2-user@ip-10-28-206-211 ~]$ starcluster loadbalance -m 20 ngscluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

!!! ERROR - cluster ngscluster is not running

However, listclusters says it's running (and, surprisingly, node001 is there):

[ec2-user@ip-10-28-206-211 ~]$ starcluster listclusters
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

ngscluster (security group: @sc-ngscluster)
Launch time: 2013-07-02 13:05:52
Uptime: 0 days, 00:50:33
Zone: us-east-1a
Keypair: aws_starcluster_keypair
EBS volumes:
    vol-b46254c9 on master:/dev/sdz (status: attached)
Spot requests: 1 active
Cluster nodes:
     master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
    node001 running i-c5886faa ec2-107-21-176-10.compute-1.amazonaws.com(spot sir-0c581634)
Total nodes: 2

qhost on the cluster doesn't see node001, so I tried to remove the node
with removenode:

[ec2-user@ip-10-28-206-211 ~]$ starcluster removenode ngscluster node001
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

>>> Running plugin setupuserenv.SetupUserEnvironment
>>> Running plugin starcluster.plugins.users.CreateUsers
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Removing node001 from SGE
!!! ERROR - Error occured while running plugin
!!! ERROR - remote command 'source /etc/profile && qconf -dconf node001'
!!! ERROR - failed with status 1:
!!! ERROR - can't resolve hostname "node001"
!!! ERROR - can't delete configuration "node001" from list:
!!! ERROR - configuration does not exist

How do I get starcluster back in a working state?  I *just* started this