[StarCluster] ELB: Failing to Add Spot-Bid Instances
Lyn Gerner
schedulerqueen at gmail.com
Thu Mar 7 14:23:22 EST 2013
PS: this is my version:
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
On Thu, Mar 7, 2013 at 11:16 AM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> Hi All,
>
> I am experimenting with StarCluster load balancing of SGE using spot-priced
> instances.
>
> After the wait-time threshold is exceeded, StarCluster puts in a new spot
> instance request. Once the instance is available and SSHable, StarCluster
> errors out as it tries to add the node to the cluster. Because the first
> spot instance fails to join, after stabilization the balancer tries to add
> another node, with the same error result:
>
> >>> Loading full job history
> Execution hosts: 2
> Queued jobs: 18
> Oldest queued job: 2013-03-07 18:18:54
> Avg job duration: 252 secs
> Avg job wait time: 7 secs
> Last cluster modification time: 2013-03-07 18:03:11
> >>> A job has been waiting for 454 sec, longer than max 300
> *** WARNING - Adding 1 nodes at 2013-03-07 18:26:32.471812
> >>> Launching node(s): node002
> SpotInstanceRequest:sir-4763a012
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 2/2 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> >>> Waiting for SSH to come up on all nodes...
> 2/2 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> >>> Waiting for cluster to come up took 0.017 mins
> !!! ERROR - Failed to add new host
> >>> Sleeping...(looping again in 60 secs)
>
> >>> Waiting for all nodes to come up...
> >>> Waiting for all nodes to come up...
> <snip>...
>
> >>> Loading full job history
> Execution hosts: 2
> Queued jobs: 17
> Oldest queued job: 2013-03-07 18:20:48
> Avg job duration: 282 secs
> Avg job wait time: 252 secs
> Last cluster modification time: 2013-03-07 18:03:11
> >>> A job has been waiting for 798 sec, longer than max 300
> *** WARNING - Adding 1 nodes at 2013-03-07 18:34:10.519812
> >>> Launching node(s): node003
> SpotInstanceRequest:sir-02a87c11
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> >>> Waiting for SSH to come up on all nodes...
> 3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> >>> Waiting for cluster to come up took 0.021 mins
> !!! ERROR - Failed to add new host
> >>> Sleeping...(looping again in 60 secs)
>
> >>> Waiting for all nodes to come up...
> ...
>
> /etc/hosts on the master confirms the nodes haven't joined:
>
> [prod at master scripts]$ cat /etc/hosts
> 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
> ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
> 10.76.35.21 master
> 10.119.19.94 node001
>
> Unfortunately, the extra nodes are up and being charged by AWS, but are
> not doing the cluster's work.
>
> And starcluster listclusters shows the cluster with all four nodes:
>
> -----------------------------
> e1d (security group: @sc-e1d)
> -----------------------------
> Launch time: 2013-03-07 09:50:57
> Uptime: 0 days, 00:48:50
> Zone: us-east-1d
> Keypair: lapuserkey
> EBS volumes:
> <snip> ...
> Spot requests: 3 active
> Cluster nodes:
> master running i-f8ad948b ec2-50-17-99-136.compute-1.amazonaws.com
> node001 running i-c4a79eb7 ec2-50-17-178-183.compute-1.amazonaws.com(spot sir-88e00c14)
> node002 running i-7a437b09 ec2-54-234-178-59.compute-1.amazonaws.com(spot sir-4763a012)
> node003 running i-f82a128b ec2-23-20-158-195.compute-1.amazonaws.com(spot sir-02a87c11)
> Total nodes: 4
>
> Here is the debug log's relevant info; same error mode for the next
> attempt to add node003:
>
> 2013-03-07 10:26:34,302 PID: 94806 __init__.py:680 - ERROR - Failed to add new host
> 2013-03-07 10:26:34,308 PID: 94806 __init__.py:681 - DEBUG - Traceback (most recent call last):
>   File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/balancers/sge/__init__.py", line 675, in _eval_add_node
>     self._cluster.add_nodes(need_to_add)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 838, in add_nodes
>     node = self.get_node_by_alias(alias)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 707, in get_node_by_alias
>     raise exception.InstanceDoesNotExist(alias, label='node')
> InstanceDoesNotExist: node 'node002' does not exist
>
> Is there a place in the node-adding code where I could productively add a
> delay, so that no further action is taken until the hostname has been added
> to /etc/hosts, to try to overcome this failure? I'll appreciate any
> feedback.
>
> Thanks,
> Lyn
>
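For anyone who wants to experiment with the kind of delay asked about above, here is a minimal sketch of a generic retry-with-delay helper. It is not StarCluster code: the name retry_with_delay and its parameters are hypothetical, and whether a delay actually resolves the InstanceDoesNotExist error in the traceback is untested.

```python
import time


def retry_with_delay(action, retry_on, attempts=5, delay_secs=30,
                     sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper for illustration, not part of StarCluster.
    retry_on is the exception class (or tuple of classes) that means
    "not ready yet"; any other exception propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay_secs)  # wait before looking again
```

Going by the traceback, the call that raises is get_node_by_alias(alias), so that lookup looks like the natural thing to pass as action, with exception.InstanceDoesNotExist as retry_on. Wrapping add_nodes itself would presumably submit another spot request on every retry, which is the opposite of what is wanted here.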