[StarCluster] ELB: Failing to Add Spot-Bid Instances

Lyn Gerner schedulerqueen at gmail.com
Thu Mar 7 14:16:03 EST 2013

Hi All,

I am experimenting with StarCluster loadbalancing of SGE using spot-priced
instances.

After the wait time threshold is exceeded, starcluster puts in a new spot
instance request. When the instance becomes available and SSHable, starcluster
errors out as it tries to add the node to the cluster. Because the first
spot instance fails to join, after stabilization it tries to add another
node, with the same error:

>>> Loading full job history
Execution hosts: 2
Queued jobs: 18
Oldest queued job: 2013-03-07 18:18:54
Avg job duration: 252 secs
Avg job wait time: 7 secs
Last cluster modification time: 2013-03-07 18:03:11
>>> A job has been waiting for 454 sec, longer than max 300
*** WARNING - Adding 1 nodes at 2013-03-07 18:26:32.471812
>>> Launching node(s): node002
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>> Waiting for SSH to come up on all nodes...
2/2 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>> Waiting for cluster to come up took 0.017 mins
!!! ERROR - Failed to add new host
>>> Sleeping...(looping again in 60 secs)

>>> Waiting for all nodes to come up...
>>> Waiting for all nodes to come up...

>>> Loading full job history
Execution hosts: 2
Queued jobs: 17
Oldest queued job: 2013-03-07 18:20:48
Avg job duration: 282 secs
Avg job wait time: 252 secs
Last cluster modification time: 2013-03-07 18:03:11
>>> A job has been waiting for 798 sec, longer than max 300
*** WARNING - Adding 1 nodes at 2013-03-07 18:34:10.519812
>>> Launching node(s): node003
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>> Waiting for SSH to come up on all nodes...
3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>> Waiting for cluster to come up took 0.021 mins
!!! ERROR - Failed to add new host
>>> Sleeping...(looping again in 60 secs)

>>> Waiting for all nodes to come up...

/etc/hosts on the master confirms the nodes haven't joined:

[prod@master scripts]$ cat /etc/hosts
localhost localhost.localdomain localhost4
::1         localhost localhost.localdomain localhost6
localhost6.localdomain6 master node001

Unfortunately, the extra nodes are up and being charged by AWS, but are not
doing the cluster's work.
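
For anyone wanting to script the same check, here is a minimal sketch that lists which expected aliases are absent from /etc/hosts-style text. The function names are my own and not part of StarCluster, and the addresses in the sample are placeholders, since the addresses did not survive in the output above.

```python
def aliases_in_hosts(hosts_text):
    """Return the set of hostnames and aliases named in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # The first field is the address; the rest are names for it.
        names.update(fields[1:])
    return names


def missing_nodes(expected_aliases, hosts_text):
    """Return the expected aliases that do not appear in hosts_text, sorted."""
    present = aliases_in_hosts(hosts_text)
    return sorted(a for a in expected_aliases if a not in present)


if __name__ == "__main__":
    # Placeholder addresses, for illustration only.
    sample = (
        "127.0.0.1 localhost localhost.localdomain\n"
        "10.0.0.1 master\n"
        "10.0.0.2 node001\n"
    )
    print(missing_nodes(["master", "node001", "node002", "node003"], sample))
```

On the master, the text would come from reading /etc/hosts, and the expected aliases from the node names that listclusters reports.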

And starcluster listclusters shows the cluster w/all four nodes:

e1d (security group: @sc-e1d)
Launch time: 2013-03-07 09:50:57
Uptime: 0 days, 00:48:50
Zone: us-east-1d
Keypair: lapuserkey
EBS volumes:
   <snip> ...
Spot requests: 3 active
Cluster nodes:
     master running i-f8ad948b ec2-50-17-99-136.compute-1.amazonaws.com
    node001 running i-c4a79eb7 ec2-50-17-178-183.compute-1.amazonaws.com (spot sir-88e00c14)
    node002 running i-7a437b09 ec2-54-234-178-59.compute-1.amazonaws.com (spot sir-4763a012)
    node003 running i-f82a128b ec2-23-20-158-195.compute-1.amazonaws.com (spot sir-02a87c11)
Total nodes: 4

Here is the debug log's relevant info; same error mode for the next attempt
to add node003:

2013-03-07 10:26:34,302 PID: 94806 __init__.py:680 - ERROR - Failed to add new host
2013-03-07 10:26:34,308 PID: 94806 __init__.py:681 - DEBUG - Traceback (most recent call last):
line 675, in _eval_add_node
line 838, in add_nodes
    node = self.get_node_by_alias(alias)
line 707, in get_node_by_alias
    raise exception.InstanceDoesNotExist(alias, label='node')
InstanceDoesNotExist: node 'node002' does not exist

Is there a place in the node-adding functionality where I could
productively add a delay, so that there's no further action until after the
hostname has been added to /etc/hosts, to try to overcome this failure?
I'd appreciate any feedback.
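
To make the idea concrete: the traceback shows the failure surfacing in get_node_by_alias, so one candidate spot for a delay is around that lookup. The following is a rough sketch of a generic retry-with-delay helper, not StarCluster code. The helper name, its parameters, and the defaults are all hypothetical, and I have not confirmed that retrying at this point would actually resolve the error.

```python
import time


def retry_until(fn, retry_on, attempts=10, delay=15.0, sleep=time.sleep):
    """Call fn() until it stops raising retry_on, pausing between tries.

    Re-raises the last exception if every attempt fails. The sleep
    argument is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)


if __name__ == "__main__":
    # Stand-in for a lookup that only succeeds on the third call.
    state = {"calls": 0}

    def lookup():
        state["calls"] += 1
        if state["calls"] < 3:
            raise LookupError("node 'node002' does not exist")
        return "node002"

    print(retry_until(lookup, LookupError, attempts=5, delay=0.1))
```

In the add_nodes path, the wrapped call would presumably be something like self.get_node_by_alias(alias), retrying on exception.InstanceDoesNotExist, both of which appear in the traceback above.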
