[StarCluster] ELB: Failing to Add Spot-Bid Instances

Lyn Gerner schedulerqueen at gmail.com
Thu Mar 7 14:16:03 EST 2013


Hi All,

I am experimenting with StarCluster load balancing of SGE using spot-priced
instances.

Once the wait-time threshold is exceeded, StarCluster submits a new spot
instance request. When the instance becomes available and reachable over
SSH, StarCluster errors out while trying to add the node to the cluster.
Because the first spot instance fails to join, the balancer tries again
after the stabilization period to add another node, with the same error:

>>> Loading full job history
Execution hosts: 2
Queued jobs: 18
Oldest queued job: 2013-03-07 18:18:54
Avg job duration: 252 secs
Avg job wait time: 7 secs
Last cluster modification time: 2013-03-07 18:03:11
>>> A job has been waiting for 454 sec, longer than max 300
*** WARNING - Adding 1 nodes at 2013-03-07 18:26:32.471812
>>> Launching node(s): node002
SpotInstanceRequest:sir-4763a012
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 0.017 mins
!!! ERROR - Failed to add new host
>>> Sleeping...(looping again in 60 secs)

>>> Waiting for all nodes to come up...
>>> Waiting for all nodes to come up...
<snip>...

>>> Loading full job history
Execution hosts: 2
Queued jobs: 17
Oldest queued job: 2013-03-07 18:20:48
Avg job duration: 282 secs
Avg job wait time: 252 secs
Last cluster modification time: 2013-03-07 18:03:11
>>> A job has been waiting for 798 sec, longer than max 300
*** WARNING - Adding 1 nodes at 2013-03-07 18:34:10.519812
>>> Launching node(s): node003
SpotInstanceRequest:sir-02a87c11
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 0.021 mins
!!! ERROR - Failed to add new host
>>> Sleeping...(looping again in 60 secs)

>>> Waiting for all nodes to come up...
...

/etc/hosts on the master confirms the nodes haven't joined:

[prod at master scripts]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.76.35.21 master
10.119.19.94 node001

Unfortunately, the extra nodes are up and being charged by AWS, but are not
doing the cluster's work.

And starcluster listclusters shows the cluster with all four nodes:

-----------------------------
e1d (security group: @sc-e1d)
-----------------------------
Launch time: 2013-03-07 09:50:57
Uptime: 0 days, 00:48:50
Zone: us-east-1d
Keypair: lapuserkey
EBS volumes:
   <snip> ...
Spot requests: 3 active
Cluster nodes:
     master running i-f8ad948b ec2-50-17-99-136.compute-1.amazonaws.com
    node001 running i-c4a79eb7 ec2-50-17-178-183.compute-1.amazonaws.com (spot sir-88e00c14)
    node002 running i-7a437b09 ec2-54-234-178-59.compute-1.amazonaws.com (spot sir-4763a012)
    node003 running i-f82a128b ec2-23-20-158-195.compute-1.amazonaws.com (spot sir-02a87c11)
Total nodes: 4
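
The mismatch between the two listings above can also be checked mechanically. The following is only a minimal sketch: the function names `hosts_aliases` and `missing_nodes` are hypothetical helpers, not part of StarCluster, and the alias list is typed in by hand from the listclusters output rather than obtained through any StarCluster API.

```python
def hosts_aliases(hosts_text):
    """Return every hostname or alias found in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # first field is the IP address
    return names


def missing_nodes(cluster_aliases, hosts_text):
    """Return the aliases that have no entry in the given hosts text."""
    known = hosts_aliases(hosts_text)
    return [alias for alias in cluster_aliases if alias not in known]


# /etc/hosts content as shown on the master above.
HOSTS = """\
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.76.35.21 master
10.119.19.94 node001
"""

# Aliases copied by hand from the listclusters output.
ALIASES = ["master", "node001", "node002", "node003"]

print(missing_nodes(ALIASES, HOSTS))  # ['node002', 'node003']
```

Run against the data above, it reports node002 and node003 as the instances that are running but were never added to the master's /etc/hosts.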

Here is the relevant portion of the debug log; the next attempt, to add
node003, fails the same way:

2013-03-07 10:26:34,302 PID: 94806 __init__.py:680 - ERROR - Failed to add new host
2013-03-07 10:26:34,308 PID: 94806 __init__.py:681 - DEBUG - Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/balancers/sge/__init__.py", line 675, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 838, in add_nodes
    node = self.get_node_by_alias(alias)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 707, in get_node_by_alias
    raise exception.InstanceDoesNotExist(alias, label='node')
InstanceDoesNotExist: node 'node002' does not exist

Is there a place in the node-adding code where I could productively add a
delay, so that no further action is taken until the hostname has been added
to /etc/hosts, to try to overcome this failure? I would appreciate any
feedback.
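
As a rough illustration of the kind of delay being asked about, here is a generic retry wrapper. It is only a sketch under stated assumptions: `wait_for_node` and `NodeNotFound` are hypothetical names, with `NodeNotFound` standing in for the `InstanceDoesNotExist` exception seen in the traceback, and `lookup` standing in for a call such as `get_node_by_alias`. Whether retrying at that point actually resolves the failure in StarCluster 0.93.3 is untested.

```python
import time


class NodeNotFound(Exception):
    """Hypothetical stand-in for the InstanceDoesNotExist error in the traceback."""


def wait_for_node(lookup, alias, timeout=120.0, interval=5.0, sleep=time.sleep):
    """Call lookup(alias) until it succeeds or `timeout` seconds have been waited.

    `lookup` is any callable that returns the node or raises NodeNotFound.
    `sleep` is injectable so the loop can be exercised without real delays.
    """
    waited = 0.0
    while True:
        try:
            return lookup(alias)
        except NodeNotFound:
            if waited >= timeout:
                raise  # give up and surface the original error
            sleep(interval)
            waited += interval
```

A caller would wrap the failing lookup, for example `wait_for_node(my_lookup, "node002", timeout=60)`, where `my_lookup` is whatever function currently raises the "does not exist" error.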

Thanks,
Lyn