[StarCluster] ELB exceeding cluster size limits.

Don MacMillen macd at physware.com
Wed May 18 03:12:18 EDT 2011


Hi,

The ELB overshoots the cluster size limit when the dreaded "instance ID
'blah' does not exist" error occurs. As most of you know, there can be a
timing issue between creating an instance, getting its ID, and then trying
to access it.  Everyone must wrestle with this.  In my own code there is a
simple back-off and retry with a timeout attached.  It usually works, but I
am not happy with it.
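
For anyone who has not written one, here is a minimal sketch of the kind of
back-off and retry with a timeout I mean. It is not StarCluster's code. The
names retry_with_backoff and InstanceNotFound are made up for illustration,
and the exception class is only a stand-in for whatever error your EC2
library raises for InvalidInstanceID.NotFound.

```python
import time


class InstanceNotFound(Exception):
    """Hypothetical stand-in for EC2's InvalidInstanceID.NotFound error."""


def retry_with_backoff(fetch, timeout=60.0, initial_delay=1.0, max_delay=16.0,
                       retry_on=(InstanceNotFound,), sleep=time.sleep,
                       clock=time.monotonic):
    """Call fetch() until it succeeds or `timeout` seconds have elapsed.

    The wait between attempts doubles each time, capped at `max_delay`.
    If the deadline passes, the last error is re-raised to the caller.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            return fetch()
        except retry_on:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # out of time: surface the last error
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)


# Assumed usage: wrap the call that fails in the traceback, translating the
# library's "not found" error into InstanceNotFound (or passing the library's
# own exception class via retry_on), for example:
#   retry_with_backoff(lambda: conn.get_instance_attribute(iid, 'userData'))
```

The sleep and clock parameters are only there so the helper can be exercised
without real waiting. Note that a retry like this only hides the race; it
does not tell you whether the instance was eventually registered or not.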

This happened twice while the SC ELB was ramping from 1 to 10 nodes, so I
wound up with a cluster size of 12.  Of course, a qstat on the master did
not show the two now-orphaned nodes, and I could kill them.  But this is not
a robust solution.
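
The manual check amounts to a set difference between what EC2 says was
launched and what the scheduler knows about. A rough sketch, assuming you
already have both lists in hand (find_orphans and both argument names are
hypothetical, and how you obtain the lists is up to you):

```python
def find_orphans(launched_aliases, scheduler_hosts):
    """Return launched node aliases that the scheduler does not list.

    launched_aliases: aliases of instances running in the cluster's group
                      (assumed to come from an EC2 listing).
    scheduler_hosts:  host names the queueing system reports
                      (assumed to be parsed from its status output).
    """
    known = set(scheduler_hosts)
    return sorted(alias for alias in launched_aliases if alias not in known)


# Illustrative names only.
launched = ["master", "node001", "node002", "node003"]
in_scheduler = ["master", "node001"]
print(find_orphans(launched, in_scheduler))
```

Anything this returns is a candidate for termination, but it should be
confirmed first, since a node that is still booting would also show up.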

So the question is, just how prevalent is this problem?  What do others do
to prevent this?  I imagine that ELB must use the same addnode code as
the other parts of starcluster, so this is not a problem specific to
ELB.

Any thoughts and comments appreciated.

Regards,

Don

Don MacMillen
PhysWare



PID: 3368 __init__.py:645 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 775, in add_nodes
    self.wait_for_cluster(msg="Waiting for node(s) to come up...")
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 1038, in wait_for_cluster
    nodes = self.nodes
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 658, in nodes
    if n.is_master():
  File "build/bdist.linux-i686/egg/starcluster/node.py", line 690, in is_master
    return self.alias == "master"
  File "build/bdist.linux-i686/egg/starcluster/node.py", line 89, in alias
    user_data = self.ec2.get_instance_user_data(self.id)
  File "build/bdist.linux-i686/egg/starcluster/awsutils.py", line 389, in get_instance_user_data
    attributes = self.conn.get_instance_attribute(i.id, 'userData')
  File "build/bdist.linux-i686/egg/boto/ec2/connection.py", line 685, in get_instance_attribute
    InstanceAttribute, verb='POST')
  File "build/bdist.linux-i686/egg/boto/connection.py", line 611, in get_object
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-c9c931a7' does not exist</Message></Error></Errors><RequestID>05ab9b6e-66bf-4453-b7bc-0d5effaa23af</RequestID></Response>