Hi,<br><br>This happens when the dreaded 'Instance ID 'blah' does not exist' error occurs.<br>As most of you know, there can be a timing issue in creating an instance, getting<br>its ID, and then trying to access it. Everyone must wrestle with this. In my<br>
code there is a simple backoff and retry with a timeout attached. It usually<br>works, but I am not happy with it.<br><br>This happened twice while the SC ELB was ramping from 1 to 10, so I wound<br>up with a cluster size of 12. Of course, a qstat on the master did not show<br>
the two now-orphaned nodes, and I could kill them. But this is not a robust<br>solution.<br><br>So the question is, just how prevalent is this problem? What do others do to<br>prevent this? I imagine that ELB must use the same addnode code as<br>
the other parts of StarCluster, so this is not a problem specific to ELB.<br><br>Any thoughts and comments appreciated.<br><br>Regards,<br><br>Don<br><br>Don MacMillen<br>PhysWare<br><br><br><br>PID: 3368 __init__.py:645 - DEBUG - Traceback (most recent call last):<br>
File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node<br> self._cluster.add_nodes(need_to_add)<br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 775, in add_nodes<br>
self.wait_for_cluster(msg="Waiting for node(s) to come up...")<br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 1038, in wait_for_cluster<br> nodes = self.nodes<br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 658, in nodes<br>
if n.is_master():<br> File "build/bdist.linux-i686/egg/starcluster/node.py", line 690, in is_master<br> return self.alias == "master"<br> File "build/bdist.linux-i686/egg/starcluster/node.py", line 89, in alias<br>
user_data = self.ec2.get_instance_user_data(self.id)<br> File "build/bdist.linux-i686/egg/starcluster/awsutils.py", line 389, in get_instance_user_data<br> attributes = self.conn.get_instance_attribute(i.id, 'userData')<br>
File "build/bdist.linux-i686/egg/boto/ec2/connection.py", line 685, in get_instance_attribute<br> InstanceAttribute, verb='POST')<br> File "build/bdist.linux-i686/egg/boto/connection.py", line 611, in get_object<br>
raise self.ResponseError(response.status, response.reason, body)<br>EC2ResponseError: EC2ResponseError: 400 Bad Request<br><?xml version="1.0" encoding="UTF-8"?><br><Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-c9c931a7' does not exist</Message></Error></Errors><RequestID>05ab9b6e-66bf-4453-b7bc-0d5effaa23af</RequestID></Response><br>
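<br>For reference, here is a minimal sketch of the kind of backoff and retry with a timeout described above. It is not StarCluster's code. The names retry_with_backoff and is_instance_not_found are hypothetical, and the check assumes the raised exception exposes the EC2 error code as an error_code attribute, as I believe boto's EC2ResponseError does.<br><br>

```python
import time


def retry_with_backoff(call, is_transient, timeout=60.0, initial_delay=0.5,
                       max_delay=8.0, sleep=time.sleep, clock=time.time):
    """Call `call()` until it succeeds, retrying only transient errors.

    The delay doubles after each failure (capped at max_delay). Once the
    next sleep would run past `timeout` seconds, the last error is re-raised.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            return call()
        except Exception as exc:
            # Give up on anything that is not the known "not visible yet"
            # error, or when another sleep would exceed the deadline.
            if not is_transient(exc) or clock() + delay > deadline:
                raise
        sleep(delay)
        delay = min(delay * 2, max_delay)


def is_instance_not_found(exc):
    # Assumption: the exception carries the EC2 error code in `error_code`.
    # getattr keeps this safe for exceptions without that attribute.
    return getattr(exc, "error_code", None) == "InvalidInstanceID.NotFound"
```

A call such as the one in the traceback could then be wrapped as retry_with_backoff(lambda: ec2.get_instance_user_data(instance_id), is_instance_not_found), where ec2 and instance_id stand in for the real objects. This only narrows the timing window. If the timeout expires, the caller still has to decide what to do with the instance that was launched, otherwise it ends up orphaned as described above.<br>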
<br><br><br>