Hi,<br><br>It turns out that there is another error when using ELB that can result in orphaned<br>nodes. The experiment this time was to start a cluster with an initial size of 3 and let<br>it sit until ELB had reduced it to only the master, then submit 500 3-minute jobs to<br>
SGE. As the cluster size ramps up, some of the nodes fail to be added to SGE even<br>though they have been spun up. Our plugin's 'on_add_node' code also fails. Since this<br>code attaches a tag to the instance, it is easy to see which ones fail. Logging into the master and<br>
looking at qhost confirms that these nodes are not in the SGE grid and are thus orphaned.<br><br>In this experiment, the cluster max given to ELB was 10. The error below happened<br>3 times, while the error I reported previously happened once, so there were a total<br>
of 4 orphaned nodes and a total 'cluster' size of 14, of which only 10 are usable.<br><br>Could the plugin code be the culprit here? It only ssh's one command on the<br>added node (to start an upstart daemon) and then attaches the tag to the instance.<br>
That's it.<br><br>I won't be able to dig into this issue for at least a week, so it would be great if<br>someone recognizes the issue. Many thanks.<br><br>Best Regards,<br><br>Don MacMillen<br><br>BTW, this is code cloned from the github repo last Friday.<br>
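For reference, the plugin's on_add_node logic is essentially the following sketch. The daemon name and tag key here are made up for illustration (the real plugin code is not shown in this thread), and a real plugin would subclass starcluster.clustersetup.ClusterSetup; the import is omitted to keep the sketch standalone.

```python
# Minimal sketch of the on_add_node plugin described above: run one command
# over ssh on the new node (to start an upstart daemon), then tag the
# instance. "mydaemon" and "daemon-started" are hypothetical names.
class DaemonTagPlugin(object):
    TAG_KEY = "daemon-started"  # hypothetical tag key

    def on_add_node(self, node, nodes, master, user, user_shell, volumes):
        # node.ssh.execute and node.instance.add_tag are the usual
        # StarCluster/boto calls for these two steps.
        node.ssh.execute("start mydaemon")  # hypothetical upstart job
        node.instance.add_tag(self.TAG_KEY, "true")
```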
<br>PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes<br>PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes<br>PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]<br>
PID: 1677 cluster.py:1045 - INFO - Waiting for all nodes to be in a 'running' state...<br>PID: 1677 cluster.py:1056 - INFO - Waiting for SSH to come up on all nodes...<br>PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}<br>
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes<br>PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes<br>PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]<br>
PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}<br>PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes<br>
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes <br>PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>] <br>
PID: 1677 clustersetup.py:96 - INFO - Configuring hostnames... <br>PID: 1677 __init__.py:644 - ERROR - Failed to add new host. <br>
PID: 1677 __init__.py:645 - DEBUG - Traceback (most recent call last): <br> File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node <br>
self._cluster.add_nodes(need_to_add) <br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 783, in add_nodes <br>
self.volumes) <br> File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 512, in on_add_node <br>
self._setup_hostnames(nodes=[node]) <br> File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames <br>
self.pool.simple_job(node.set_hostname, (), jobid=node.alias) <br>AttributeError: 'NoneType' object has no attribute 'set_hostname' <br>
<br>PID: 1677 __init__.py:592 - INFO - Sleeping, looping again in 60 seconds. <br>
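From the traceback, `node` arrives as None inside `_setup_hostnames`, which fits the orphaned-node symptom: an instance was spun up but its Node lookup apparently returned None. A minimal sketch of the kind of defensive filter that would avoid the crash (illustrative only, assuming a pool with StarCluster's simple_job signature; this is not the actual StarCluster fix):

```python
# Sketch of a guard for the failure above: if the nodes list handed to
# hostname setup contains None entries, skip them instead of letting
# node.set_hostname raise AttributeError and abort the whole add_nodes call.
def setup_hostnames(nodes, pool):
    missing = sum(1 for n in nodes if n is None)
    if missing:
        # In real code this would be log.warn rather than print; one failed
        # node lookup should not block configuring the remaining nodes.
        print("skipping %d unresolved node(s)" % missing)
    for node in (n for n in nodes if n is not None):
        # Same call as in the traceback: queue set_hostname on the pool.
        pool.simple_job(node.set_hostname, (), jobid=node.alias)
```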
<br><br>