[StarCluster] Orphaned nodes with ELB
Justin Riley
jtriley at MIT.EDU
Mon May 23 10:18:58 EDT 2011
Hi Don,
Could you send me the complete log for this ELB run? You can get the full log using:
$ grep "PID: 1677" /tmp/starcluster-debug-<your_username_here>.log > complete-elb-run.log
Also, I don't think it's your plugin: the traceback below shows the AttributeError
being raised inside StarCluster's own _setup_hostnames() in clustersetup.py, which
runs before your plugin's on_add_node is called. Still, sending the plugin over for
inspection wouldn't hurt.
Thanks,
~Justin
On May 20, 2011, at 2:29 PM, Don MacMillen wrote:
> Hi,
>
> It turns out that there is another error when using ELB that can result in orphaned
> nodes. The current experiment was to start an initial cluster of size 3 and let it sit
> until ELB had reduced it to only the master, then submit 500 3-minute jobs to SGE.
> As the cluster size ramps back up, some of the nodes are spun up but fail to be added
> into SGE. Our plugin's 'on_add_node' code also fails on those nodes. Since that code
> attaches a tag to the instance, it is easy to see which ones fail. Logging into the
> master and looking at qhost confirms that these nodes are not in the SGE grid and
> are therefore orphaned.
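>
> (For reference, that job load can be generated with a single SGE array job, e.g.
> 500 tasks that each sleep for three minutes:
>
>     $ qsub -b y -t 1-500 sleep 180
>
> though the exact submission method shouldn't matter for this problem.)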
>
> In this experiment, the cluster max given to ELB was 10. The error below happened
> 3 times, while the error I reported previously happened once, so there were a total
> of 4 orphaned nodes and a total 'cluster' size of 14, of which only 10 are usable.
>
> Could the plugin code be the culprit here? It only ssh's one command on the added
> node (to start an upstart daemon) and then attaches a tag to the instance.
> That's it.
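>
> For concreteness, the plugin is essentially of the following shape (the class,
> daemon, and tag names are illustrative placeholders, not the actual code):
>
>     from starcluster.clustersetup import ClusterSetup
>
>     class TagAndStartDaemon(ClusterSetup):
>         def run(self, nodes, master, user, user_shell, volumes):
>             pass  # nothing extra to do at initial cluster start
>
>         def on_add_node(self, node, nodes, master, user, user_shell, volumes):
>             # ssh a single command on the new node to start the upstart daemon
>             node.ssh.execute('start mydaemon')  # 'mydaemon' is a placeholder
>             # then tag the underlying EC2 instance via boto
>             node.instance.add_tag('mycluster-node', node.alias)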
>
> I won't be able to dig into this issue for at least a week, so it would be great if
> someone already knows exactly what the issue is. Many thanks.
>
> Best Regards,
>
> Don MacMillen
>
> BTW, this is code from the GitHub repo that was cloned last Friday.
>
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
> PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
> PID: 1677 cluster.py:1045 - INFO - Waiting for all nodes to be in a 'running' state...
> PID: 1677 cluster.py:1056 - INFO - Waiting for SSH to come up on all nodes...
> PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
> PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
> PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
> PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
> PID: 1677 clustersetup.py:96 - INFO - Configuring hostnames...
> PID: 1677 __init__.py:644 - ERROR - Failed to add new host.
> PID: 1677 __init__.py:645 - DEBUG - Traceback (most recent call last):
> File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node
> self._cluster.add_nodes(need_to_add)
> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 783, in add_nodes
> self.volumes)
> File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 512, in on_add_node
> self._setup_hostnames(nodes=[node])
> File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
> self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
> AttributeError: 'NoneType' object has no attribute 'set_hostname'
>
> PID: 1677 __init__.py:592 - INFO - Sleeping, looping again in 60 seconds.
>
>