[StarCluster] Orphaned nodes with ELB

Justin Riley jtriley at MIT.EDU
Mon May 23 10:18:58 EDT 2011


Hi Don,

Could you send me the complete log for this ELB run? You can get the full log using:

$ grep "PID: 1677" /tmp/starcluster-debug-<your_username_here>.log > complete-elb-run.log
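
One caveat with a plain prefix match: in the excerpt quoted further down, the traceback lines (`File "..."`, `AttributeError: ...`) carry no `PID:` prefix, so they would be dropped from the filtered log. A minimal Python sketch that keeps those continuation lines, assuming the debug log follows the layout of that excerpt; the function name `extract_run` is made up for illustration:

```python
import re


def extract_run(lines, pid):
    """Return log lines for `pid`, plus unprefixed lines that follow them.

    Assumes each log record starts with "PID: <number> " and that traceback
    lines are written without that prefix, as in the excerpt in this thread.
    """
    record_start = re.compile(r"^PID: (\d+)\b")
    kept = []
    keeping = False
    for line in lines:
        match = record_start.match(line)
        if match:
            # A new record: keep it only if it belongs to the requested PID.
            keeping = int(match.group(1)) == pid
        # Unprefixed lines inherit the decision made for the last record.
        if keeping:
            kept.append(line)
    return kept
```

Passing the open debug log file and `1677` to `extract_run` and writing the returned lines to `complete-elb-run.log` should give the same records as the grep, with the traceback bodies included.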

Also, I don't think your plugin is the cause, but sending it over for inspection wouldn't hurt.

Thanks,

~Justin

On May 20, 2011, at 2:29 PM, Don MacMillen wrote:

> Hi,
> 
> It turns out that there is another error when using ELB that can result in orphaned
> nodes.  The current experiment was to start an initial cluster of size 3 and let
> it sit until ELB had reduced it to only the master, then submit 500 3-minute jobs to
> SGE.  As the cluster size ramps up, some of the nodes are spun up but fail to be added
> to SGE.  Our plugin's 'on_add_node' code also fails for them.  Since this
> code attaches a tag to the instance, it is easy to see which ones failed.  Logging into the master and
> looking at qhost confirms that these nodes are not in the SGE grid and so are orphaned.
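
The qhost check described above can be automated. The sketch below compares a list of running node aliases against the text printed by `qhost`. It assumes the usual qhost layout (a `HOSTNAME` header row, a dashed separator, a `global` row, then one row per execution host) and that SGE host names match the node aliases; `qhost_hostnames` and `find_orphans` are hypothetical helper names, not part of StarCluster:

```python
def qhost_hostnames(qhost_output):
    """Extract execution-host names from the text printed by `qhost`."""
    hosts = set()
    for line in qhost_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        # Skip the column header, the dashed separator, and the "global" row.
        if name == "HOSTNAME" or name == "global" or set(name) == {"-"}:
            continue
        hosts.add(name)
    return hosts


def find_orphans(running_aliases, qhost_output):
    """Return aliases of running instances that SGE does not list."""
    return sorted(set(running_aliases) - qhost_hostnames(qhost_output))
```

Any alias returned by `find_orphans` is an instance that is running (and billed) but cannot receive SGE jobs.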
> 
> In this experiment, the cluster max given to ELB was 10.  The error below happened
> 3 times, while the error previously submitted happened once.  So there were a total
> of 4 orphaned nodes and a total 'cluster' size of 14, of which only 10 are usable.
> 
> Could the plugin code be the culprit here?  It only runs one command over ssh on the
> added node (to start an upstart daemon) and then attaches the tag to the instance.
> That's it.
> 
> I won't be able to dig into this issue for at least a week, so it would be great if
> someone already knows exactly what is going on.  Many thanks.
> 
> Best Regards,
> 
> Don MacMillen
> 
> BTW, this is running code from the GitHub repo, cloned last Friday.
> 
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
> PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
> PID: 1677 cluster.py:1045 - INFO - Waiting for all nodes to be in a 'running' state...
> PID: 1677 cluster.py:1056 - INFO - Waiting for SSH to come up on all nodes...
> PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
> PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
> PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes                                       
> PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes                                       
> PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]    
> PID: 1677 clustersetup.py:96 - INFO - Configuring hostnames...                                                            
> PID: 1677 __init__.py:644 - ERROR - Failed to add new host.                                                               
> PID: 1677 __init__.py:645 - DEBUG - Traceback (most recent call last):                                                    
>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node                    
>     self._cluster.add_nodes(need_to_add)                                                                                  
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 783, in add_nodes                                        
>     self.volumes)                                                                                                         
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 512, in on_add_node                                 
>     self._setup_hostnames(nodes=[node])                                                                                   
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames                             
>     self.pool.simple_job(node.set_hostname, (), jobid=node.alias)                                                         
> AttributeError: 'NoneType' object has no attribute 'set_hostname'                                                         
>                                                                                                                           
> PID: 1677 __init__.py:592 - INFO - Sleeping, looping again in 60 seconds.                                                 
>                                                                                                                           
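
The traceback ends with `set_hostname` being looked up on `None`, which suggests that a `None` reached `_setup_hostnames` in place of a node object. Why that happens is not visible in this excerpt. The stand-alone sketch below is not StarCluster code; `FakeNode` and `set_hostnames` are hypothetical names that only reproduce the failure mode and show one defensive way to surface it:

```python
class FakeNode(object):
    """Hypothetical stand-in for a cluster node; not StarCluster's Node class."""

    def __init__(self, alias):
        self.alias = alias
        self.hostname = None

    def set_hostname(self):
        self.hostname = self.alias


def set_hostnames(nodes):
    """Set hostnames on real nodes and return how many entries were None."""
    missing = 0
    for node in nodes:
        if node is None:
            # Calling node.set_hostname here would raise the AttributeError
            # shown in the traceback, so count the entry instead.
            missing += 1
            continue
        node.set_hostname()
    return missing
```

A non-zero return value would mark the point where the instance lookup failed, which is earlier and more informative than the `AttributeError` raised inside the thread pool job.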
> 
> <ATT00001..c>




