My last post was sent from a new email address (my company is changing names)<br>and hasn't made it to the list yet. I have included it below, along with an update<br>on the intermittent failure of add_nodes.<br><br>In the method 'add_nodes' of the class 'Cluster', the call to the method <br>
get_node_by_alias() at line 801 can sometimes return None for the node,<br>as a conditional breakpoint in pdb (b cluster:802 ,node==None) will<br>show. Calling self.get_node_by_alias(alias) again returns<br>the valid node, so we have a timing problem.<br>
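<br>To illustrate the kind of workaround this suggests, here is a minimal sketch of retrying the lookup until the node shows up. The helper name get_node_with_retry and its retries/delay/sleep parameters are hypothetical and not part of StarCluster; lookup stands in for self.get_node_by_alias. It only papers over the timing problem and does not fix the underlying wait logic.

```python
import time


def get_node_with_retry(lookup, alias, retries=5, delay=2.0, sleep=time.sleep):
    """Call lookup(alias) until it returns a node or the retries run out.

    `lookup` stands in for Cluster.get_node_by_alias. This helper and its
    parameters are hypothetical, not StarCluster API.
    """
    for attempt in range(retries):
        node = lookup(alias)
        if node is not None:
            return node
        # Node not visible yet: pause before asking again, except after
        # the final attempt.
        if attempt < retries - 1:
            sleep(delay)
    raise RuntimeError("node %r not found after %d attempts" % (alias, retries))
```

Raising after the last attempt, instead of returning None, would at least replace the AttributeError on 'NoneType' with an error that names the missing alias.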
<br>My guess (and this is only a guess) is that there is a problem with<br>the logic in 'wait_for_cluster': one of the <br>waits in that method may not be waiting for the full new size of the cluster,<br>
so this timing problem would only show up during an<br>add_node after the initial cluster spin-up. In any event, I will let the experts<br>take it from here.<br><br>BTW, I do not have a good handle on the failure rate, but it looks to<br>
be below 10%.<br><br><br>Regards,<br><br>Don<br><br><br><br>Hi,<br><br>Two issues here, as reported earlier. On the first one, running with the new logging<br>turned on, I see an intermittent failure of 'starcluster addnode <clustername>'.<br>
The error trace from the log file is below.<br>
<br>Second, ELB adds too many nodes when adding more than one node<br>per iteration. The code in StarCluster/starcluster/<div id=":13k">plugins/sge/__init__.py<br>at line 637 reads:<br><br> if need_to_add > 0:<br>
&nbsp;&nbsp;&nbsp;&nbsp;need_to_add = min(self.add_nodes_per_iteration, need_to_add)<br>
<br>The fix could be as simple as:<br><br> if need_to_add > 0:<br>&nbsp;&nbsp;&nbsp;&nbsp;head_room = self.max_nodes - self.stat.hosts<br>&nbsp;&nbsp;&nbsp;&nbsp;need_to_add = min(self.add_nodes_per_iteration, need_to_add, head_room)<br>
<br>depending upon what you know about self.max_nodes and self.stat.hosts.<br><br>Regards,<br><br>Don<br><br>PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]<br>
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}<br>
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes<br>
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes<br>PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes<br>PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes<br>
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]<br>PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...<br>
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):<br> File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main<br> sc.execute(args)<br> File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute<br>
self.cm.add_node(tag, aliases)<br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node<br> cl.add_node(alias)<br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node<br>
self.add_nodes(1, aliases=aliases)<br> File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes<br> self.volumes)<br> File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node<br>
self._setup_hostnames(nodes=[node])<br> File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames<br> self.pool.simple_job(node.set_hostname, (), jobid=node.alias)<br>AttributeError: 'NoneType' object has no attribute 'set_hostname'<br>
<br>PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster<br>PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log<br>PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630<br>
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,<br>PID: 12630 cli.py:133 - ERROR - to <a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a><br>PID: 12630 ssh.py:536 - DEBUG - __del__ called<br>
PID: 12630 ssh.py:536 - DEBUG - __del__ called<br></div><br>
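On the second issue (too many nodes being added per iteration), the head-room clamp proposed above can be sketched as a standalone function. This is only an illustration: the function name nodes_to_add and the current_hosts parameter are hypothetical, and it assumes the current host count is available as a plain integer, which may not be what self.stat.hosts actually holds.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes, current_hosts):
    """Return how many nodes to request in this iteration.

    Hypothetical helper: `current_hosts` is assumed to be an integer count
    of hosts already in the cluster.
    """
    # Room left before the cluster reaches max_nodes (never negative).
    head_room = max(max_nodes - current_hosts, 0)
    # Take the smallest of: per-iteration limit, demand, and head room.
    return max(min(add_nodes_per_iteration, need_to_add, head_room), 0)
```

With max_nodes=10 and 9 hosts already running, a demand of 5 at 3 nodes per iteration is clamped to 1, so the cluster cannot overshoot its limit.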