[StarCluster] Orphaned nodes (addnode failure) and ELB going over max cluster size when adding more than one node
Don MacMillen
macd at nimbic.com
Sun May 29 16:19:03 EDT 2011
Hi,
Two issues here, as reported earlier. On the first one, running with the new
logging turned on, I see an intermittent failure of
'starcluster addnode <clustername>'. The error trace from the log file is
below.
Second, ELB adds too many nodes when it adds more than one node per
iteration, so the cluster can go over its max size. The code at
StarCluster/starcluster/plugins/sge/__init__.py, line 637, reads:
    if need_to_add > 0:
        need_to_add = min(self.add_nodes_per_iteration, need_to_add)
The fix could be as simple as:
    if need_to_add > 0:
        head_room = self.max_nodes - self.stat.hosts
        need_to_add = min(self.add_nodes_per_iteration, need_to_add,
                          head_room)
depending upon what you know about self.max_nodes and self.stat.hosts.
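
For what it's worth, here is a minimal standalone sketch of that clamp. The
function name and the plain-integer arguments are my own stand-ins, not
StarCluster API. In particular it assumes the current host count is
available as an integer, which may not match what self.stat.hosts actually
holds, and it floors the head room at zero in case the cluster is already
over the limit.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes, current_hosts):
    """Return how many nodes to request this iteration.

    Hypothetical helper: all arguments are assumed to be plain integers.
    The result never pushes current_hosts past max_nodes.
    """
    if need_to_add <= 0:
        return 0
    # Remaining capacity; never negative even if already over the limit.
    head_room = max(max_nodes - current_hosts, 0)
    return min(add_nodes_per_iteration, need_to_add, head_room)


# 9 hosts running, limit of 10: only one more node may be added.
print(nodes_to_add(5, 3, 10, 9))
# Plenty of room: the per-iteration limit applies.
print(nodes_to_add(5, 3, 10, 4))
```

Whether the master counts toward the limit would change what gets passed in
as current_hosts, so that part needs checking against the real code.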
Regards,
Don
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
    sc.execute(args)
  File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
    self.cm.add_node(tag, aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
    cl.add_node(alias)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
    self.add_nodes(1, aliases=aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
    self.volumes)
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
    self._setup_hostnames(nodes=[node])
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
AttributeError: 'NoneType' object has no attribute 'set_hostname'
PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
PID: 12630 cli.py:133 - ERROR - to starcluster at mit.edu
PID: 12630 ssh.py:536 - DEBUG - __del__ called
PID: 12630 ssh.py:536 - DEBUG - __del__ called