Thanks, Raj. Some additional information: we are using<div>slightly modified EBS-backed StarCluster images (the only modification</div><div>was adding our executables). The EBS-backed images probably have</div><div>different timing characteristics than the instance-store ones.</div>
<div><br></div><div>Another quick question: does 'starcluster terminate <clustername>'</div><div>call the 'on_remove_node' method of the plugin? It looks like</div><div>it does not, but apologies if this is already documented. From our</div>
<div>point of view, it would be useful for the terminate cluster command</div><div>to call this method.</div><div><br></div><div>Many thanks.</div><div><br></div><div>Don</div><div><br><br><div class="gmail_quote">On Wed, Jun 1, 2011 at 9:15 AM, <span dir="ltr"><<a href="mailto:starcluster-request@mit.edu">starcluster-request@mit.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Today's Topics:<br>
<br>
1. Orphaned nodes (addnode failure) and ELB going over max<br>
cluster size when adding more than one node (Don MacMillen)<br>
2. Re: Orphaned nodes (addnode failure) and ELB going over max<br>
cluster size when adding more than one node (Rajat Banerjee)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Sun, 29 May 2011 13:19:03 -0700<br>
From: Don MacMillen <<a href="mailto:macd@nimbic.com">macd@nimbic.com</a>><br>
Subject: [StarCluster] Orphaned nodes (addnode failure) and ELB going<br>
over max cluster size when adding more than one node<br>
To: <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a><br>
Message-ID: <BANLkTimKLfR-RMGV9OB==<a href="mailto:W_H_daXvhzxWg@mail.gmail.com">W_H_daXvhzxWg@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="iso-8859-1"<br>
<br>
Hi,<br>
<br>
Two issues here, as reported earlier. On the first one, running with the new<br>
logging turned on, I see an intermittent failure of<br>
'starcluster addnode <clustername>'. The error trace from the log file is below.<br>
<br>
Second, the ELB adds too many nodes when it adds more than one node<br>
per iteration. The code in StarCluster/starcluster/plugins/sge/__init__.py<br>
at line 637 reads:<br>
<br>
if need_to_add > 0:<br>
    need_to_add = min(self.add_nodes_per_iteration, need_to_add)<br>
<br>
The fix could be as simple as:<br>
<br>
if need_to_add > 0:<br>
    head_room = self.max_nodes - self.stat.hosts<br>
    need_to_add = min(self.add_nodes_per_iteration, need_to_add, head_room)<br>
<br>
depending upon what you know about self.max_nodes and self.stat.hosts.<br>
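<br>
As a minimal, standalone sketch of that head-room clamp: the function and<br>
parameter names below are hypothetical stand-ins, and plain integers take the<br>
place of self.max_nodes and self.stat.hosts, whose actual types are not<br>
confirmed here. The extra max(0, ...) is an added assumption so that a cluster<br>
already over its cap never yields a negative count.<br>

```python
def nodes_to_add(needed, per_iteration, max_nodes, current_hosts):
    """Return how many nodes to request this iteration without passing max_nodes.

    All four arguments are plain integers (hypothetical stand-ins for the
    balancer's own attributes).
    """
    # Head room is the number of slots left before the cluster hits its cap;
    # clamp at zero in case the cluster is already at or over the limit.
    head_room = max(0, max_nodes - current_hosts)
    if needed > 0:
        return min(per_iteration, needed, head_room)
    return 0


# A cluster of 4 hosts capped at 5: only one more node may be added,
# even though 3 are wanted and up to 2 are allowed per iteration.
print(nodes_to_add(needed=3, per_iteration=2, max_nodes=5, current_hosts=4))
```

Without the head_room term, the same call would return 2 and push the cluster<br>
to 6 hosts, one over the cap.<br>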
<br>
Regards,<br>
<br>
Don<br>
<br>
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]<br>
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}<br>
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes<br>
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes<br>
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes<br>
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes<br>
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]<br>
PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...<br>
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):<br>
  File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main<br>
    sc.execute(args)<br>
  File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute<br>
    self.cm.add_node(tag, aliases)<br>
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node<br>
    cl.add_node(alias)<br>
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node<br>
    self.add_nodes(1, aliases=aliases)<br>
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes<br>
    self.volumes)<br>
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node<br>
    self._setup_hostnames(nodes=[node])<br>
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames<br>
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)<br>
AttributeError: 'NoneType' object has no attribute 'set_hostname'<br>
<br>
PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster<br>
PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log<br>
PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630<br>
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,<br>
PID: 12630 cli.py:133 - ERROR - to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a><br>
PID: 12630 ssh.py:536 - DEBUG - __del__ called<br>
PID: 12630 ssh.py:536 - DEBUG - __del__ called<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Tue, 31 May 2011 14:46:54 -0400<br>
From: Rajat Banerjee <<a href="mailto:rbanerj@fas.harvard.edu">rbanerj@fas.harvard.edu</a>><br>
Subject: Re: [StarCluster] Orphaned nodes (addnode failure) and ELB<br>
going over max cluster size when adding more than one node<br>
To: Don MacMillen <<a href="mailto:macd@nimbic.com">macd@nimbic.com</a>><br>
Cc: <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a><br>
Message-ID: <BANLkTi=<a href="mailto:a0ZFQJ1B78-bBi_Zpw7A0PXSZZg@mail.gmail.com">a0ZFQJ1B78-bBi_Zpw7A0PXSZZg@mail.gmail.com</a>><br>
Content-Type: text/plain; charset=ISO-8859-1<br>
<br>
Hi Don,<br>
Thanks for your suggestions. I will test out the earlier suggestion as<br>
soon as I can and issue a pull request so that Justin can put it into<br>
the master branch.<br>
<br>
Regarding the latter suggestion, I think the add_node code needs to be<br>
more robust to detect and correct timing errors. Will work on it...<br>
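<br>
One possible shape for that robustness, as a rough sketch only: poll for the<br>
new node until it is visible instead of passing None on to the hostname setup.<br>
Nothing here is StarCluster's actual API; wait_for_node, lookup, and the<br>
timeout and interval values are hypothetical names and numbers chosen for<br>
illustration.<br>

```python
import time


def wait_for_node(lookup, alias, timeout=60.0, interval=5.0, sleep=time.sleep):
    """Poll lookup(alias) until it returns a node or the timeout expires.

    lookup is any callable returning a node object, or None while the
    instance is not yet visible (hypothetical interface).
    """
    waited = 0.0
    while True:
        node = lookup(alias)
        if node is not None:
            return node
        if waited >= timeout:
            raise RuntimeError(
                "node %s did not appear within %s seconds" % (alias, timeout))
        sleep(interval)
        waited += interval


# Simulated lookup that only sees the new instance on the third call.
calls = []


def fake_lookup(alias):
    calls.append(alias)
    return {"alias": alias} if len(calls) >= 3 else None


node = wait_for_node(fake_lookup, "node004", interval=0.0, sleep=lambda s: None)
print(node["alias"], len(calls))
```

Raising after the timeout turns the AttributeError on None into an explicit<br>
error that names the node that never appeared.<br>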
<br>
Raj<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
<br>
<br>
End of StarCluster Digest, Vol 22, Issue 1<br>
******************************************<br>
</blockquote></div><br></div>