Hi Justin,<div><br></div><div>First, apologies for the time lag in my response and thanks for your fixes in</div><div>ELB, they are appreciated.</div><div><br></div><div>Also, the on_shutdown is exactly what I wanted.  I just make that method,</div>

<div>as well as the on_remove_node method, call the same code to shut down</div><div>a node for our application, which just involves stopping an upstart daemon</div><div>and deleting a file.</div><div><br></div><div>There are still a couple of issues that I&#39;d like your thoughts on.  First is</div>

<div>that we are still seeing occasional failures due to timing / eventual consistency</div><div>of adding a node.  Here are the relevant lines from the log file:</div><div><br></div><div><div>PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes list</div>

<div>PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance ID &#39;i-eb030185&#39; does not exist</div></div><div><br></div><div>Does StarCluster return an error code when this happens?  I have looked at</div>

<div>the code, but not studied it enough to know for sure.  When we see starcluster</div><div>return a nonzero exit code, we terminate and then restart the cluster.  Is this what you</div><div>would recommend?</div><div><br></div>

<div>We are also seeing another kind of failure in provisioning the cluster.  We have</div><div>been experimenting with large cluster sizes (130 instances, sometimes with</div><div>the m2.4xlarge machine type).  What has happened is that in two of the 9 spin</div>

<div>ups of these large clusters, a single node does not have the nfs volume mounted</div><div>correctly.  It is, however, inserted into the SGE configuration, so jobs get submitted</div><div>to the node that can never run.  You might argue that 130 is beyond the useful</div>

<div>limit of nfs, but our use of it is small and controlled.  In any event, we will not be</div><div>running these large clusters in production, but are rather looking at characterization</div><div>and stress testing.</div>

<div><br></div><div>Since the StarCluster documentation recommends checking that nfs is configured</div><div>correctly on all nodes, can I assume that you have also seen this kind of error?</div><div>If so, any thoughts on its frequency and root cause?</div>
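<div><br></div><div>For reference, here is a minimal sketch of the kind of per-node check we have in mind. It only parses /proc/mounts-style text; how that text is fetched from each node is left out. The function name nfs_mounted and the /home mount point are assumptions made for illustration, not anything taken from StarCluster:</div>

```python
def nfs_mounted(mounts_text, mount_point):
    """Return True if mount_point is listed as an NFS mount in the given text.

    mounts_text is expected to look like the contents of /proc/mounts.
    This is a hypothetical helper, not part of StarCluster.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[1] == mount_point and fields[2] in ("nfs", "nfs4"):
            return True
    return False


# Sample text standing in for what a healthy node might report (made-up values).
sample = """\
/dev/xvda1 / ext4 rw,relatime 0 0
master:/home /home nfs rw,relatime,vers=3 0 0
"""

home_ok = nfs_mounted(sample, "/home")
data_ok = nfs_mounted(sample, "/data")
```

<div>A node for which this returns False for the shared directory would be the one to flag before any jobs are sent to it.</div>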

<div><br></div><div>One thing we can think about doing is to check the nfs configuration in the &#39;add&#39;</div><div>method of the plugin.  Easy enough.  But when a failure occurs, we would like</div><div>to correct it and here is where it gets interesting.  What I would like is to just </div>

<div>have access to the current Cluster instance and then to call its add_node and </div><div>remove_node methods, but I have not found a way to accomplish that.  Instead</div><div>it looks like we have to create a new cluster instance and before that a new config</div>

<div>instance so something like the following code can be made to work:</div><div><br></div><div>&lt;code&gt;</div><div><br></div><div><div>from starcluster.config import StarClusterConfig</div><div>from starcluster.cluster import ClusterManager</div>

</div><div>...</div><div>cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)</div><div><div>cfg.load()</div><div>cm = ClusterManager(cfg)</div><div>cluster = cm.get_cluster(cluster_name)</div></div><div>...</div><div>for node in nodes:</div>

<div>    if not nfs_ok(node):</div><div>        alias = node.alias</div><div>        cluster.remove_node(alias)</div><div>        cluster.add_node(alias)</div><div><br></div><div>&lt;/code&gt;</div><div><br></div><div>Would you recommend this way of correcting these failures?   It seems</div>

<div>like a cumbersome way to go about it.</div><div><br></div><div>A final suggestion / request, can you include timestamps on all the loggers?</div><div>We have seen a great deal of variability in the times needed for startup</div>

<div>and it would be great to characterize more closely.  For instance, the nfs</div><div>config time for a cluster size of 4 is usually around 30 seconds, but we </div><div>have seen it as high as over 3 minutes.</div><div>

<br></div><div>As I am sure you know, we could accomplish this by changing some format</div><div>strings in the logger.py module.  Perhaps something like the following:</div><div><br></div><div><div>INFO_FORMAT = &quot; &quot;.join([&#39;&gt;&gt;&gt;&#39;, &quot;%(asctime)s&quot;, &quot;%(message)s\n&quot;])</div>

<div>DEBUG_FORMAT = &quot;%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s\n&quot;</div><div>DEFAULT_CONSOLE_FORMAT = &quot;%(asctime)s %(levelname)s - %(message)s\n&quot;</div><div><br></div><div>But perhaps many would find this too ugly?</div>
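<div><br></div><div>To show what such a format produces, here is a small standalone sketch using only the standard logging module. It does not touch logger.py; the names TIMESTAMPED_FORMAT and make_logger, and the date format, are assumptions chosen for the example:</div>

```python
import io
import logging

# Hypothetical format string, modeled on the DEBUG_FORMAT idea with a timestamp.
TIMESTAMPED_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s"


def make_logger(name, stream):
    """Return a logger that writes timestamped records to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(TIMESTAMPED_FORMAT, datefmt="%Y-%m-%d %H:%M:%S")
    )
    logger.addHandler(handler)
    return logger


# Capture the output in memory so the formatted line can be inspected.
buf = io.StringIO()
log = make_logger("timing_demo", buf)
log.info("configuring NFS")
output = buf.getvalue()
```

<div>Each line then starts with a date and time, so the gap between two log lines (for instance around the nfs configuration step) can be read directly from the log.</div>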

<div><br></div><div>In any event, many thanks for your help and for your great work with StarCluster.</div><div><br></div><div>Best Regards,</div><div><br></div><div>Don</div><div><br></div></div><div><br><div class="gmail_quote">

On Fri, Jun 3, 2011 at 9:05 AM, Justin Riley <span dir="ltr">&lt;<a href="mailto:justin.t.riley@gmail.com">justin.t.riley@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div style="word-wrap:break-word">Hi Don/Raj,<div><br></div><div>I&#39;ve merged your pull request with minor changes so you should be able to test that the latest load balancer code doesn&#39;t add more nodes than it should (i.e., beyond max size). Don&#39;t forget to grab the latest code before testing. I&#39;m still working on the addnode failures you encountered, which I don&#39;t believe have anything to do with EBS vs. instance-store timing. I&#39;ll post updates when I have new code to test.</div>

<div><br></div><div><div><div>On Jun 2, 2011, at 9:21 AM, Don MacMillen wrote:</div><blockquote type="cite"><div>Another quick question: Does  &#39;starcluster terminate &lt;clustername&gt;&#39;</div><div>call the &#39;on_remove_node&#39; method of the plugin?   It looks like</div>

<div>it does not but apologies if this is documented already.  From our</div>

<div>point of view, it would be useful for the terminate cluster command</div><div>to call this method.</div></blockquote><br></div><div>Stop/Terminate doesn&#39;t call on_remove_node but instead calls on_shutdown. The main difference is that on_shutdown receives *all* the nodes and is not called for each individual node to be removed. Will this work for you? You can browse the available plugin methods called by StarCluster in clustersetup.py:</div>

<div><br></div><div><a href="https://github.com/jtriley/StarCluster/blob/master/starcluster/clustersetup.py" target="_blank">https://github.com/jtriley/StarCluster/blob/master/starcluster/clustersetup.py</a></div><div><br>

</div><div>Specifically look at the ClusterSetup base class. </div><div><br></div><div>HTH,</div><div><br></div><font color="#888888"><div>~Justin</div><br></font></div></div></blockquote></div><br></div>