Hi Justin,

First, apologies for the lag in my response, and thanks for your fixes in
ELB; they are appreciated.

Also, on_shutdown is exactly what I wanted. I simply have that method,
as well as the on_remove_node method, call the same code to shut down
a node for our application, which just involves stopping an upstart daemon
and deleting a file.
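For reference, our plugin is shaped roughly like the sketch below. The class
name, daemon name, and file path are placeholders, and I am assuming the usual
method signatures from the ClusterSetup base class plus node.ssh.execute for
the remote commands; the point is just that both hooks funnel into one helper.

    from starcluster.clustersetup import ClusterSetup
    from starcluster.logger import log

    class OurAppPlugin(ClusterSetup):
        """Sketch: both shutdown paths call the same decommission helper."""

        def _decommission(self, node):
            # Placeholder daemon and file names for our application.
            node.ssh.execute('stop our-app-daemon')
            node.ssh.execute('rm -f /var/run/our-app.flag')

        def on_remove_node(self, node, nodes, master, user, user_shell, volumes):
            log.info("Decommissioning %s" % node.alias)
            self._decommission(node)

        def on_shutdown(self, nodes, master, user, user_shell, volumes):
            for node in nodes:
                self._decommission(node)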
There are still a couple of issues that I'd like your thoughts on. The first is
that we are still seeing occasional failures due to the timing / eventual
consistency of adding a node. Here are the relevant lines from the log file:

    PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes list
    PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance ID 'i-eb030185' does not exist

Does StarCluster return an error code when this happens? I have looked at
the code, but not studied it enough to know for sure. When we see starcluster
return a non-zero exit status, we terminate and then restart the cluster. Is
this what you would recommend?
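For context, our wrapper around the CLI does roughly the following. This is
only a sketch: the cluster name is made up, and I'm assuming terminate can be
told to skip its confirmation prompt (adjust the flag if the option name
differs in your version).

    # Sketch of our retry wrapper; cluster name is hypothetical.
    import subprocess
    import sys

    CLUSTER = "prodcluster"

    def starcluster(*args):
        """Run the starcluster CLI and return its exit status."""
        return subprocess.call(["starcluster"] + list(args))

    status = starcluster("start", CLUSTER)
    if status != 0:
        # A non-zero exit is our cue to tear down and retry once.
        starcluster("terminate", "--confirm", CLUSTER)  # flag name assumed
        status = starcluster("start", CLUSTER)
    sys.exit(status)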
We are also seeing another kind of failure in provisioning the cluster. We
have been experimenting with large cluster sizes (130 instances, sometimes
with the m2.4xlarge machine type). In two of the nine spin-ups of these large
clusters, a single node did not have the NFS volume mounted correctly. It was,
however, inserted into the SGE configuration, so jobs get submitted to that
node but can never run. You might argue that 130 nodes is beyond the useful
limit of NFS, but our use of it is small and controlled. In any event, we will
not be running these large clusters in production; we are rather looking at
characterization and stress testing.

Since the StarCluster documentation recommends checking that NFS is configured
correctly on all nodes, can I assume that you have also seen this kind of
error? If so, any thoughts on its frequency and root cause?
One thing we can consider doing is to check the NFS configuration in the 'add'
method of the plugin. Easy enough. But when a failure occurs, we would like to
correct it, and here is where it gets interesting. What I would like is to
just have access to the current Cluster instance and then call its add_node
and remove_node methods, but I have not found a way to accomplish that.
Instead, it looks like we have to create a new cluster instance, and before
that a new config instance, so that something like the following code can be
made to work:

    from starcluster.config import StarClusterConfig
    from starcluster.cluster import ClusterManager
    ...
    # Load our config and look up the running cluster by name.
    cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)
    cfg.load()
    cm = ClusterManager(cfg)
    cluster = cm.get_cluster(cluster_name)
    ...
    # Replace any node whose NFS mount fails our check.
    for node in nodes:
        if not nfs_ok(node):
            alias = node.alias
            cluster.remove_node(alias)
            cluster.add_node(alias)

Would you recommend this way of correcting these failures? It seems like a
cumbersome way to go about it.
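For completeness, the nfs_ok helper above would be something like the sketch
below. The mount point and the use of node.ssh.execute to run a remote command
are assumptions on my part; the idea is just to confirm the shared volume
shows up in the node's mount table.

    def nfs_ok(node, mount_point='/home'):
        """Return True if an NFS share appears at mount_point on the node.

        Sketch only: the mount point is an assumption, and any error from
        the remote command is treated as "not mounted".
        """
        try:
            lines = node.ssh.execute('mount')
        except Exception:
            return False
        return any(' %s ' % mount_point in line and 'nfs' in line
                   for line in lines)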
A final suggestion / request: could you include timestamps in all of the
loggers? We have seen a great deal of variability in the times needed for
startup, and it would be great to characterize it more closely. For instance,
the NFS config time for a cluster of size 4 is usually around 30 seconds, but
we have seen it take over 3 minutes.
As I am sure you know, we could accomplish this by changing some format
strings in the logger.py module. Perhaps something like the following:

    INFO_FORMAT = " ".join(['>>>', "%(asctime)s", "%(message)s\n"])
    DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"
    DEFAULT_CONSOLE_FORMAT = "%(asctime)s %(levelname)s - %(message)s\n"

But perhaps many would find this too ugly?
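Just to illustrate the output we are after (plain stdlib logging here, not the
actual logger.py wiring; the logger name and sample timestamp are made up, and
I've dropped the trailing "\n" since StreamHandler adds its own newline):

    # Illustration only: stdlib logging with the proposed DEBUG_FORMAT.
    import logging

    DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s"

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(DEBUG_FORMAT))

    log = logging.getLogger("starcluster.demo")
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)

    # Prints something like:
    # 2011-06-10 14:02:31,512 demo.py:14 - DEBUG - adding node i-eb030185 to self._nodes list
    log.debug("adding node i-eb030185 to self._nodes list")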
In any event, many thanks for your help and for your great work with
StarCluster.

Best Regards,

Don
On Fri, Jun 3, 2011 at 9:05 AM, Justin Riley <justin.t.riley@gmail.com> wrote:
> Hi Don/Raj,
>
> I've merged your pull request with minor changes so you should be able to
> test that the latest load balancer code doesn't add more nodes than it
> should (ie beyond max size). Don't forget to grab the latest code before
> testing. I'm still working on the addnode failures you encountered which I
> don't believe has anything to do with EBS vs instance-store timing. I'll
> post updates when I have new code to test.
>
> On Jun 2, 2011, at 9:21 AM, Don MacMillen wrote:
>> Another quick question: Does 'starcluster terminate <clustername>'
>> call the 'on_remove_node' method of the plugin? It looks like
>> it does not but apologies if this is documented already. From our
>> point of view, it would be useful for the terminate cluster command
>> to call this method.
>
> Stop/Terminate doesn't call on_remove_node but instead calls on_shutdown.
> The main difference is that on_shutdown receives *all* the nodes and is not
> called for each individual node to be removed. Will this work for you? You
> can browse the available plugin methods called by StarCluster in
> clustersetup.py:
>
> https://github.com/jtriley/StarCluster/blob/master/starcluster/clustersetup.py
>
> Specifically look at the ClusterSetup base class.
>
> HTH,
>
> ~Justin