Hi Justin,

Thanks for your detailed response.

On Tue, Jul 26, 2011 at 9:35 AM, Justin Riley <justin.t.riley@gmail.com> wrote:
> Hi Don,
>
> On 07/08/2011 07:51 AM, Don MacMillen wrote:
>
> > First, apologies for the time lag in my response and thanks for your
> > fixes in ELB, they are appreciated.
>
> My apologies as well for my lag in response. I've been on vacation and
> am now just getting back to catching up with emails, and this one is
> long overdue.
>
> > Also, the on_shutdown is exactly what I wanted. I just make that
> > method, as well as the on_remove_node method, call the same code to
> > shut down a node for our application, which just involves stopping
> > an upstart daemon and deleting a file.
>
> Great, glad that's working out for you.
>
> > There are still a couple of issues that I'd like your thoughts on.
> > First is that we are still seeing occasional failures due to the
> > timing / eventual consistency of adding a node. Here are the relevant
> > lines from the log file:
> >
> > PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes list
> > PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance ID 'i-eb030185' does not exist
> >
> > Does StarCluster return an error code when this happens? I have
> > looked at the code, but not studied it enough to know for sure. When
> > we see starcluster return a non-zero exit code, we terminate and then
> > restart the cluster. Is this what you would recommend?
>
> This is a tricky bug that's hard to debug and fix. The root cause is
> that StarCluster creates the instances and then immediately attempts
> to perform an operation that retrieves the instance's user_data (the
> ec2-get-instance-attribute API call). In some cases EC2's
> infrastructure isn't quite 'ready' yet for the call, given that it's
> performed almost instantaneously by StarCluster in order to fetch
> metadata assigned to the instance at launch time (such as the node's
> alias). This metadata is then used to tag the instances using the new
> tags & filters API[1], which allows faster and more advanced querying
> of the nodes. The majority of the time the call works without issue,
> especially on smaller clusters; however, every now and again, as
> you've seen, the call is made too soon and this error is produced.
>
> I'm working on a fix for this that will simply try/catch the call,
> sleep if unsuccessful, and keep trying until it's successful. That
> should reliably fix the issue once and for all (fingers crossed). I've
> created an issue on github to track this:
>
> https://github.com/jtriley/StarCluster/issues/36

Yes, I thought it must be an 'eventual consistency' issue with the EC2
database. Your fix sounds great and exactly what we need here.
> > We are also seeing another kind of failure in provisioning the
> > cluster. We have been experimenting with large cluster sizes (130
> > instances, sometimes with the m2.4xlarge machine type). What has
> > happened is that in two of the 9 spin-ups of these large clusters, a
> > single node does not have the NFS volume mounted correctly. It is,
> > however, inserted into the SGE configuration, so jobs get submitted
> > to the node that can never run. You might argue that 130 is beyond
> > the useful limit of NFS, but our use of it is small and controlled.
> > In any event, we will not be running these large clusters in
> > production, but are rather looking at characterization and stress
> > testing.
> >
> > Since the StarCluster documentation recommends checking that NFS is
> > configured correctly on all nodes, can I assume that you have also
> > seen this kind of error? If so, any thoughts on its frequency and
> > root cause?
> >
> > One thing we can think about doing is to check the NFS configuration
> > in the 'add' method of the plugin. Easy enough. But when a failure
> > occurs, we would like to correct it, and here is where it gets
> > interesting. What I would like is to just have access to the current
> > Cluster instance and then to call its add_node and remove_node
> > methods, but I have not found a way to accomplish that. Instead it
> > looks like we have to create a new cluster instance and before that
> > a new config instance so that something like the following code can
> > be made to work:
> >
> > <code>
> >
> > from starcluster.config import StarClusterConfig
> > from starcluster.cluster import ClusterManager
> > ...
> > cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)
> > cfg.load()
> > cm = ClusterManager(cfg)
> > cluster = cm.get_cluster(cluster_name)
> > ...
> > for node in nodes:
> >     if not nfs_ok(node):
> >         alias = node.alias
> >         cluster.remove_node(alias)
> >         cluster.add_node(alias)
> >
> > </code>
> >
> > Would you recommend this way of correcting these failures? It seems
> > like a cumbersome way to go about it.
>
> Did you check the logs to see if there were errors related to
> configuring NFS on the new nodes? Any errors related to mounting NFS
> *should* be printed to the screen as well. Ideally you shouldn't need
> to fiddle with NFS, so it'd be good to see what the issue is and try
> to resolve it properly in the core code rather than having it be a
> concern of your plugin. Perhaps we just need a sanity check after
> mounting the NFS share and to retry the mount automatically if that
> fails. However, before going that route I'd like to know if there are
> specific error messages in the log before attempting to blindly
> remount and hoping it fixes the issue.

I did check the logs and they do show the NFS errors. However, I don't
think (can't recall) that the errors were also printed to the screen,
but they were definitely in the log file in /tmp.
> Also, on the nodes that failed to be configured with NFS properly, did
> you check whether they were listed in /etc/exports on the master node
> and whether /etc/fstab on their local filesystems pointed to the
> corresponding NFS export on the master? If they were, I would
> recommend in the meantime having your plugin retry mounting the drive,
> possibly creating the mount directory if needed:
>
> mtpt = '/mnt/mymountpoint'
> if not node.ssh.isdir(mtpt):
>     node.ssh.mkdir(mtpt)
> node.ssh.execute('mount %s' % mtpt)

I did not think to check /etc/exports on the master nor /etc/fstab on
the slave nodes. I will be sure to check next time the issue occurs.
I did code the 'error recovery' into our plugin. However, I know that it
still has problems, since it failed the one time a new cluster showed
the NFS problems. However, that was a while ago and we just terminated
the cluster and spun up a new one. I still need to debug this
error-handling code, but that will be tricky (but not impossible).

Next time I am able to catch a cluster with a bad NFS mount, I will do
the checks you suggest and also look to move the plugin code to do the
remount as in your snippet; a rough sketch of what I have in mind is
below.
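For concreteness, here is roughly the check I'm planning to add to the
plugin. It's only a sketch: the mount point and the nfs_ok() helper are
placeholders from our own code rather than anything in StarCluster, and
I'm assuming node.ssh.execute() hands back the command's output lines
the way it appeared to when I poked at it:

MTPT = '/mnt/mymountpoint'  # placeholder; our real mount point differs

def nfs_ok(node, mtpt=MTPT):
    # Crude check: is the mount point listed among the node's active mounts?
    return any(mtpt in line for line in node.ssh.execute('mount'))

def ensure_nfs_mounted(node, mtpt=MTPT):
    # Retry the mount as you suggested, creating the directory if needed.
    if nfs_ok(node, mtpt):
        return
    if not node.ssh.isdir(mtpt):
        node.ssh.mkdir(mtpt)
    node.ssh.execute('mount %s' % mtpt)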
> > A final suggestion / request: can you include timestamps on all the
> > loggers? We have seen a great deal of variability in the times
> > needed for startup and it would be great to characterize it more
> > closely. For instance, the NFS config time for a cluster size of 4
> > is usually around 30 seconds, but we have seen it as high as over
> > 3 minutes.
> >
> > As I am sure you know, we could accomplish this by changing some
> > format strings in the logger.py module. Perhaps something like the
> > following:
> >
> > INFO_FORMAT = " ".join(['>>>', "%(asctime)s", "%(message)s\n"])
> > DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"
> > DEFAULT_CONSOLE_FORMAT = "%(asctime)s %(levelname)s - %(message)s\n"
> >
> > But perhaps many would find this too ugly?
>
> I'm game for putting %(asctime)s in the DEBUG_FORMAT string but not in
> the console (INFO) level because it clutters the output. Having it in
> DEBUG_FORMAT will make sure that all messages in the debug file are
> timestamped. This should meet your needs given that *all* log messages
> go to the debug file. Does that work for you? I created an issue to
> track this:
>
> https://github.com/jtriley/StarCluster/issues/35

That works great for me. Many thanks.
>
> > In any event, many thanks for your help and for your great work with
> > StarCluster.
>
> Thank you, my pleasure. Thanks for testing and reporting issues :D
>
> ~Justin