Hi Justin,

Thanks for your detailed response.

On Tue, Jul 26, 2011 at 9:35 AM, Justin Riley <justin.t.riley@gmail.com> wrote:
> Hi Don,
>
> On 07/08/2011 07:51 AM, Don MacMillen wrote:
>
> > First, apologies for the time lag in my response and thanks for your
> > fixes in ELB, they are appreciated.
>
> My apologies as well for my lag in response. I've been on vacation and
> am now just getting back to catching up with emails, and this one is
> long overdue.
>
> > Also, the on_shutdown is exactly what I wanted. I just make that
> > method, as well as the on_remove_node method, call the same code to
> > shut down a node for our application, which just involves stopping
> > an upstart daemon and deleting a file.
>
> Great, glad that's working out for you.
>
> > There are still a couple of issues that I'd like your thoughts on.
> > First is that we are still seeing occasional failures due to the
> > timing / eventual consistency of adding a node. Here are the relevant
> > lines from the log file:
> >
> > PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes list
> > PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance ID 'i-eb030185' does not exist
> >
> > Does StarCluster return an error code when this happens? I have
> > looked at the code, but not studied it enough to know for sure. When
> > we see starcluster return a non-zero exit code, we terminate and then
> > restart the cluster. Is this what you would recommend?
>
> This is a tricky bug that's hard to debug and fix. The root cause is
> that StarCluster creates the instances and then immediately attempts
> to perform an operation that retrieves the instance's user_data (the
> ec2-get-instance-attribute API call). In some cases EC2's
> infrastructure isn't quite 'ready' yet for the call, given that it's
> performed almost instantaneously by StarCluster in order to fetch
> metadata assigned to the instance at launch time (such as the node's
> alias). This metadata is then used to tag the instances using the new
> tags & filters API[1], which allows faster and more advanced querying
> of the nodes. The majority of the time the call works without issue,
> especially on smaller clusters; however, every now and again, as
> you've seen, the call is made too soon and this error is produced.
>
> I'm working on a fix for this that will simply try/catch the call,
> sleep if unsuccessful, and keep trying until it's successful. That
> should reliably fix the issue once and for all (fingers crossed). I've
> created an issue on github to track this:
>
> https://github.com/jtriley/StarCluster/issues/36

Yes, I thought it must be an 'eventual consistency' issue with the EC2
database. Your fix sounds great and exactly what we need here.
> > We are also seeing another kind of failure in provisioning the
> > cluster. We have been experimenting with large cluster sizes (130
> > instances, sometimes with the m2.4xlarge machine type). What has
> > happened is that in two of the 9 spin-ups of these large clusters, a
> > single node does not have the NFS volume mounted correctly. It is,
> > however, inserted into the SGE configuration, so jobs get submitted
> > to the node that can never run. You might argue that 130 is beyond
> > the useful limit of NFS, but our use of it is small and controlled.
> > In any event, we will not be running these large clusters in
> > production, but are rather looking at characterization and stress
> > testing.
> >
> > Since the StarCluster documentation recommends checking that NFS is
> > configured correctly on all nodes, can I assume that you have also
> > seen this kind of error? If so, any thoughts on its frequency and
> > root cause?
> >
> > One thing we can think about doing is to check the NFS configuration
> > in the 'add' method of the plugin. Easy enough. But when a failure
> > occurs, we would like to correct it, and here is where it gets
> > interesting. What I would like is to just have access to the current
> > Cluster instance and then to call its add_node and remove_node
> > methods, but I have not found a way to accomplish that. Instead it
> > looks like we have to create a new cluster instance and before that
> > a new config instance so that something like the following code can
> > be made to work:
> >
> > <code>
> >
> > from starcluster.config import StarClusterConfig
> > from starcluster.cluster import ClusterManager
> > ...
> > cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)
> > cfg.load()
> > cm = ClusterManager(cfg)
> > cluster = cm.get_cluster(cluster_name)
> > ...
> > for node in nodes:
> >     if not nfs_ok(node):
> >         alias = node.alias
> >         cluster.remove_node(alias)
> >         cluster.add_node(alias)
> >
> > </code>
> >
> > Would you recommend this way of correcting these failures? It seems
> > like a cumbersome way to go about it.
>
> Did you check the logs to see if there were errors related to
> configuring NFS on the new nodes? Any errors related to mounting NFS
> *should* be printed to the screen as well. Ideally you shouldn't need
> to fiddle with NFS, so it'd be good to see what the issue is and try
> to resolve it properly in the core code rather than having it be a
> concern of your plugin. Perhaps we just need a sanity check after
> mounting the NFS share and to retry the mount automatically if that
> fails. However, before going that route I'd like to know if there are
> specific error messages in the log before attempting to blindly
> remount and hoping it fixes the issue.

I did check the logs and they do show the NFS errors. However, I don't
think (can't recall) that the errors were also printed to the screen,
but they were definitely in the log file in /tmp.
> Also, on the nodes that failed to be configured with NFS properly, did
> you check whether they were listed in /etc/exports on the master node
> and whether /etc/fstab on their local filesystems pointed to the
> corresponding NFS export on the master? If they were, I would
> recommend in the meantime having your plugin retry mounting the drive,
> possibly creating the mount directory if needed:
>
> mtpt = '/mnt/mymountpoint'
> if not node.ssh.isdir(mtpt):
>     node.ssh.mkdir(mtpt)
> node.ssh.execute('mount %s' % mtpt)

I did not think to check /etc/exports on the master nor /etc/fstab on
the slave nodes. I will be sure to check next time the issue occurs.
I did code the 'error recovery' into our plugin. However, I know that it
still has problems, since it failed the one time a new cluster showed
the NFS problems. However, that was a while ago and we just terminated
the cluster and spun up a new one. I still need to debug this
error-handling code, but that will be tricky (but not impossible).

Next time I am able to catch a cluster with a bad NFS mount, I will do
the checks you suggest and also look to move the plugin code to do the
remount as in your snippet; a rough sketch of what I have in mind is
below.
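For concreteness, here is roughly the check I'm planning to add to the
plugin. It's only a sketch: the mount point and the nfs_ok() helper are
placeholders from our own code rather than anything in StarCluster, and
I'm assuming node.ssh.execute() hands back the command's output lines
the way it appeared to when I poked at it:

MTPT = '/mnt/mymountpoint'  # placeholder; our real mount point differs

def nfs_ok(node, mtpt=MTPT):
    # Crude check: is the mount point listed among the node's active mounts?
    return any(mtpt in line for line in node.ssh.execute('mount'))

def ensure_nfs_mounted(node, mtpt=MTPT):
    # Retry the mount as you suggested, creating the directory if needed.
    if nfs_ok(node, mtpt):
        return
    if not node.ssh.isdir(mtpt):
        node.ssh.mkdir(mtpt)
    node.ssh.execute('mount %s' % mtpt)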
> > A final suggestion / request: can you include timestamps on all the
> > loggers? We have seen a great deal of variability in the times
> > needed for startup and it would be great to characterize it more
> > closely. For instance, the NFS config time for a cluster size of 4
> > is usually around 30 seconds, but we have seen it as high as over
> > 3 minutes.
> >
> > As I am sure you know, we could accomplish this by changing some
> > format strings in the logger.py module. Perhaps something like the
> > following:
> >
> > INFO_FORMAT = " ".join(['>>>', "%(asctime)s", "%(message)s\n"])
> > DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"
> > DEFAULT_CONSOLE_FORMAT = "%(asctime)s %(levelname)s - %(message)s\n"
> >
> > But perhaps many would find this too ugly?
>
> I'm game for putting %(asctime)s in the DEBUG_FORMAT string but not in
> the console (INFO) level because it clutters the output. Having it in
> DEBUG_FORMAT will make sure that all messages in the debug file are
> timestamped. This should meet your needs given that *all* log messages
> go to the debug file. Does that work for you? I created an issue to
> track this:
>
> https://github.com/jtriley/StarCluster/issues/35

That works great for me. Many thanks.
>
> > In any event, many thanks for your help and for your great work with
> > StarCluster.
>
> Thank you, my pleasure. Thanks for testing and reporting issues :D
>
> ~Justin