[StarCluster] StarCluster Digest, Vol 22, Issue 1
Justin Riley
justin.t.riley at gmail.com
Tue Jul 26 12:35:40 EDT 2011
Hi Don,
On 07/08/2011 07:51 AM, Don MacMillen wrote:
> First, apologies for the time lag in my response and thanks for your
> fixes in
> ELB, they are appreciated.
My apologies as well for my lag in response. I've been on vacation and
am now just getting back to catching up with emails and this one is long
over due.
> Also, the on_shutdown is exactly what I wanted. I just make that method,
> as well as the on_remove_node method, call the same code to shut down
> a node for our application, which just involves stopping an upstart daemon
> and deleting a file.
Great, glad that's working out for you.
> There are still a couple of issues that I'd like your thoughts on. First is
> that we are still seeing occasional failures due to timing / eventual
> consistency
> of adding a node. Here are the relevant lines from the log file:
>
> PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes
> list
> PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance
> ID 'i-eb030185' does not exist
>
> Does StarCluster return an error code when this happens? I have looked at
> the code, but not studied it enough to know for sure. When we see
> starcluster
> return a non zero, we terminate and then restart the cluster. Is this
> what you
> would recommend?
This is a tricky bug that's hard to debug and fix. The root cause is
that StarCluster creates the instances and then immediately attempts to
perform an operation that retrieves the instance's user_data
(ec2-get-instance-attribute API call). In some cases EC2's
infrastructure isn't quite 'ready' yet most for the call given that it's
performed almost instantaneously by StarCluster in order fetch metadata
assigned to the instance at launch time (such as the node's alias). This
metadata is then used to tag the instances using the new tags & filters
API[1] which allows faster and more advanced querying of the nodes. The
majority of the time the call works without issue, especially on smaller
clusters, however, every now and again, as you've seen, the call is made
too soon and this error is produced.
I'm working on a fix for this that will simply try/catch the call, sleep
if unsuccessful, and keep trying until it's successful. That should
reliably fix the issue once and for all (fingers crossed). I've created
an issue on github to track this:
https://github.com/jtriley/StarCluster/issues/36
> We are also seeing another kind of failure in provisioning the cluster.
> We have
> been experimenting with large cluster sizes (130 instances, sometimes with
> the m2.4xlarge machine type). What has happened is that in two of the 9
> spin
> ups of these large clusters, a single node does not have the nfs volume
> mounted
> correctly. It is, however, inserted into the SGE configuration, so jobs
> get submitted
> to the node that can never run. You might argue that 130 is beyond the
> useful
> limit of nfs, but our use of it is small and controlled. In any event,
> we will not be
> running these large clusters in production, but are rather looking at
> characterization
> and stress testing.
>
> Since the starcluster documentation recommend checking that nfs is
> configured
> correctly on all nodes, can I assume that you have also seen this kind
> of error?
> If so, any thoughts on its frequency and root cause?
>
> One thing we can think about doing is to check the nfs configuration in
> the 'add'
> method of the plugin. Easy enough. But when a failure occurs, we would
> like
> to correct it and here is where it gets interesting. What I would like
> it to just
> have access to the current Cluster instance and then to call its
> add_node and
> remove_node methods, but I have not found a way to accomplish that. Instead
> it looks like we have to create a new cluster instance and before that a
> new config
> instance so something like the following code can be made to work:
>
> <code>
>
> from starcluster.config import StarClusterConfig
> from starcluster.cluster import ClusterManager
> ...
> cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)
> cfg.load()
> cm = ClusterManager(cfg)
> cluster = cm.get_cluster(cluster_name)
> ...
> for node in nodes:
> if not nfs_ok(node):
> alias = node.alias
> cluster.remove_node(alias)
> cluster.add_node(alias)
>
> </code>
>
> Would you recommend this way of correcting these failures? It seems
> like a cumbersome way to go about it.
Did you check the logs to see if there were errors related to
configuring NFS on the new nodes? Any errors related to mounting NFS
*should* be printed to the screen as well. Ideally you shouldn't need to
fiddle with NFS so it'd be good to see what the issue is and try to
resolve it properly in the core code rather than having it be a concern
of your plugin. Perhaps we just need a sanity check after mounting the
NFS-share and retry the mount automatically if that fails. However,
before going that route I'd like to know if there are specific error
messages in the log before attempting to blindly remount and hoping it
fixes the issue.
Also, on the nodes that failed to be configured with NFS properly did
you check if they were listed in /etc/exports on the master node and
that /etc/fstab on their local filesystems pointed to the corresponding
NFS export on the master? If they were I would recommend in the mean
time to have your plugin retry mounting the drive, possibly creating the
mount directory if needed:
mtpt = '/mnt/mymountpoint'
if not node.ssh.isdir(mtpt):
node.ssh.mkdir(mtpt)
node.ssh.execute('mount %s' % mtpt)
> A final suggestion / request, can you include timestamps on all the loggers?
> We have seen a great deal of variability in the times needed for startup
> and it would be great to characterize more closely. For instance, the nfs
> config time for a cluster size of 4 is usually around 30 seconds, but we
> have seen it as high as over 3 minutes.
>
> As I am sure you know, we could accomplish this by changing some format
> strings in the logger.py module. Perhaps something like the following:
>
> INFO_FORMAT = " ".join(['>>>', "%(asctime)s", "%(message)s\n"])
> DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s -
> %(message)s\n"
> DEFAULT_CONSOLE_FORMAT = "%(asctime) %(levelname)s - %(message)s\n"
>
> But perhaps many would find this too ugly?
I'm game for putting %(asctime)s in the DEBUG_FORMAT string but not in
the console (INFO) level because it clutters the output. Having it in
DEBUG_FORMAT will make sure that all messages in the debug file are
timestamped. This should meet your needs given that *all* log messages
go to the debug file. Does that work for you? I created an issue to
track this:
https://github.com/jtriley/StarCluster/issues/35
> In any event, many thanks for your help and for your great work with
> StarCluster.
Thank you, my pleasure. Thanks for testing and reporting issues :D
~Justin
More information about the StarCluster
mailing list