[StarCluster] Start timeout for spot instances

Thu Mar 22 17:08:20 EDT 2012

Hi David,

This is useful, although it could wind up wasting several instance hours
if, for example, all but one of the spot instances fail to come up. A better
approach would be for the start command to monitor the spot instance
requests up until a deadline, at which point any spot requests that
haven't been fulfilled yet are cancelled and on-demand instances are
requested in their place. This would save you from having to write this
logic in your script at all, and it also enables a mode where you 'get as
much spot as you can' up until a specified timeout. What do you think?
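
To make the idea concrete, here is a rough sketch of the deadline logic in
Python. It is not StarCluster code: the function name and the three callables
(get_states, cancel_requests, launch_on_demand) are hypothetical stand-ins
for whatever EC2 calls the start command would actually make, and it assumes
a fulfilled spot request reports the state "active".

```python
import time
from typing import Callable, Dict, List


def replace_unfilled_spot_requests(
    request_ids: List[str],
    get_states: Callable[[List[str]], Dict[str, str]],   # hypothetical: request id -> state
    cancel_requests: Callable[[List[str]], None],        # hypothetical: cancel spot requests
    launch_on_demand: Callable[[int], List[str]],        # hypothetical: returns instance ids
    timeout: float,
    poll_interval: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[str]]:
    """Wait for spot requests until a deadline, then fall back to on-demand."""
    deadline = clock() + timeout
    pending = set(request_ids)
    fulfilled: List[str] = []

    while pending:
        states = get_states(sorted(pending))
        # Assumption: a fulfilled spot request is reported as "active".
        for rid in sorted(pending):
            if states.get(rid) == "active":
                fulfilled.append(rid)
                pending.discard(rid)
        if not pending or clock() >= deadline:
            break
        sleep(poll_interval)

    cancelled = sorted(pending)
    on_demand: List[str] = []
    if cancelled:
        # Deadline passed: cancel what never came up and replace it one-for-one.
        cancel_requests(cancelled)
        on_demand = launch_on_demand(len(cancelled))

    return {"spot": fulfilled, "cancelled": cancelled, "on_demand": on_demand}
```

The clock and sleep arguments are only there so the loop can be exercised
without real waiting; a real implementation would also need to handle
requests that fail outright rather than just staying open.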

I've created an issue to keep track of this:

http://web.mit.edu/star/cluster/issues/98

~Justin

On Mon, Mar 19, 2012 at 12:18:48PM -0700, David Erickson wrote:
> On 3/18/2012 1:17 PM, David Erickson wrote:
> > Hi, it would be great if there were some kind of timeout option for spot
> > instances, i.e. if they aren't started by some deadline then shut down
> > everything and return an error exit code.  That way a script running
> > starcluster could then retry with regular on-demand instances if there
> > is a deadline for getting some work done.
>
> I should follow this up with some more details:
>
> My workload ideally requires 50 spot instances running SGE jobs.  I have
> 50 jobs, so running them all in parallel at once is ideal since this is
> one step in a serial process.  This weekend I ran my scripts that use
> StarCluster to set up a cluster, run jobs on it, and then tear it down.
> However, it was never able to allocate the 50 machines and hung
> there waiting for the spot instance requests (SIRs) to become active
> for 8 hours and 5 hours during two different sessions (primarily
> overnight).  I did some reading and apparently AWS will not launch any
> of the nodes in the group unless it is able to launch all of them (which
> I believe is wrong, because I tried a 25 node launch later and it
> launched 5, then hung on the remaining 20 for an hour before I gave up).
> What would be ideal for me would be for StarCluster to create a
> multi-zone cluster, possibly using load balance as a base, since my key
> goal is 50 machines and the network traffic in between is
> insignificant.  Presumably you would specify which zone houses the
> master, as it could have EBS attached to it that is then shared over NFS
> to the other machines.  Has there been any thought or code headed toward
> enabling something like this?
>
> Thanks,
> David
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster