[StarCluster] Start timeout for spot instances

Mon Mar 19 15:18:48 EDT 2012

On 3/18/2012 1:17 PM, David Erickson wrote:
> Hi, it would be great if there were some kind of timeout option for spot
> instances, i.e. if they aren't started by some deadline, then shut down
> everything and return an error exit code.  That way a script running
> starcluster could then retry with regular on-demand instances if there
> is a deadline for getting some work done.

I should follow this up with some more details:

My workload ideally needs 50 spot instances running SGE jobs.  I have 50
jobs, and since this is one step in a serial process, running them all
in parallel is ideal.  This weekend I ran my scripts, which use
StarCluster to set up a cluster, run jobs on it, and then tear it down.
However, StarCluster was never able to allocate the 50 machines: it hung
waiting for the spot instance requests (SIRs) to become active for 8
hours in one session and 5 hours in another (primarily overnight).  I
did some reading, and apparently AWS will not launch any of the nodes in
a group unless it can launch all of them.  That does not match what I
saw, though: I later tried a 25-node launch, and it launched 5 nodes and
then hung on the remaining 20 for an hour before I gave up.

What would be ideal for me is for StarCluster to create a multi-zone
cluster, possibly using load balance as a base, since my key goal is 50
machines and the network traffic between them is insignificant.
Presumably you would specify which zone houses the master, since it
could have an EBS volume attached that is then shared over NFS with the
other machines.  Has there been any thought or code headed toward
enabling something like this?
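
In the meantime, the timeout-and-retry behavior from my first message
could probably be approximated outside StarCluster with a small wrapper
script.  The following is only a rough sketch, not something StarCluster
provides: the function names are made up for illustration, and the
three command lines (spot start, cleanup, on-demand start) are passed in
by the caller, so nothing here assumes particular StarCluster options.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not finish
    within timeout_s seconds (the child process is killed in that case)."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


def start_with_fallback(spot_cmd, cleanup_cmd, ondemand_cmd, timeout_s):
    """Try the spot launch first; if it fails or misses the deadline,
    run the cleanup command and retry with the on-demand command.

    Returns "spot" or "ondemand" depending on which launch succeeded.
    All three commands are hypothetical placeholders supplied by the caller.
    """
    if run_with_deadline(spot_cmd, timeout_s) == 0:
        return "spot"

    print("spot launch did not complete; falling back", file=sys.stderr)
    # Tear down whatever was partially started (e.g. cancel open requests).
    subprocess.run(cleanup_cmd)

    if subprocess.run(ondemand_cmd).returncode == 0:
        return "ondemand"
    raise RuntimeError("both spot and on-demand launches failed")
```

A caller would pass something like a spot-priced "starcluster start"
command as spot_cmd, a "starcluster terminate" command as cleanup_cmd,
and a plain "starcluster start" as ondemand_cmd; the exact options
should be checked against the help output of the installed version.
Note that killing the local process on timeout does not by itself cancel
the open spot requests on the AWS side, which is why the cleanup step is
there.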

Thanks,
David