[StarCluster] when spot instances die

Cedar McKay cmckay at uw.edu
Fri Jul 24 22:02:11 EDT 2015


I have a cluster of about 80 nodes running SGE jobs. Amazon killed 20 of my nodes when supply became tight. Can someone suggest a strategy for recovering? I don’t really know what to do, because when I run `qstat` it still shows jobs running on the zombie nodes. In addition, `starcluster lc` still lists those dead instances as part of the cluster. How do I recover gracefully from this situation, and in particular make sure the affected jobs are resubmitted to other nodes?


Thanks for any advice!

best,
Cedar




