[StarCluster] Large cluster (125 nodes) launch failure

Fri Mar 25 13:28:00 EDT 2011

See comments inline.

On Fri, Mar 25, 2011 at 11:40 AM, Kyeong Soo (Joseph) Kim <
kyeongsoo.kim at gmail.com> wrote:

> For instance, the implementation of load
> balancing would be much simpler and better and, if needed, it can
> terminate all of the instances.
>
> As for my own experience with 25-node clusters, I found out that the
> load balancer did not terminate the master node, even though it
> finished all assigned jobs; the master node is a single point of
> contact and had to wait for all those jobs running in other nodes to
> finish.
>
>
There is a variable in starcluster/balancers/sge/__init__.py
called allow_master_kill:

    #This would allow the master to be killed when the queue empties. UNTESTED.
    allow_master_kill = False

Setting it to True would let the load balancer kill the master once the job
queue is empty. Feel free to change it and test it if you'd like.

This carries some risk: when the master is killed, the cluster is no longer
accessible, and your results may be lost (unless you were smart enough to
put them on EBS). I kept it semi-hidden because of these risks. Since you're
obviously interested, give it a try. I used it for a little while, and it
was able to terminate the master node when the jobs were finished. The
cluster tags, groups, etc. still exist afterwards, but they won't incur any
charges. At some later date you'd still have to call
'starcluster stop <cluster_tag>'.
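
To make the effect of the flag concrete, here is a minimal Python sketch of
how a switch like allow_master_kill could gate which nodes are eligible for
termination. This is not the actual StarCluster balancer code: Node,
nodes_to_terminate, and queued_jobs are hypothetical names made up for
illustration, and the real logic in starcluster/balancers/sge/__init__.py may
differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not a StarCluster class.
    alias: str
    is_master: bool
    running_jobs: int


def nodes_to_terminate(nodes: List[Node], queued_jobs: int,
                       allow_master_kill: bool = False) -> List[Node]:
    """Return the nodes that could be shut down right now."""
    # Idle workers are always candidates.
    idle_workers = [n for n in nodes
                    if not n.is_master and n.running_jobs == 0]
    if not allow_master_kill:
        return idle_workers

    # The master only becomes a candidate when nothing is queued and
    # nothing is running anywhere, since it is the single point of contact.
    cluster_idle = (queued_jobs == 0
                    and all(n.running_jobs == 0 for n in nodes))
    if cluster_idle:
        return idle_workers + [n for n in nodes if n.is_master]
    return idle_workers
```

With the flag left at False the master is never returned, which matches the
behaviour you saw on your 25-node clusters. With it set to True the master is
only returned once every node is idle and the queue is empty.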

Best,
Rajat