[StarCluster] Large cluster (125 nodes) launch failure

Fri Mar 25 13:06:48 EDT 2011

On Fri, Mar 25, 2011 at 3:40 PM, Kyeong Soo (Joseph) Kim
<kyeongsoo.kim at gmail.com> wrote:
<--snip-->

> In this regard it would be really great if we could implement a
> simpler batch queueing system on the StarCluster itself and therefore
> do away with the SGE. For instance, the implementation of load
> balancing would be much simpler and better and, if needed, it can
> completely terminate the whole instances.

<--snip-->

For one, I am really interested in an alternative to SGE, given its
uncertain status after Oracle's purchase of Sun. Having discussed
this at some length with Mr. Riley in previous conversations, I think
it's entirely likely that a good solution is possible using existing
open source software, at least to a large extent. However, some
components of SGE will be time-consuming to implement in comparable
fashion, for example the internal SGE load balancing mechanism.

In any event, we at my company have had an excellent experience
using the Python-based Celery Distributed Task Queue [1] on local
development and production clusters. One potential downside, in our
case, is the requirement for the Erlang-based RabbitMQ for message
queuing (using AMQP). Celery itself does not demand RabbitMQ, but in
our experience it worked best for our use cases. It's entirely
possible to replace RabbitMQ with something like ZeroMQ; however,
this could require considerable effort.
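
To make the task-queue idea concrete, here is a rough sketch of the
general pattern (producers put jobs on a broker queue, workers pull
them off and record results). It uses only the Python standard
library and is NOT Celery's API or anything in StarCluster; the names
run_jobs, broker and add are made up for illustration, and an
in-process queue.Queue stands in for a real message broker such as
RabbitMQ.

```python
import queue
import threading


def run_jobs(jobs, num_workers=4):
    """Run (func, args) jobs on worker threads; return outcomes in job order."""
    broker = queue.Queue()   # hypothetical stand-in for a message broker
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = broker.get()
            if item is None:             # sentinel: no more work for this worker
                broker.task_done()
                break
            index, func, args = item
            try:
                outcome = ("ok", func(*args))
            except Exception as exc:     # a failed job must not kill the worker
                outcome = ("error", repr(exc))
            with lock:
                results[index] = outcome
            broker.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for index, (func, args) in enumerate(jobs):
        broker.put((index, func, args))
    for _ in threads:
        broker.put(None)                 # one sentinel per worker
    broker.join()                        # wait until every item is processed
    for t in threads:
        t.join()
    return [results[i] for i in range(len(jobs))]


def add(x, y):
    """Example job used only for illustration."""
    return x + y
```

For example, run_jobs([(add, (1, 2)), (add, (3, 4))]) returns
[("ok", 3), ("ok", 7)]. A real replacement for SGE would also need
the workers to live on separate nodes, plus persistence, retries and
load balancing, which is where a system like Celery comes in.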

If anyone is interested in exploring this further and in greater
detail, with the goal of an implementation, let's talk. Perhaps a new
thread would be appropriate. I certainly think this could be highly
beneficial and would be happy to participate.

Kind Regards

Matthew W. Summers

[1] http://ask.github.com/celery/