[StarCluster] 100 nodes cluster

Paolo Di Tommaso Paolo.DiTommaso at crg.eu
Fri Oct 28 11:44:40 EDT 2011


Hi Gordon,

Starting a 100-node cluster takes 30 minutes (and 1 hour with 200). With an EBS-backed AMI the machines' boot time is very short, less than 1 minute, and above all constant (it does not grow with the number of requested instances).

So all the time is spent configuring the cluster.

StarCluster does a lot of tasks automatically (and for this reason I love it!).

But if the state of a configured cluster were saved, another cluster instance could be deployed by updating only the /etc/hosts files and the SGE queue configuration. This would greatly reduce the total time required to start.
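
A minimal sketch of the /etc/hosts part of this idea. The function name, the node aliases and the IP addresses are made up for illustration; this is not StarCluster's own code, only a way to show that the file can be regenerated from a list of nodes.

```python
def render_hosts(nodes):
    """Build /etc/hosts content from (alias, private_ip) pairs."""
    lines = ["127.0.0.1 localhost"]
    for alias, ip in nodes:
        # One line per node: address first, then the name it resolves to.
        lines.append("%s %s" % (ip, alias))
    return "\n".join(lines) + "\n"


# Hypothetical 3-node cluster; aliases and addresses are illustrative only.
cluster = [
    ("master", "10.0.0.1"),
    ("node001", "10.0.0.2"),
    ("node002", "10.0.0.3"),
]
hosts_text = render_hosts(cluster)
print(hosts_text)
```

The same text would have to be written to every node, so the cost of this step should depend mostly on how the copies are distributed, not on generating the file.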

Does that make sense?


Cheers,
Paolo




On Oct 28, 2011, at 4:24 PM, Mark Gordon wrote:

Hi Paolo:

I wonder, what percentage of the launch time do you think is spent configuring the nodes?

cheers,
Mark


On Fri, Oct 28, 2011 at 4:57 AM, Paolo Di Tommaso <Paolo.DiTommaso at crg.eu> wrote:
Dear All,

I'm still struggling with this problem: large clusters take a very long time to launch.

I think some improvements are possible with better multithread handling, but I'm not a Python guru, so I cannot say much about that in detail.
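
To illustrate why thread handling matters here, a small sketch that compares serial and pooled execution of a slow per-node step. Everything in it is an assumption for illustration: `configure_node` is a hypothetical stand-in that only sleeps, the 0.05 s latency is arbitrary, and `concurrent.futures` is the modern standard-library pool, not the worker pool StarCluster itself uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def configure_node(alias, latency=0.05):
    """Stand-in for one slow remote setup step (hypothetical)."""
    time.sleep(latency)  # simulated network latency, no real work is done
    return alias


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


aliases = ["node%03d" % i for i in range(1, 9)]

# Serial: total time is roughly the sum of the per-node latencies.
serial_result, serial_secs = timed(lambda: [configure_node(a) for a in aliases])

# Pooled: the sleeps overlap, so total time is close to a single latency.
with ThreadPoolExecutor(max_workers=len(aliases)) as pool:
    parallel_result, parallel_secs = timed(
        lambda: list(pool.map(configure_node, aliases))
    )

print("serial:   %.2fs" % serial_secs)
print("parallel: %.2fs" % parallel_secs)
```

If launch time grows in proportion to the number of nodes, as it seems to, that would be consistent with some steps running closer to the serial case than to the pooled one.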

Anyway, I'm looking for a more "radical" approach. My idea is to launch a 2-node cluster, save the master and slave nodes as two separate AMIs, and use these to deploy a cluster of any size without having to install and configure everything from scratch (NFS, SGE, passwordless access, etc.), modifying only what has changed.


So my question is: what is the "delta" in the configuration files between two cluster instances of X and Y nodes?

Knowing this, it would be quite easy to write a StarCluster plugin that applies only these changes, achieving a much faster launch time.
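
A rough sketch of what computing such a delta could look like. It rests on assumptions that are not confirmed in this thread: that nodes are named "master", "node001", "node002", and so on, and that the SGE side of the delta is covered by the `qconf` calls shown (add admin host, add submit host, add to the `@allhosts` host group). The real list of steps is probably longer, and the helper names are hypothetical. The code only builds command strings; it does not run them.

```python
def node_aliases(size):
    """Aliases for a cluster of `size` machines: one master plus workers.

    The master/nodeNNN naming scheme is an assumption made for this sketch.
    """
    return ["master"] + ["node%03d" % i for i in range(1, size)]


def cluster_delta(old_size, new_size):
    """Return (hosts_to_add, hosts_to_remove) between two cluster sizes."""
    old = set(node_aliases(old_size))
    new = set(node_aliases(new_size))
    return sorted(new - old), sorted(old - new)


def sge_add_commands(hosts):
    """Build (but do not execute) qconf commands registering new hosts."""
    cmds = []
    for host in hosts:
        cmds.append("qconf -ah %s" % host)  # add as administrative host
        cmds.append("qconf -as %s" % host)  # add as submit host
        # add the host to the @allhosts host group
        cmds.append("qconf -aattr hostgroup hostlist %s @allhosts" % host)
    return cmds


# Growing the saved 2-node cluster to 4 nodes.
added, removed = cluster_delta(2, 4)
commands = sge_add_commands(added)
for cmd in commands:
    print(cmd)
```

A plugin built on this idea would run the resulting commands on the master and rewrite /etc/hosts on every node, instead of repeating the full setup.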


Thank you,

Paolo Di Tommaso
Software Engineer
Comparative Bioinformatics Group
Centre de Regulacio Genomica (CRG)
Dr. Aiguader, 88
08003 Barcelona, Spain




On Oct 20, 2011, at 9:48 PM, Rayson Ho wrote:

> ----- Original Message -----
>> However, if one can wrap around the real ssh with a fake ssh script that
>> sleeps 30 seconds and then runs the real ssh, then we can see how good (or
>> bad) the Workerpool handles long latency commands - and we will start from
>> there to optimize the launch performance.
>
> Replying to myself - after quickly reading the code...
>
> StarCluster uses Paramiko instead of executing ssh, so wrapping around a long latency ssh script won't work.
>
> And there are quite a lot of discussions about issues with multithreaded programs that call Paramiko -- just google: Paramiko+multithreading
>
>
> Rayson
>
> =================================
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net/
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster






--

Mark Gordon

Systems Analyst
Department of Physics
University of Alberta


