[StarCluster] 100 nodes cluster
Paolo Di Tommaso
Paolo.DiTommaso at crg.eu
Fri Oct 28 12:16:03 EDT 2011
The latest (0.92)
Cheers,
Paolo
On Oct 28, 2011, at 6:07 PM, Matthew Summers wrote:
> On Fri, Oct 28, 2011 at 10:44 AM, Paolo Di Tommaso
> <Paolo.DiTommaso at crg.eu> wrote:
>> Hi Gordon,
>> Starting a 100-node cluster takes 30 minutes (and 1 hour with 200 nodes).
>> Using an EBS-backed AMI, the machines' boot time is very short (less than 1
>> minute) and, above all, constant (it does not grow with the number of
>> requested instances).
>> So all of the time is spent configuring the cluster.
>> StarCluster does a lot of tasks automatically (and for this reason I love
>> it!).
>> But by saving the state of a configured cluster, another cluster instance
>> could be deployed by updating only the /etc/hosts files and the SGE queue
>> configuration. This would greatly reduce the total time required to
>> start.
>> Does that make sense?
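[A rough sketch of the per-node "delta" being described here: given the node list of a new cluster, emit the /etc/hosts entries and the SGE host-registration commands. The node names, IPs, and exact qconf invocations are illustrative, not StarCluster's actual code; a real exec host would also need sge_execd installed and started.]

```python
def hosts_entries(nodes):
    """Render /etc/hosts lines from a list of (ip, alias) pairs."""
    return ["%s %s" % (ip, alias) for ip, alias in nodes]

def sge_add_host_cmds(aliases):
    """SGE qconf commands that would register each new node with the master."""
    cmds = []
    for alias in aliases:
        cmds.append("qconf -ah %s" % alias)   # add as administrative host
        cmds.append("qconf -as %s" % alias)   # add as submit host
        # append the node to the @allhosts hostgroup used by the default queue
        cmds.append("qconf -aattr hostgroup hostlist %s @allhosts" % alias)
    return cmds

nodes = [("10.0.0.11", "node001"), ("10.0.0.12", "node002")]
print("\n".join(hosts_entries(nodes)))
print("\n".join(sge_add_host_cmds([alias for _, alias in nodes])))
```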
>>
>> Cheers,
>> Paolo
>>
>>
>>
>> On Oct 28, 2011, at 4:24 PM, Mark Gordon wrote:
>>
>> Hi Paolo:
>>
>> I wonder, what percentage of the launch time do you think is spent
>> configuring the nodes?
>>
>> cheers,
>> Mark
>>
>>
>> On Fri, Oct 28, 2011 at 4:57 AM, Paolo Di Tommaso <Paolo.DiTommaso at crg.eu>
>> wrote:
>>>
>>> Dear All,
>>>
>>> I'm still struggling with this problem of large clusters that take so
>>> long to launch.
>>>
>>> I think some improvements are possible with better multithread
>>> handling, but I'm not a Python guru, so I can't comment on that in detail.
>>>
>>> Anyway, I'm looking for a more "radical" approach. My idea is to launch a
>>> 2-node cluster, save the master and slave nodes as two separate AMIs, and use
>>> these to deploy a cluster of any size without having to install and
>>> configure everything from scratch (NFS, SGE, passwordless access, etc.),
>>> modifying only what has changed.
>>>
>>>
>>> So my question is: what are the "deltas" in the configuration files
>>> between two cluster instances of X and Y nodes?
>>>
>>> Knowing this, it would be quite easy to write a StarCluster plugin that
>>> applies only these changes, achieving a much faster launch time.
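[For reference, StarCluster plugins are wired in through the config file; a plugin implementing the delta approach would be declared along these lines. The section and class names here are hypothetical placeholders.]

```ini
[plugin deltaconfig]
setup_class = deltaconfig.DeltaConfigPlugin

[cluster largecluster]
cluster_size = 100
plugins = deltaconfig
```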
>>>
>>>
>>> Thank you,
>>>
>>> Paolo Di Tommaso
>>> Software Engineer
>>> Comparative Bioinformatics Group
>>> Centre de Regulacio Genomica (CRG)
>>> Dr. Aiguader, 88
>>> 08003 Barcelona, Spain
>>>
>>>
>>>
>>>
>>> On Oct 20, 2011, at 9:48 PM, Rayson Ho wrote:
>>>
>>>> ----- Original Message -----
>>>>> However, if one can wrap around the real ssh with a fake ssh script
>>>>> that sleeps 30 seconds and then runs the real ssh, then we can see how
>>>>> good (or bad) the Workerpool handles long latency commands - and we
>>>>> will start from there to optimize the launch performance.
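[The fake-ssh wrapper described in the quote above would look roughly like this: a script placed earlier on PATH that sleeps before exec'ing the real ssh, simulating high latency. The /tmp path and the 30-second delay are illustrative.]

```shell
mkdir -p /tmp/fake-ssh-bin
cat > /tmp/fake-ssh-bin/ssh <<'EOF'
#!/bin/sh
sleep 30                    # simulated network latency
exec /usr/bin/ssh "$@"      # then hand off to the real ssh
EOF
chmod +x /tmp/fake-ssh-bin/ssh
export PATH=/tmp/fake-ssh-bin:$PATH
```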
>>>>
>>>> Replying to myself - after quickly reading the code...
>>>>
>>>> StarCluster uses Paramiko instead of executing ssh, so wrapping around a
>>>> long latency ssh script won't work.
>>>>
>>>> And there are quite a lot of discussions about issues with multithreaded
>>>> programs that call Paramiko -- just google: Paramiko+multithreading
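[The usual advice in those discussions is to give each thread its own client object rather than sharing one across threads. A minimal sketch of that pattern, with a stub class standing in for paramiko.SSHClient so it runs without a network; real code would call connect() and exec_command() on a per-thread client.]

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class StubSSHClient:
    """Stand-in for paramiko.SSHClient; illustrates one-client-per-task."""
    def __init__(self, host):
        self.host = host

    def run(self, cmd):
        # A real client would open a transport and exec the command remotely.
        return "%s: ran %r in %s" % (self.host, cmd, threading.current_thread().name)

def configure(host):
    # Each task builds its own client; sharing one SSH client across
    # threads is where many of the reported Paramiko issues come from.
    client = StubSSHClient(host)
    return client.run("hostname")

hosts = ["node%03d" % i for i in range(1, 6)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(configure, hosts))
for line in results:
    print(line)
```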
>>>>
>>>>
>>>> Rayson
>>>>
>>>> =================================
>>>> Grid Engine / Open Grid Scheduler
>>>> http://gridscheduler.sourceforge.net
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>> Mark Gordon
>>
>> Systems Analyst
>> Department of Physics
>> University of Alberta
>>
>> This communication is intended for the use of the recipient to which it is
>> addressed and may contain confidential, personal and/or privileged
>> information. Please contact us immediately if you are not the intended
>> recipient of this communication. If you are not the intended recipient of
>> this communication do not copy, distribute or take action on it. Any
>> communication received in error, or subsequent reply, should be deleted or
>> destroyed.
>>
>>
>>
>>
>
> What version of starcluster are you using, Paolo?
>
> --
> Matthew W. Summers
> Gentoo Foundation Inc.