[StarCluster] trouble with starting a large cluster
Rayson Ho
raysonlogin at yahoo.com
Tue Sep 6 13:19:45 EDT 2011
--- On Fri, 9/2/11, Chen, Fei [JRDUS] <FChen6 at its.jnj.com> wrote:
> Thanks for getting back to me so quickly.
No problem -- I was looking for a reason to dig into the starcluster code... and this problem seems to be interesting enough to spend the time.
(I was looking for a solution to run Open Grid Scheduler on EC2... instead of rolling my own, I found that starcluster is one of the best existing solutions out there! IMO, starcluster is very mature in terms of user features, but might need a few more high-availability & fault-tolerance features to scale to hundreds of instances, and I am hoping to contribute a patch or two in this area -- however I will need to brush up my Python coding a lot more :-D )
> Yes I was aware of the new addnode feature in 0.92, among
> other things, guess it's really time for me to upgrade!
Also, in 0.92 (rc2), instead of SSHing into each node serially, a producer-consumer thread pool is used, so starcluster should be able to start large clusters faster as well.
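To illustrate the general idea only (this is not the starcluster code, and it uses Python's standard concurrent.futures rather than whatever 0.92 does internally), here is a minimal sketch of handing per-node work to a pool of worker threads. setup_node() and setup_all() are hypothetical names, and the "work" is a placeholder for the real SSH steps:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def setup_node(alias):
    """Hypothetical stand-in for the per-node work done over SSH."""
    return "%s: configured" % alias


def setup_all(aliases, workers=10):
    """Queue one setup job per node and let a bounded pool of threads consume them."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Producer side: submit every node as a job.
        futures = {pool.submit(setup_node, alias): alias for alias in aliases}
        # Consumer side: collect results in whatever order the workers finish.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    nodes = ["node%03d" % i for i in range(1, 31)]
    finished = setup_all(nodes)
    print(len(finished), "nodes configured")
```

With a serial loop the total time grows with the number of nodes; with a pool, it is roughly bounded by the slowest nodes divided across the workers.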
I believe it should be possible to install 2 or more starcluster versions on the same local machine - in the end, starcluster is not setting up the local machine. This way, it is much safer to test a new version while having the working version intact.
> Lastly, is it the case that one node failing to come up
> would prevent the entire cluster from booting up?
I think we still need Justin to provide the most correct answer...
(Warning: I've only spent a full day reading various parts of starcluster.)
From my understanding of the code, the setup process DefaultClusterSetup needs to ssh into each node to set something up (for example _setup_sge() needs to install SGE by running "./inst_sge -m -x -auto" on each node). If the ssh connection fails by throwing the SSHConnectionError exception, then I think it could interrupt the setup process of that node - but not to the point that the whole starcluster would become unusable.
I am not sure if the SSHConnectionError exception is handled (I've never used Python exception handling before...), but I think we should catch the SSHConnectionError exception, add all the failed nodes to a list for retries after some pre-set time, and then run reboot_instances() to restart all stuck nodes & redo the setup process for those nodes.
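Roughly what I have in mind, as a sketch only: the SSHConnectionError class below is a local stand-in (not imported from starcluster), and setup_with_retries(), setup_node and reboot_nodes are hypothetical names. The caller would pass in the real per-node setup and something that wraps reboot_instances():

```python
import time


class SSHConnectionError(Exception):
    """Local stand-in for the exception named above; not the starcluster class."""


def setup_with_retries(nodes, setup_node, reboot_nodes, max_rounds=3, wait_seconds=0):
    """Set up each node; reboot and retry the ones whose SSH connection failed.

    Returns the list of nodes that still failed after max_rounds attempts.
    """
    pending = list(nodes)
    for round_no in range(max_rounds):
        failed = []
        for node in pending:
            try:
                setup_node(node)
            except SSHConnectionError:
                # Remember the node instead of aborting the whole setup.
                failed.append(node)
        if not failed:
            return []
        if round_no < max_rounds - 1:
            # Restart the stuck nodes and give them some time to come back.
            reboot_nodes(failed)
            time.sleep(wait_seconds)
        pending = failed
    return pending


if __name__ == "__main__":
    seen = {}

    def flaky_setup(node):
        # Simulate one node that refuses SSH on the first attempt only.
        seen[node] = seen.get(node, 0) + 1
        if node == "node002" and seen[node] == 1:
            raise SSHConnectionError(node)

    rebooted = []
    left = setup_with_retries(["node001", "node002"], flaky_setup, rebooted.extend)
    print("rebooted:", rebooted, "still failing:", left)
```

The nodes that came up fine are set up only once; only the failed ones go through the reboot and retry rounds.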
> I was hoping that wouldn't be the
> case, but yesterday when I started with the 30 node
> cluster, I noticed
> SGE was not up and running even on nodes that came up
> properly...
If it happens again, can you at least check:
1) the NFS mounts
2) whether the sge daemons are running (ps -elf|grep sge)
3) if qstat and qhost work?
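
If it helps, the three checks can be wrapped in a small script to run on each node. This is only a sketch under some assumptions: it assumes a Linux node (it reads /proc/mounts), that the SGE daemon process names start with "sge_", and that qstat and qhost are on the PATH. The helper names are made up for this example:

```python
import os
import subprocess
import sys


def command_ok(argv):
    """Return True if the command can be started and exits with status 0."""
    try:
        proc = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        # Command not found or not executable.
        return False
    return proc.returncode == 0


def nfs_mounts(mounts_file="/proc/mounts"):
    """Return the mount points whose filesystem type starts with 'nfs'."""
    try:
        with open(mounts_file) as handle:
            fields = [line.split() for line in handle]
    except OSError:
        return []
    return [f[1] for f in fields if len(f) >= 3 and f[2].startswith("nfs")]


def sge_daemons():
    """Return the names of running processes that look like SGE daemons."""
    try:
        proc = subprocess.run(["ps", "-e", "-o", "comm="],
                              stdout=subprocess.PIPE, universal_newlines=True)
    except OSError:
        return []
    names = {os.path.basename(line.strip()) for line in proc.stdout.splitlines()}
    return sorted(name for name in names if name.startswith("sge_"))


if __name__ == "__main__":
    checks = {
        "NFS mounts present": bool(nfs_mounts()),
        "SGE daemons running": bool(sge_daemons()),
        "qstat works": command_ok(["qstat"]),
        "qhost works": command_ok(["qhost"]),
    }
    for name, passed in checks.items():
        print("%-20s %s" % (name, "OK" if passed else "FAILED"))
    sys.exit(0 if all(checks.values()) else 1)
```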
Rayson
=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net
>
> Thanks again,
>
> fei
>
> -----Original Message-----
> From: Rayson Ho [mailto:raysonlogin at yahoo.com]
>
> Sent: Thursday, September 01, 2011 5:04 PM
> To: starcluster at mit.edu;
> Chen, Fei [JRDUS]
> Subject: Re: [StarCluster] trouble with starting a large
> cluster
>
> --- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6 at its.jnj.com>
> wrote:
> > #cli.py:1079 - ERROR - failed to connect to host
> > ec2-50-19-64-123.compute-1.amazonaws.com on port 22
> >
> > Looking at the AWS console, I could see all 30 instances
> > were up and running. I even checked a few boot logs (e.g.
> > right click on an instance and choose the "Get System Log"
> > menu item), which all looked OK to me, granted I didn't
> > check all 30 logs...,
>
> Can you check if "ec2-50-19-64-123" is stuck??
>
> I believe once in a while, a VM on EC2 fails to startup...
> But rebooting
> the machine would work-around the issue. (May be hardware
> related or a
> bug in the EC2 provisioning layer.)
>
> http://mailman.mit.edu/pipermail/starcluster/2011-April/000703.html
>
> Rayson
>
> =================================
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net
>
> > maybe there is one instance having
> > trouble starting, like the above message suggesting... I'm
> > guessing this could be simply a timing-out issue but I don't
> > know if/where there's a place I can change this. Does
> > StarCluster skip any instances that fail to come up?
> >
> > And I'm using 0.91.2. I was hoping not to have to upgrade
> > (yet) as I'm needing results fast and don't want to risk
> > breaking something during the upgrade. AWS gave me capacity
> > to run 400 instances, so I'm hoping this is an easily solved
> > problem and I would be able to use that capacity...
> >
> > Appreciate any help!
> >
> > fei
> >
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster at mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >
>
>