[StarCluster] trouble with starting a large cluster

Chen, Fei [JRDUS] FChen6 at its.jnj.com
Fri Sep 2 11:25:55 EDT 2011


Hi Rayson,

Thanks for getting back to me so quickly. I had shut down the 30-node
grid, so I couldn't check in the log whether ec2-50-19-64-123 was stuck,
but I did try at the time to ssh into it without success, so presumably
it did fail to boot up.
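
For reference, the same kind of check (can a node be reached on port 22?)
can be scripted across all nodes instead of trying ssh by hand. This is
only a minimal sketch using the Python standard library, not anything
StarCluster itself provides; the function names port_open and
unreachable, and the 5-second timeout, are assumptions chosen for
illustration.

```python
import socket


def port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests that something accepts connections
    on the port, not that sshd is healthy or that login would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def unreachable(hosts, port=22, timeout=5.0):
    """Return the subset of hosts that do not accept connections on port."""
    return [h for h in hosts if not port_open(h, port, timeout)]
```

Calling unreachable() with the public DNS names of the instances (for
example ec2-50-19-64-123.compute-1.amazonaws.com) would list the nodes
worth rebooting or inspecting in the AWS console.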

Yes, I was aware of the new addnode feature in 0.92, among other things.
I guess it's really time for me to upgrade!

Lastly, is it the case that one node failing to come up would prevent
the entire cluster from booting up? I was hoping that wouldn't be the
case, but yesterday when I started the 30-node cluster, I noticed that
SGE was not up and running even on nodes that came up properly...

Thanks again,

fei

-----Original Message-----
From: Rayson Ho [mailto:raysonlogin at yahoo.com] 
Sent: Thursday, September 01, 2011 5:04 PM
To: starcluster at mit.edu; Chen, Fei [JRDUS]
Subject: Re: [StarCluster] trouble with starting a large cluster

--- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6 at its.jnj.com> wrote:
> #cli.py:1079 - ERROR - failed to connect to host
> ec2-50-19-64-123.compute-1.amazonaws.com on port 22
> 
> Looking at the AWS console, I could see all 30 instances
> were up and running. I even checked a few boot logs (e.g.
> right click on an instance and choose the "Get System Log"
> menu item), which all looked OK to me, though granted I didn't
> check all 30 logs...

Can you check if "ec2-50-19-64-123" is stuck?

I believe once in a while a VM on EC2 fails to start up, but rebooting
the machine works around the issue. (This may be hardware related or a
bug in the EC2 provisioning layer.)

http://mailman.mit.edu/pipermail/starcluster/2011-April/000703.html

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

> maybe there is one instance having
> trouble starting, as the above message suggests... I'm
> guessing this could simply be a timeout issue, but I don't
> know if/where there's a place I can change this. Does
> StarCluster skip any instances that fail to come up?
> 
> And I'm using 0.91.2. I was hoping not to have to upgrade
> (yet) as I need results fast and don't want to risk
> breaking something during the upgrade. AWS gave me capacity
> to run 400 instances, so I'm hoping this is an easily solved
> problem and I would be able to use that capacity...
> 
> Appreciate any help!
> 
> fei
> 
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
> 
