[StarCluster] trouble with starting a large cluster

Thu Sep 1 17:04:27 EDT 2011

--- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6 at its.jnj.com> wrote:
> #cli.py:1079 - ERROR - failed to connect to host
> ec2-50-19-64-123.compute-1.amazonaws.com on port 22
> 
> Looking at the AWS console, I could see all 30 instances
> were up and running. I even checked a few boot logs (e.g.
> right click on an instance and choose the "Get System Log"
> menu item), which all looked OK to me, granted I didn't
> check all 30 logs...

Can you check whether "ec2-50-19-64-123" is stuck?

I believe that once in a while a VM on EC2 fails to start up, but rebooting the machine works around the issue. (It may be hardware related, or a bug in the EC2 provisioning layer.)

http://mailman.mit.edu/pipermail/starcluster/2011-April/000703.html
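If it helps, here is a rough way to find which of the 30 nodes are not answering on port 22 before deciding what to reboot. This is only a sketch using the Python standard library, not anything StarCluster does internally: the function names, retry count, and delays are my own assumptions, and the host list would have to be filled in from the AWS console.

```python
import socket
import time


def wait_for_ssh(host, port=22, attempts=5, delay=10.0, timeout=5.0):
    """Return True as soon as a TCP connection to host:port succeeds.

    Retries up to `attempts` times, sleeping `delay` seconds between
    tries. All defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            # A plain TCP connect is enough to tell whether sshd is listening.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


def find_stuck_hosts(hosts, **kwargs):
    """Return the hosts that never accepted a connection."""
    return [h for h in hosts if not wait_for_ssh(h, **kwargs)]


if __name__ == "__main__":
    # Hostname taken from the error message above; add the other nodes here.
    nodes = ["ec2-50-19-64-123.compute-1.amazonaws.com"]
    for stuck in find_stuck_hosts(nodes, attempts=3, delay=5.0):
        print("no answer on port 22:", stuck)
```

Any host this reports would be a candidate for a reboot from the AWS console.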

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

> maybe there is one instance having
> trouble starting, like the above message suggesting... I'm
> guessing this could be simply a timing-out issue but I don't
> know if/where there's a place I can change this. Does
> StarCluster skip any instances that fail to come up?
> 
> And I'm using 0.91.2. I was hoping not to have to upgrade
> (yet) as I need results fast and don't want to risk
> breaking something during the upgrade. AWS gave me capacity
> to run 400 instances, so I'm hoping this is an easily solved
> problem and I would be able to use that capacity...
> 
> Appreciate any help!
> 
> fei