[StarCluster] trouble with starting a large cluster

Chen, Fei [JRDUS] FChen6 at its.jnj.com
Thu Sep 1 16:22:21 EDT 2011


Hi all,

First, I just want to express my great appreciation for Justin and all who contribute to this wonderfully useful project. It's really awesome and a game-changer for us. My group was going to purchase hardware to build an internal grid computer; instead, we now use StarCluster at a fraction of what physical hardware would have cost us.

I'm new to this, so please bear with me a bit. I'm having trouble starting a cluster of 30 nodes with the c1.xlarge instance type, and I'm not sure how to track down the problem. I built a custom AMI with some software we need, based on ami-0af31963:starcluster-base-ubuntu-10.04-x86_64-rc1. Everything runs swimmingly with a cluster size of 10, but when I change the number to 30, I get output like this:

>>> Using private key /home/feic/ec2/id_rsa-pstam-keypair (rsa)
>>> Configuring scratch space for user: sgeadmin
ssh.py:245 - ERROR - command mkdir /scratch failed with status 1
ssh.py:245 - ERROR - command ln -s /mnt/sgeadmin /scratch failed with status 1
>>> Configuring /etc/hosts on each node
>>> Configuring NFS...
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for user: sgeadmin
>>> Generating local RSA ssh keys for user: sgeadmin
>>> Installing Sun Grid Engine...
>>> Done Configuring Sun Grid Engine
...
Then, at some point, I get:

#cli.py:1079 - ERROR - failed to connect to host ec2-50-19-64-123.compute-1.amazonaws.com on port 22

Looking at the AWS console, I could see that all 30 instances were up and running. I even checked a few boot logs (right-click an instance and choose the "Get System Log" menu item), and they all looked OK to me, though I didn't check all 30. Maybe one instance is having trouble starting, as the message above suggests. I'm guessing this could simply be a timeout issue, but I don't know if or where I can change that setting. Does StarCluster skip any instances that fail to come up?
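
The "failed to connect ... on port 22" error can be checked independently of StarCluster by polling each node's SSH port until it accepts connections. The following is only a minimal sketch using the Python standard library; the function names port_open and wait_for_ssh, the retry count, and the delays are illustrative assumptions and are not part of StarCluster.

```python
import socket
import time


def port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def wait_for_ssh(hosts, port=22, retries=10, delay=15.0, timeout=5.0):
    """Poll hosts until each accepts connections or retries run out.

    Returns the list of hosts that never became reachable.
    (Hypothetical helper; retry count and delay are arbitrary defaults.)
    """
    pending = list(hosts)
    for attempt in range(retries):
        # Keep only the hosts that are still unreachable on this pass.
        pending = [h for h in pending if not port_open(h, port, timeout)]
        if not pending:
            break
        if attempt < retries - 1:
            time.sleep(delay)
    return pending
```

For example, wait_for_ssh(["ec2-50-19-64-123.compute-1.amazonaws.com"]) would return an empty list once that node's sshd is reachable, or the hostname itself if it never comes up within the retry window. This only tests TCP reachability; it does not verify that login with the cluster key works.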

I'm using StarCluster 0.91.2. I was hoping not to have to upgrade yet, as I need results fast and don't want to risk breaking something during the upgrade. AWS gave me capacity to run 400 instances, so I'm hoping this is an easily solved problem and that I will be able to use that capacity.

Appreciate any help!

fei



