[StarCluster] starcluster starts but not all nodes added as exec nodes

Jeff White jeff at decide.com
Sat Mar 5 17:15:41 EST 2011


I can frequently reproduce an issue where 'starcluster start' completes
without error, but not all nodes are added to the SGE pool, which I verify
by running 'qconf -sel' on the master. The latest example I have is creating
a 25-node cluster, where only the first 12 nodes are successfully installed.
The remaining instances are running and I can ssh to them but they aren't
running sge_execd. There are only install log files for the first 12 nodes
in /opt/sge6/default/common/install_logs. I have not found any clues in the
starcluster debug log or the logs inside master:/opt/sge6/.
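
For anyone trying to reproduce this, the check can be scripted roughly as follows. This is only a sketch: it assumes the nodes are named master, node001, node002, ... and that the master itself is registered as an exec host, so adjust both to your cluster. The helper names are made up for illustration and are not part of starcluster.

```python
import subprocess


def expected_hosts(cluster_size):
    """Assumed naming scheme: 'master' plus node001..nodeNNN."""
    return ["master"] + ["node%03d" % i for i in range(1, cluster_size)]


def parse_exec_hosts(qconf_output):
    """Return the set of short hostnames found in 'qconf -sel' output."""
    hosts = set()
    for line in qconf_output.splitlines():
        name = line.strip()
        if name:
            # Drop any domain suffix so 'node001.foo' matches 'node001'.
            hosts.add(name.split(".")[0])
    return hosts


def missing_exec_hosts(cluster_size, qconf_output):
    """List the expected hosts that SGE does not know as exec hosts."""
    registered = parse_exec_hosts(qconf_output)
    return [h for h in expected_hosts(cluster_size) if h not in registered]


def report_missing(cluster_size):
    """Run 'qconf -sel' on the master and print the missing hosts."""
    output = subprocess.check_output(["qconf", "-sel"]).decode()
    for host in missing_exec_hosts(cluster_size, output):
        print(host)
```

Calling report_missing(25) on the master would print the names of the instances that never made it into the exec host list, if the naming assumption holds.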

I am running starcluster development snapshot 8ef48a3 downloaded on
2011-02-15, with the following relevant settings:

NODE_IMAGE_ID = ami-8cf913e5
NODE_INSTANCE_TYPE = m1.small

I have seen this behavior with the latest 32-bit and 64-bit starcluster
AMIs. Our workaround is to start a small cluster and progressively add nodes
one at a time, which is time-consuming.

Has anyone else noticed this, and does anyone have a better workaround or an
idea for a fix?

jeff