I can frequently reproduce an issue where 'starcluster start' completes without error but not all nodes are added to the SGE pool; I verify this by running 'qconf -sel' on the master. In the latest example I created a 25-node cluster, and only the first 12 nodes were installed successfully. The remaining instances are running and I can ssh to them, but they aren't running sge_execd. /opt/sge6/default/common/install_logs contains install log files for the first 12 nodes only. I have not found any clues in the starcluster debug log or in the logs under master:/opt/sge6/. <br>
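<br>For anyone who wants to run the same check, here is a rough sketch of how the 'qconf -sel' comparison could be scripted on the master. The function names are made up for illustration, and the host naming (master, node001, node002, ...) is an assumption about how the nodes are named, so adjust the expected list to match your cluster.

```python
import subprocess


def expected_hosts(cluster_size):
    # Assumption: hosts are named master, node001, node002, ...
    # Replace this with your own naming scheme if it differs.
    return ["master"] + ["node%03d" % i for i in range(1, cluster_size)]


def missing_exec_hosts(qconf_output, expected):
    # 'qconf -sel' prints one execution host per line; compare on the
    # short hostname in case SGE reports fully qualified names.
    registered = set()
    for line in qconf_output.splitlines():
        line = line.strip()
        if line:
            registered.add(line.split(".")[0])
    return [host for host in expected if host not in registered]


def check_cluster(cluster_size):
    # Must be run on the master, where qconf is on the PATH.
    output = subprocess.check_output(["qconf", "-sel"]).decode()
    return missing_exec_hosts(output, expected_hosts(cluster_size))
```

Calling check_cluster(25) on the master should return the hosts that never registered with SGE, under the naming assumption above.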
<br>I am running starcluster development snapshot 8ef48a3, downloaded on 2011-02-15, with the following relevant settings:<br><br>NODE_IMAGE_ID = ami-8cf913e5<br>NODE_INSTANCE_TYPE = m1.small<br><br>I have seen this behavior with the latest 32-bit and 64-bit starcluster AMIs. Our workaround is to start a small cluster and then add nodes one at a time, which is time-consuming. <br>
<br>Has anyone else noticed this, and does anyone have a better workaround or an idea for a fix?<br><br>jeff<br><br>