[StarCluster] Large cluster (125 nodes) launch failure

Justin Riley jtriley at MIT.EDU
Tue Apr 19 00:57:53 EDT 2011


Hi Adam,

On Mar 23, 2011, at 4:40 PM, Adam adamnkraut at gmail.com wrote:

> 1) EC2 instances occasionally won't come up with ssh *ever*. In that case you have to reboot the instance and it should work. This could be something like 1 in 100 instances or just an anomaly but it's worth noting. The workaround I used was to manually verify all nodes are running and port 22 is open then run starcluster start -x.


In case I didn't address this previously: the latest development code adds a --show-ssh-status option to the 'listclusters' command. It shows whether each node has a running SSH daemon (SSH: Up or Down). If a node's SSH never comes up, please first try the 'restart' command, which reboots all instances and waits for SSH to come up again:

$ starcluster restart mycluster

Alternatively, you could use the new --show-ssh-status option to 'listclusters' to single out a faulty instance and restart only that one. However, this would need to be done outside of StarCluster, using the Amazon web console (http://aws.amazon.com/console).
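
If you'd rather check port 22 on each node by hand, as in the workaround you describe, something like the following might do. This is only a rough sketch and not part of StarCluster: ssh_port_open and find_down_nodes are names made up for illustration, and you would need to supply the list of instance host names yourself. Note that it only tests whether the port accepts a TCP connection, which does not guarantee a fully working SSH daemon.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper, not a StarCluster function.
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def find_down_nodes(hosts, port=22, timeout=5.0):
    """Return the hosts (in input order) whose port is not accepting connections."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]


# Usage (the host list is a placeholder you fill in):
#   for host in find_down_nodes(my_hosts):
#       print("SSH down:", host)
```

Any host it reports as down is a candidate for a reboot from the web console.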

HTH,

~Justin


