[StarCluster] Restarting bad nodes

Suchindra Sandhu suchindra at gmail.com
Mon Jun 24 03:36:54 EDT 2013


Over the last week, I have frequently run into errors while using
the addnode command to add nodes to an existing cluster. Most of the
time it works out of the box, but sometimes, due to what I am guessing
are EC2 stability issues, I get NFS- or SSH-related errors.
Unfortunately, since the errors happen at the configuration stage,
they leave some nodes just hanging.
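
For concreteness, the failures happen with the standard addnode
invocation, something along these lines (the cluster name and node
count here are placeholders):

    $ starcluster addnode -n 2 mycluster

The instances themselves come up; it is the configuration step
afterwards (NFS exports, SSH setup) that occasionally fails.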

Is there any way to restart just the bad nodes in a cluster? While I
can manually terminate them from the AWS console and restart the
cluster, that takes a lot of time for relatively large cluster
sizes. Also, I sometimes do not want to interrupt the computation on
the existing nodes, so restarting all of them does not seem like a
great option.
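
What I am hoping for is something along these lines, so that only the
broken node is touched (cluster and node names are placeholders, and
I am not sure of the exact removenode syntax in the current release):

    $ starcluster removenode mycluster node003   # terminate only the hung node
    $ starcluster addnode -a node003 mycluster   # bring up a fresh replacement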

I would appreciate suggestions/tricks/workarounds to deal with this
issue.


Thanks!

Suchindra

