[Starcluster] failed cluster / detached drive

Dan Yamins dyamins at gmail.com
Fri Apr 30 20:20:35 EDT 2010


Justin, I just ran into a strange situation where my cluster suddenly failed.
Here were the symptoms:

1) All my active ssh terminals timed out.
2) I couldn't log back in as the CLUSTER_USER (I got the "Permission denied
(publickey)" error), though I could still ssh in as root.
3) The mounted EBS volume appears to have disappeared: when I tried to cd
to it from /root, it was reported as not existing.
4) The SGE "qstat" command was not recognized: when I ran "qstat -xml" as
root, I got an error saying the qstat command could not be found.

It seems like my EBS drive might have detached, but lots of other things
could have happened. Any thoughts?

Anyway, I killed the cluster as I didn't want to keep paying for it. I'm
starting another one now, and will let you know what the result is. If it
happens again I'll keep the cluster up and let you know right away.
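
If it does happen again, checks along these lines could be run as root on
the master node before tearing anything down, to narrow down symptoms 3 and
4. This is only a minimal sketch, not anything StarCluster provides: the
function names are made up for illustration, and the mount point "/data" in
the usage note is a hypothetical placeholder for wherever the EBS volume is
actually mounted. It assumes a Linux node where /proc/mounts is readable.

```python
import shutil


def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is a mount target in /proc/mounts-style text.

    Each line of /proc/mounts has the form:
        device mount_point fstype options dump pass
    """
    target = mount_point.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == target:
            return True
    return False


def command_on_path(name):
    """Return True if an executable called `name` is found on the current PATH."""
    return shutil.which(name) is not None


def diagnose(mount_point, mounts_text):
    """Collect the two checks into a dict so they can be pasted into a report."""
    return {
        "volume_mounted": is_mounted(mount_point, mounts_text),
        "qstat_on_path": command_on_path("qstat"),
    }
```

Usage, with "/data" standing in for the real mount point:

    with open("/proc/mounts") as f:
        print(diagnose("/data", f.read()))

If "volume_mounted" comes back False, that would be consistent with the
volume having detached or been unmounted, though it would not say why.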

Dan