[StarCluster] Connection lost among nodes

Fri Mar 27 10:49:36 EDT 2015

I am having trouble scaling up a cluster to handle multiple samples.  A
test with one 1 TB attached EBS drive and 30 nodes worked great.  Moving to
10 EBS drives and 300 nodes crashed at about 70-80 nodes.  There are no
errors in the STDOUT or STDERR output of the jobs.  Looking in the node
messages (/opt/sge6/spool/...), most nodes show a commlib error ("got read
error"), followed by the job failing and causing a shepherd error.  The
cluster is set up on a private subnet and, as I mentioned, I have 10 EBS
drives attached, which may be too many.  I was also pushing the nodes with
intense work, with an nload_average of 3-4.  My guess is that the larger
cluster increases the load and that this causes the issue.  Does anyone know
if that could be the problem, or have any other ideas?
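
To see how widespread the commlib errors are, the per-node messages files
can be tallied.  What follows is only a minimal sketch, assuming Python 3
is available on the master; the function name count_commlib_errors and the
idea of passing the messages file paths on the command line are
illustrative choices, not part of StarCluster or SGE, and the search string
is simply the text quoted above.

```python
import sys
from collections import Counter
from pathlib import Path


def count_commlib_errors(paths, needle="commlib error"):
    """Return a Counter mapping each file path to the number of lines
    that contain `needle` (plain substring match, no log parsing)."""
    counts = Counter()
    for path in paths:
        p = Path(path)
        # errors="replace" keeps the scan going if a log has odd bytes
        with p.open(errors="replace") as handle:
            counts[str(p)] = sum(1 for line in handle if needle in line)
    return counts


if __name__ == "__main__":
    # Paths to the node messages files are supplied by the caller.
    for path, hits in count_commlib_errors(sys.argv[1:]).most_common():
        print(hits, path)
```

Run with the messages files as arguments, it prints one line per file,
highest count first, which makes it easier to tell whether the errors are
spread evenly or concentrated on a few nodes.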

Thanks,
  Nick