<div dir="ltr">I am having trouble scaling up a cluster to handle multiple samples. A test with one 1 TB attached EBS drive and 30 nodes worked great. Moving to 10 EBS drives and 300 nodes crashed at about 70-80 nodes. There are no errors in the STDOUT or STDERR output of the jobs. Looking in the node messages (/opt/sge6/spool/...), most nodes show a "commlib error: got read error", followed by the job failing and causing a shepherd error. The cluster is set up on a private subnet, and as I mentioned I have 10 EBS drives attached, which may be too many? I was also pushing the nodes with intense work, with a load average of 3-4. My guess is that the larger cluster increases the load, causing the issue. Does anyone know if that could be the problem, or have any other ideas?<div><br></div><div>Thanks,</div><div> Nick<br clear="all"><div><br></div><br>
</div></div>