<div dir="ltr">Hi all, <div><br></div><div>This might be more of a SGE issue than Starcluster issue but I'd really appreciate any comments. <br><br></div><div>I have a bunch of jobs running on AWS spot instances using starcluster. <b>Most of them would stuck in "t state" <u>for hours</u> and then finally execute (in the r state). </b>For instance, 50% of the jobs now that are not in qw are in "t state".<br><br>The same program/script/AMI have been used frequently and this is the worse ever. The only difference is the jobs this time are processing bigger files (~6G each, 90 of them) located on a NFS shared gp2 volume. Jobs were divided into tasks to ensure that only 4-5 jobs are processing the same file at once. The memory were not even close to be overloaded (only used 5G out of 240G each node). The long stuck in "t state" is wasting money and CPU hours. </div><div><br></div><div>Have any of you seen this issue before? Is there anyway I can fix / work around this issue? </div><div> </div><div>Thanks a lot, </div><div>Sonia</div><div> </div><div><br></div><div><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse">Ying S. Ting</span><div><span style="border-collapse:collapse"></span><font face="arial, sans-serif">Ph.D. Candidate, MacCoss Lab</font><br><div><div><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse">Department of Genome Sciences, University of Washington</span></div><div><span style="font-family:arial,sans-serif;font-size:13px;border-collapse:collapse"><br></span></div></div></div></div></div></div>