[StarCluster] Many jobs stuck in "t state"

Jennifer Staab jstaab at cs.unc.edu
Wed Feb 25 13:33:16 EST 2015


Since your jobs get stuck in the transfer state and the primary 
difference is the size of the files being processed, my initial guess 
would be an I/O bottleneck caused by reading the large files.  That is 
likely far more than NFS can handle efficiently, so you should approach 
things in a different way.  If you run top 
<http://linux.about.com/od/commands/l/blcmdl1_top.htm> or a similar 
command on your master and worker nodes while jobs are being submitted, 
you can see which processes are running and which are using the largest 
share of the resources, and you might be able to pinpoint the source of 
the bottleneck easily.
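
In top, time spent waiting on disk or NFS shows up as the "wa" value on 
the CPU line. As a rough sketch (not something from the original 
thread), the Python below computes the same iowait percentage from two 
samples of /proc/stat on a Linux node. The function names and the 
5-second default interval are assumptions chosen for illustration.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples.

    Counter order on Linux: user, nice, system, idle, iowait, irq,
    softirq, steal, guest, guest_nice. Guest time is already included
    in user/nice, so only the first eight counters are summed.
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def sample_iowait(interval=5.0):
    """Take two samples `interval` seconds apart and return the iowait percentage."""
    before = read_cpu_sample()
    time.sleep(interval)
    after = read_cpu_sample()
    return iowait_percent(before, after)
```

Calling sample_iowait() on the master and on a worker while jobs sit in 
the t state gives one number per node to compare. A persistently high 
value would be consistent with an I/O bottleneck, though on its own it 
does not prove that NFS is the cause.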

Good Luck,
-Jennifer

On 2/24/15 3:08 PM, Ying Sonia Ting wrote:
> Hi all,
>
> This might be more of an SGE issue than a StarCluster issue but I'd 
> really appreciate any comments.
>
> I have a bunch of jobs running on AWS spot instances using 
> StarCluster. *Most of them get stuck in "t state" _for hours_ and 
> then finally execute (in the r state).* For instance, 50% of the jobs 
> now that are not in qw are in "t state".
>
> The same program/script/AMI have been used frequently and this is the 
> worst it has ever been. The only difference is that the jobs this time 
> are processing bigger files (~6G each, 90 of them) located on an 
> NFS-shared gp2 volume. Jobs were divided into tasks to ensure that 
> only 4-5 jobs are processing the same file at once. The memory was not 
> even close to being overloaded (only 5G used out of 240G on each 
> node). The long time stuck in "t state" is wasting money and CPU hours.
>
> Have any of you seen this issue before? Is there any way I can fix / 
> work around this issue?
> Thanks a lot,
> Sonia
>
>
> -- 
> Ying S. Ting
> Ph.D. Candidate, MacCoss Lab
> Department of Genome Sciences, University of Washington
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
