<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

I tried to run an SGE job array using StarCluster on a cluster of 8&nbsp;c3.4xlarge instances. &nbsp;The program I was trying to run is a Perl program called fragScaff.pl. &nbsp;You start it once, and it creates a job array batch script, which it then submits with a qsub command.

 &nbsp;The original process keeps running in the background, continuously monitoring the spawned jobs for completion. &nbsp;The job array batch script (run_array.csh) looks like this:

<div class="">

<div class=""><br class="">

</div>

<div class="">#/bin/bash</div>

<div class="">#$ -S /bin/bash</div>

<div class="">#$ -cwd</div>

<div class="">#$ -V</div>

<div class="">COMMAND=$(head -n $SGE_TASK_ID ./join_default_params.r1.fragScaff/job_array.txt | tail -n 1)</div>

</div>

<div class="">$COMMAND</div>

<div class=""><br class="">

</div>

<div class="">and the job_array.txt file looks like this:</div>

<div class=""><br class="">

</div>

<div class="">perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,0,99 -r 1 -M 200<br class="">

perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,100,199 -r 1 -M 200<br class="">

perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,200,299 -r 1 -M 200<br class="">

perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,300,399 -r 1 -M 200<br class="">

.</div>

<div class="">.</div>

<div class="">.</div>

<div class="">perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,697800,697848&nbsp;-r 1 -M 200</div>

<div class=""><br class="">

</div>

<div class="">The qsub command looks like this:</div>

<div class=""><br class="">

</div>

<div class="">qsub -t 1-6780 -N FSCF_nnnnnnnn -b y -l h_vmem=20G,virtual_free=20G ./join_default_params.r1.fragScaff/run_array.csh</div>

<div class=""><br class="">

</div>

<div class="">The first problem I had was that once the job array threads started spawning the original thread, running on the master node, would be killed, apparently by the kernel. &nbsp;I figured maybe there was some memory resource constraint, so I edited the

 Starcluster config file so that the master node would not be an execution host, following the instructions here:</div>

<div class=""><br class="">

</div>

<div class=""><a href="http://star.mit.edu/cluster/docs/0.93.3/plugins/sge.html" class="">http://star.mit.edu/cluster/docs/0.93.3/plugins/sge.html</a></div>

<div class=""><br class="">

</div>

<div class="">Then I re-launched the cluster and tried again. &nbsp;This time no job array jobs ran on the master node, and the original process was not killed. &nbsp;However after a fraction of the subprocesses (maybe several hundred) had spawned and completed the job

 array stalled out. &nbsp;All remaining jobs ended up in qw state, and a qhost command showed all the nodes idle.</div>

<div class=""><br class="">

</div>

<div class="">Any ideas what might have happened and/or how I might diagnose the problem further?</div>

</body>

</html>