[StarCluster] sge job array stalling

Marc Crepeau mcrepeau at ucdavis.edu
Thu Jul 14 18:04:09 EDT 2016


Well, I did a little investigation and learned a bit more about what’s going on.  First, I learned that it’s necessary to tell SGE how much memory each node has, and to make memory a “consumable resource”.

I did this using these commands:

qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 master
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node001
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node002
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node003
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node004
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node005
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node006
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node007
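
(Side note: the same change could presumably be scripted in one go with a loop like the one below; the host names and values are just what my c3.4xlarge cluster uses:

for host in master node00{1..7}; do
    qconf -rattr exechost complex_values \
        slots=16,num_proc=16,h_vmem=31497297920 "$host"
done
)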

and by using the qconf -mc command to add the second “YES” (the consumable flag) to the h_vmem line in the complex configuration:

h_vmem              h_vmem     MEMORY      <=    YES         YES        2g       0
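
(For reference, the columns in that line should be, if I'm reading the complex configuration format correctly:

#name     shortcut   type     relop   requestable  consumable  default  urgency
h_vmem    h_vmem     MEMORY   <=      YES          YES         2g       0

so the second “YES” is the “consumable” column.)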

After that, SGE and the kernel stopped killing my processes, so that was good.  I was able to run processes on all the slave nodes plus the master.

But then the job array stalled out again.  I used qstat -j and qacct -j to investigate why.  It turns out that the disk space was full.  That was a surprise to me because the c3.4xlarge instances are supposed to have 160 GB of storage, which should be more than enough.  They do have 160 GB, but on a different partition than the one my job array was using.  This is what I got from a df -h command:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      9.9G  9.4G   12M 100% /
udev             15G  8.0K   15G   1% /dev
tmpfs           5.9G  176K  5.9G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none             15G     0   15G   0% /run/shm
/dev/xvdaa      151G  188M  143G   1% /mnt
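
df also accepts a path argument, which is a quick way to confirm which filesystem a particular directory actually lives on (the sgeadmin paths below are just my layout):

df -h /home/sgeadmin /mnt/sgeadmin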

When I run my program from /home/sgeadmin/ it seems to use the small root (/) partition.  I tried to get it to use the larger /mnt partition by running the program from /mnt/sgeadmin, but then the job array didn’t work: the qsub command was issued, but all the jobs went into qw state and stayed there.  This is what qstat -j reported at that point:

job-array tasks:            1-6979:1
scheduling info:            queue instance "all.q at node003" dropped because it is temporarily not available
                            queue instance "all.q at node007" dropped because it is temporarily not available
                            queue instance "all.q at node001" dropped because it is temporarily not available
                            queue instance "all.q at node005" dropped because it is temporarily not available
                            queue instance "all.q at node002" dropped because it is temporarily not available
                            queue instance "all.q at node006" dropped because it is temporarily not available
                            queue instance "all.q at master" dropped because it is temporarily not available
                            queue instance "all.q at node004" dropped because it is temporarily not available
                            All queues dropped because of overload or full
                            not all array task may be started due to 'max_aj_instances'
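
For what it's worth, I'm planning to poke at those "temporarily not available" messages with something like the following, assuming I'm reading the man pages right:

qstat -f            # full queue listing, including per-queue state letters
qhost -F h_vmem     # how much of the consumable h_vmem each host thinks it has left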


How can I get the program to use the larger partition so that it doesn’t run out of disk space?

Below is my original message for context:

********************************************************************************************************************************************************************************************************************************************

I tried to run an SGE job array using StarCluster on a cluster of 8 c3.4xlarge instances.  The program I was trying to run is a Perl program called fragScaff.pl.  You start it once and it creates a job array batch script, which it then submits with a qsub command.  The original process runs in the background, continuously monitoring the spawned processes for completion.  The job array batch script (run_array.csh) looks like this:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V
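# pull line number $SGE_TASK_ID out of job_array.txt and run it as this task's command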
COMMAND=$(head -n $SGE_TASK_ID ./join_default_params.r1.fragScaff/job_array.txt | tail -n 1)
$COMMAND

and the job_array.txt file looks like this:

perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,0,99 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,100,199 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,200,299 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,300,399 -r 1 -M 200
.
.
.
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,697800,697848 -r 1 -M 200

The qsub command looks like this:

qsub -t 1-6780 -N FSCF_nnnnnnnn -b y -l h_vmem=20G,virtual_free=20G ./join_default_params.r1.fragScaff/run_array.csh

The first problem I had was that once the job array tasks started spawning, the original process, running on the master node, would be killed, apparently by the kernel.  I figured maybe there was some memory resource constraint, so I edited the StarCluster config file so that the master node would not be an execution host, following the instructions here:

http://star.mit.edu/cluster/docs/0.93.3/plugins/sge.html
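
For reference, the relevant piece of my config ended up looking roughly like this (option names written from memory, so double-check them against those docs):

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
master_is_exec_host = False

plus sge in the plugins list of the cluster template (and, if I remember right, disable_queue = True).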

Then I re-launched the cluster and tried again.  This time no job array tasks ran on the master node, and the original process was not killed.  However, after a fraction of the subprocesses (maybe several hundred) had spawned and completed, the job array stalled out.  All remaining jobs ended up in qw state, and a qhost command showed all the nodes idle.

Any ideas what might have happened and/or how I might diagnose the problem further?