[StarCluster] AWS instance runs out of memory and swaps

Amirhossein Kiani amirhkiani at gmail.com
Wed Nov 16 21:00:08 EST 2011


Dear all, 

I even wrote the queue submission script myself, adding the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs are still sent to the same node even though it does not have enough memory for both, and they start running anyway. I think SGE should check the instance's memory and refuse to run multiple jobs on a machine when the jobs' combined memory requirement exceeds the memory available on that node (or maybe there is a bug in the current check).
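For the scheduler to actually enforce this per node, mem_free usually has to be made a consumable resource with a per-host value, rather than relying only on the load value. A minimal sketch, assuming a node named node001 with roughly 22G to hand out (the node name and sizes are placeholders for your own cluster):

# 1. Edit the complex configuration and change the mem_free line so that
#    "consumable" is YES (and give it a small default, e.g. 1G):
$ qconf -mc

# 2. Give each execution host a memory pool for the scheduler to subtract
#    job requests from:
$ qconf -rattr exechost complex_values mem_free=22G node001

# 3. Jobs that request memory then reserve it from that pool, so two 12G
#    jobs can no longer be dispatched to the same 22G node:
$ qsub -l mem_free=12G,h_vmem=12000M yourjob.sh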

Amir

On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:

> Hi Justin,
> 
> I'm using a third-party tool to submit the jobs, but I am setting the hard limit.
> For all my jobs I have something like this for the job description:
> 
> [root at master test]# qstat -j 1
> ==============================================================
> job_number:                 1
> exec_file:                  job_scripts/1
> submission_time:            Tue Nov  8 17:31:39 2011
> owner:                      root
> uid:                        0
> group:                      root
> gid:                        0
> sge_o_home:                 /root
> sge_o_log_name:             root
> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /data/test
> sge_o_host:                 master
> account:                    sge
> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
> hard resource_list:         h_vmem=12000M
> mail_list:                  root at master
> notify:                     FALSE
> job_name:                   SAMPLE.bin_aln-chr1
> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
> jobshare:                   0
> hard_queue_list:            all.q
> env_list:                   
> job_args:                   -c,		/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && 		/home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam 
> script_file:                /bin/sh
> verify_suitable_queues:     2
> scheduling info:            (Collecting of scheduler job information is turned off)
> 
> And I'm using the Cluster GPU Quadruple Extra Large instances, which I believe have about 23G of memory. The issue I see is that too many jobs get submitted to the same instance. I guess I need to set mem_free too? (The problem is that the tool I'm using does not seem to have a way to set that...)
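> One workaround when the submitting tool cannot pass -l options itself is SGE's cluster-wide default request file, so every qsub picks the request up automatically. A minimal sketch, assuming the default cell layout under $SGE_ROOT (12000M is just an example matching the h_vmem above):
> 
> # Appended to sge_request, this line is applied to every job submission
> # on the cluster without touching the third-party tool:
> $ echo "-l mem_free=12000M,h_vmem=12000M" >> $SGE_ROOT/default/common/sge_request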
> 
> Many thanks,
> Amir
> 
> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
> 
>> 
>> Hi Amirhossein,
>> 
>> Did you specify the memory usage in your job script or at the command line, and which parameters did you use exactly?
>> 
>> Doing a quick search, I believe the following will solve the problem, although I haven't tested it myself:
>> 
>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>> 
>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your job's memory requirements.
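>> With concrete numbers (8G and 12G here are only placeholders), that might look like:
>> 
>> $ qsub -l mem_free=8G,h_vmem=12G yourjob.sh
>> 
>> mem_free is checked against the node's available memory when the job is dispatched, while h_vmem sets a hard virtual memory limit that the execution daemon enforces on the running job.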
>> 
>> HTH,
>> 
>> ~Justin
>> 
>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>> > Dear StarCluster users,
>> > 
>> > I'm using StarCluster to set up an SGE cluster, and when I ran my
>> > job list, although I had specified the memory usage for each job, it
>> > submitted too many jobs on my instance and my instance started going
>> > out of memory and swapping.
>> > 
>> > I wonder if anyone knows how I could tell SGE the maximum memory to
>> > consider when submitting jobs to each node, so that it doesn't run
>> > the jobs if there is not enough memory available on a node.
>> > 
>> > I'm using the Cluster GPU Quadruple Extra Large instances.
>> > 
>> > Many thanks,
>> > Amirhossein Kiani
>> 
> 
