[StarCluster] AWS instance runs out of memory and swaps

Mon Dec 5 17:04:25 EST 2011

Thank you SO MUCH Rayson! I think this will save so much of my time...

Amir

On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:

> Hi Amirhossein,
> 
> I was working on a few other things, and I just saw your message -- I have to spend less time on mailing list discussions these days due to the number of things that I needed to develop and/or fix, and I am also working on a new patch release of OGS/Grid Engine 2011.11. Luckily, I just found the mail that exactly solves the issue you are encountering:
> 
> http://markmail.org/message/zdj5ebfrzhnadglf
> 
> 
> For more info, see the "job_load_adjustments" and "load_adjustment_decay_time" parameters in the Grid Engine manpage:
> 
> 
> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
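
As a rough illustration of the idea behind those two parameters, the Python sketch below models a scheduler that charges each recently dispatched job against a node's reported free memory and lets that charge decay linearly to zero. The function name, the 12000M charge, and the 450-second decay window are assumptions chosen for illustration. This is not Grid Engine's actual code, and how the correction is applied to a particular load value should be checked against the manpage.

```python
def adjusted_free_mb(reported_free_mb, dispatch_ages_s, adjust_mb, decay_s):
    """Return reported free memory minus a decaying charge per recent job.

    reported_free_mb: last free-memory value reported by the node
    dispatch_ages_s:  seconds since each recent job was dispatched
    adjust_mb:        assumed per-job correction (hypothetical value)
    decay_s:          assumed time over which the correction decays to zero
    """
    charge = 0.0
    for age in dispatch_ages_s:
        # Full charge at dispatch time, falling linearly to zero at decay_s.
        remaining = max(0.0, 1.0 - age / decay_s)
        charge += adjust_mb * remaining
    return reported_free_mb - charge


NODE_FREE_MB = 23000   # roughly the 23G node discussed in this thread
JOB_MB = 12000         # matches the h_vmem=12000M request shown by qstat

# Right after one job is dispatched, the node no longer looks big enough
# for a second 12000M job, even though its load report is still stale.
just_dispatched = adjusted_free_mb(NODE_FREE_MB, [0], JOB_MB, 450)
second_job_fits = just_dispatched >= JOB_MB
print(just_dispatched, second_job_fits)
```

Once the decay window has passed, the scheduler relies on the node's own load reports again, so the window should be at least as long as a job needs to reach its real memory footprint.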
> 
> Rayson
> 
> =================================
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net/
> 
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
> 
> 
> 
> 
> ________________________________
> From: Amirhossein Kiani <amirhkiani at gmail.com>
> To: Rayson Ho <raysonlogin at yahoo.com> 
> Cc: Justin Riley <justin.t.riley at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu> 
> Sent: Friday, December 2, 2011 6:36 PM
> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
> 
> 
> Dear Rayson,
> 
> Did you have a chance to test your solution on this? Basically, all I want is to prevent a job from running on an instance if it does not have the memory required for the job.
> 
> I would very much appreciate your help!
> 
> Many thanks,
> Amir
> 
> 
> 
> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
> 
>> Amir,
>> 
>> 
>> You can use qhost to list all the nodes and the resources that each node has.
>> 
>> 
>> I have an answer to the memory issue, but I have not had time to properly type up a response and test it.
>> 
>> 
>> 
>> Rayson
>> 
>> ________________________________
>> From: Amirhossein Kiani <amirhkiani at gmail.com>
>> To: Justin Riley <justin.t.riley at gmail.com> 
>> Cc: Rayson Ho <rayrayson at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu> 
>> Sent: Monday, November 21, 2011 1:26 PM
>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>> 
>> Hi Justin,
>> 
>> Many thanks for your reply.
>> I don't have any issue with multiple jobs running per node if there is enough memory for them. But since I know about the nature of my jobs, I can predict that only one per node should be running.
>> How can I see how much memory SGE thinks each node has? Is there a way to list that?
>> 
>> Regards,
>> Amir
>> 
>> 
>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>> 
>>> Hi Amir,
>>> 
>>> Sorry to hear you're still having issues. This is really more of an SGE
>>> issue than anything else, but perhaps Rayson can give better insight
>>> into what's going on. It seems you're using 23GB nodes and 12GB jobs. Just
>>> for drill, does 'qhost' show each node having 23GB? It definitely seems like
>>> there's a boundary issue here, given that two of your jobs together
>>> approach the total memory of the machine (23GB). Is it your goal to
>>> have only one job per node?
>>> 
>>> ~Justin
>>> 
>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>> Dear all, 
>>>> 
>>>> I even wrote the queue submission script myself, adding
>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
>>>> are still sent to one node that does not have enough memory for both,
>>>> and they start running. I think SGE should check the instance memory
>>>> and not run multiple jobs on a machine when the total memory
>>>> requirement of the jobs exceeds the memory available on the node
>>>> (or maybe there is a bug in the current check).
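
One possible explanation for the behaviour described above, sketched in Python: if the scheduler compares each request only against the last reported free memory, and that value is not reduced when a job is admitted, two jobs dispatched before the next load report can both pass the check. The function name and the rounded numbers are illustrative assumptions, not Grid Engine internals.

```python
def place_jobs(job_requests_mb, reported_free_mb):
    """Admit every job whose request fits the last *reported* free memory.

    The reported value is deliberately not reduced after an admission.
    That models a non-consumable load value which only changes when the
    node sends its next load report.
    """
    return [need for need in job_requests_mb if need <= reported_free_mb]


node_free_mb = 23000        # roughly the 23G node from this thread
jobs_mb = [12000, 12000]    # two jobs requesting 12000M each

admitted = place_jobs(jobs_mb, node_free_mb)
overcommit_mb = sum(admitted) - node_free_mb
print(admitted, overcommit_mb)
```

Both jobs are admitted because each one fits on its own, yet together they ask for more than the node has, which is the point at which the machine starts to swap.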
>>>> 
>>>> Amir
>>>> 
>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>> 
>>>>> Hi Justin,
>>>>> 
>>>>> I'm using a third-party tool to submit the jobs but I am setting the
>>>>> hard limit.
>>>>> For all my jobs I have something like this for the job description:
>>>>> 
>>>>> [root at master test]# qstat -j 1
>>>>> ==============================================================
>>>>> job_number:                 1
>>>>> exec_file:                  job_scripts/1
>>>>> submission_time:            Tue Nov  8 17:31:39 2011
>>>>> owner:                      root
>>>>> uid:                        0
>>>>> group:                      root
>>>>> gid:                        0
>>>>> sge_o_home:                 /root
>>>>> sge_o_log_name:             root
>>>>> sge_o_path:                
>>>>> /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>> sge_o_shell:                /bin/bash
>>>>> sge_o_workdir:              /data/test
>>>>> sge_o_host:                 master
>>>>> account:                    sge
>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>> *hard resource_list:         h_vmem=12000M*
>>>>> mail_list:                  root at master
>>>>> notify:                     FALSE
>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>> jobshare:                   0
>>>>> hard_queue_list:            all.q
>>>>> env_list:                  
>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh
>>>>> bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &&
>>>>> /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam 
>>>>> script_file:                /bin/sh
>>>>> verify_suitable_queues:     2
>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>> 
>>>>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>>>>> think have about 23G of memory. The issue I see is that too many
>>>>> jobs are submitted. I guess I need to set mem_free too? (The
>>>>> problem is that the tool I'm using does not seem to have a way to set that...)
>>>>> 
>>>>> Many thanks,
>>>>> Amir
>>>>> 
>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>> 
>>>>>> 
>>>> Hi Amirhossein,
>>>> 
>>>> Did you specify the memory usage in your job script or at command
>>>> line and what parameters did you use exactly?
>>>> 
>>>> Doing a quick search I believe that the following will solve the
>>>> problem although I haven't tested myself:
>>>> 
>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>> 
>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
>>>> job's memory requirements.
>>>> 
>>>> HTH,
>>>> 
>>>> ~Justin
>>>> 
>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>> Dear Star Cluster users,
>>>> 
>>>>> I'm using Star Cluster to set up an SGE and when I ran my job list,
>>>> although I had specified the memory usage for each job, it submitted
>>>> too many jobs on my instance and my instance started running out of
>>>> memory and swapping.
>>>>> I wonder if anyone knows how I could tell the SGE the max memory to
>>>> consider when submitting jobs to each node so that it doesn't run the
>>>> jobs if there is not enough memory available on a node.
>>>> 
>>>>> I'm using the Cluster GPU Quadruple Extra Large instances.
>>>> 
>>>>> Many thanks,
>>>>> Amirhossein Kiani
>>>> 
>>>> 
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>> 
>> 
>> 