[StarCluster] AWS instance runs out of memory and swaps
Rayson Ho
raysonlogin at yahoo.com
Mon Nov 21 13:29:55 EST 2011
Amir,
You can use qhost to list all the nodes and the resources that each node has.
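A minimal sketch of that check, assuming it is run on the SGE master with qhost on the PATH. The wrapper name show_node_memory is hypothetical; mem_total, mem_free and h_vmem are standard SGE resource names, but which of them report a value depends on how the cluster is configured.

```shell
# Print per-node memory as SGE sees it. Hypothetical helper; it only
# wraps qhost and reports an error when qhost is not available.
show_node_memory() {
    if command -v qhost >/dev/null 2>&1; then
        # Summary table: one row per host, including MEMTOT and MEMUSE.
        qhost
        # Per-host values for the named resources only.
        qhost -F mem_total,mem_free,h_vmem
    else
        echo "qhost not found; run this on the SGE master" >&2
        return 1
    fi
}
```

Calling show_node_memory on the master prints the summary table first, then the per-host resource values.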
I have an answer to the memory issue, but I have not had time to properly type up a response and test it.
Rayson
________________________________
From: Amirhossein Kiani <amirhkiani at gmail.com>
To: Justin Riley <justin.t.riley at gmail.com>
Cc: Rayson Ho <rayrayson at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu>
Sent: Monday, November 21, 2011 1:26 PM
Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
Hi Justin,
Many thanks for your reply.
I don't have any issue with multiple jobs running per node if there is enough memory for them. But since I know about the nature of my jobs, I can predict that only one per node should be running.
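As a rough check of the one-job-per-node expectation, the arithmetic can be sketched with figures quoted elsewhere in this thread: about 23G of memory per node and h_vmem=12000M per job. The variable names are illustrative, and 23G is taken as 23 * 1024 MB, which is an assumption since the exact instance memory is not stated.

```shell
# Assumed figures from the thread: ~23G per node, 12000M per job.
node_mem_mb=$((23 * 1024))   # 23552 MB, assuming 1G = 1024M
job_mem_mb=12000             # h_vmem=12000M requested per job

# Integer division: how many whole jobs fit in one node's memory.
jobs_per_node=$((node_mem_mb / job_mem_mb))
two_jobs_mb=$((2 * job_mem_mb))

echo "jobs that fit per node: $jobs_per_node"
echo "two jobs need ${two_jobs_mb}M of ${node_mem_mb}M"
```

Under these assumptions two jobs would need 24000M against roughly 23552M, so a second job on the same node can push it into swap if both use their full allowance and the scheduler does not hold the second one back.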
How can I see how much memory SGE thinks each node has? Is there a way to list that?
Regards,
Amir
On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
> Hi Amir,
>
> Sorry to hear you're still having issues. This is really more of an SGE
> issue than anything, but perhaps Rayson can give better insight into
> what's going on. It seems you're using 23GB nodes and 12GB jobs. Just
> for drill, does 'qhost' show each node having 23GB? It definitely seems
> like there's a boundary issue here, given that two of your jobs together
> approach the total memory of the machine (23GB). Is your goal to have
> only one job per node?
>
> ~Justin
>
> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>> Dear all,
>>
>> I even wrote the queue submission script myself, adding
>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
>> are still sent to one node that does not have enough memory for both,
>> and they start running. I think SGE should check the
>> instance memory and not run multiple jobs on a machine when the total
>> memory requirement of the jobs is above the memory available on the
>> node (or maybe there is a bug in the current check).
>>
>> Amir
>>
>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>
>>> Hi Justin,
>>>
>>> I'm using a third-party tool to submit the jobs but I am setting the
>>> hard limit.
>>> For all my jobs I have something like this for the job description:
>>>
>>> [root at master test]# qstat -j 1
>>> ==============================================================
>>> job_number: 1
>>> exec_file: job_scripts/1
>>> submission_time: Tue Nov 8 17:31:39 2011
>>> owner: root
>>> uid: 0
>>> group: root
>>> gid: 0
>>> sge_o_home: /root
>>> sge_o_log_name: root
>>> sge_o_path:
>>> /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>> sge_o_shell: /bin/bash
>>> sge_o_workdir: /data/test
>>> sge_o_host: master
>>> account: sge
>>> stderr_path_list:
>>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>> *hard resource_list: h_vmem=12000M*
>>> mail_list: root at master
>>> notify: FALSE
>>> job_name: SAMPLE.bin_aln-chr1
>>> stdout_path_list:
>>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>> jobshare: 0
>>> hard_queue_list: all.q
>>> env_list:
>>> job_args: -c,/home/apps/hugeseq/bin/hugeseq_mod.sh
>>> bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &&
>>> /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>> script_file: /bin/sh
>>> verify_suitable_queues: 2
>>> scheduling info: (Collecting of scheduler job information
>>> is turned off)
>>>
>>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>>> think have about 23G of memory. The issue that I see is that too many of
>>> the jobs are submitted. I guess I need to set mem_free too? (The
>>> problem is the tool I'm using does not seem to have a way to set that...)
>>>
>>> Many thanks,
>>> Amir
>>>
>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>
>>>>
>> Hi Amirhossein,
>>
>> Did you specify the memory usage in your job script or at command
>> line and what parameters did you use exactly?
>>
>> Doing a quick search I believe that the following will solve the
>> problem although I haven't tested myself:
>>
>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>
>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
>> job's memory requirements.
>>
>> HTH,
>>
>> ~Justin
>>
>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>> Dear Star Cluster users,
>>
>>> I'm using Star Cluster to set up an SGE and when I ran my job list,
>>> although I had specified the memory usage for each job, it submitted
>>> too many jobs on my instance and my instance started running out of
>>> memory and swapping.
>>> I wonder if anyone knows how I could tell the SGE the max memory to
>>> consider when submitting jobs to each node so that it doesn't run the
>>> jobs if there is not enough memory available on a node.
>>
>>> I'm using the Cluster GPU Quadruple Extra Large instances.
>>
>>> Many thanks,
>>> Amirhossein Kiani
>>
>>>>
>>>
>>
>>
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>