Dear Rayson,

Did you have a chance to test your solution on this? Basically, all I want is to prevent a job from running on an instance that does not have the memory required for the job.

I would very much appreciate your help!

Many thanks,
Amir


On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:

Amir,

You can use qhost to list all the nodes and the resources each node has.

I have an answer to the memory issue, but I have not had time to properly type up and test a response.

Rayson
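
A sketch of that fix, for the archive (assembled from the SGE manuals, not tested on this cluster): by default mem_free is just a load value and h_vmem is only enforced as a per-process limit, so neither makes the scheduler subtract a job's request from what a host has left. Making h_vmem consumable gives each host a memory budget that requests are counted against:

    # Mark h_vmem consumable: change the "consumable" column from NO to YES
    $ qconf -mc
    #name    shortcut  type    relop  requestable  consumable  default  urgency
    h_vmem   h_vmem    MEMORY  <=     YES          YES         0        0

    # Give each execution host a budget (repeat per node; ~23G on these instances)
    $ qconf -me node001
    complex_values        h_vmem=23G

With that in place, a node whose budget is 23G and which is already running one job submitted with -l h_vmem=12000M has only about 11G left, so the scheduler will not dispatch a second 12000M job to it.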
"<a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a>" <<a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a>> <br> <b><span style="font-weight: bold;">Sent:</span></b> Monday, November 21, 2011 1:26 PM<br> <b><span style="font-weight: bold;">Subject:</span></b> Re: [StarCluster] AWS instance runs out of memory and swaps<br> </font> <br>
Hi Justin,

Many thanks for your reply.
I don't have any issue with multiple jobs running per node if there is enough memory for them. But since I know the nature of my jobs, I can predict that only one per node should be running.
How can I see how much memory SGE thinks each node has? Is there a way to list that?

Regards,
Amir
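
For reference, qhost answers this: the MEMTOT column shows the total memory SGE believes each host has (the exact columns vary a little between SGE versions).

    $ qhost                      # MEMTOT/MEMUSE: total and used memory per host
    $ qhost -F mem_free,h_vmem   # current per-host values of named resources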

On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:

> Hi Amir,
>
> Sorry to hear you're still having issues. This is really more of an SGE
> issue than anything else, but perhaps Rayson can give better insight into
> what's going on. It seems you're using 23GB nodes and 12GB jobs. Just for
> drill, does 'qhost' show each node having 23GB? There definitely seems to
> be a boundary issue here, given that two of your jobs together approach
> the total memory of the machine (23GB). Is it your goal to have only one
> job per node?
>
> ~Justin
>
> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>> Dear all,
>>
>> I even wrote the queue submission script myself, adding the
>> mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but sometimes two jobs
>> are randomly sent to one node that does not have enough memory for both,
>> and they start running. I think SGE should check the instance's memory
>> and not run multiple jobs on a machine when the jobs' total memory
>> requirement exceeds the memory available on the node (or maybe there is
>> a bug in the current check).
>>
>> Amir
>>
>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>
>>> Hi Justin,
>>>
>>> I'm using a third-party tool to submit the jobs, but I am setting the
>>> hard limit.
>>> For all my jobs I have something like this for the job description:
>>>
>>> [root@master test]# qstat -j 1
>>> ==============================================================
>>> job_number:             1
>>> exec_file:              job_scripts/1
>>> submission_time:        Tue Nov  8 17:31:39 2011
>>> owner:                  root
>>> uid:                    0
>>> group:                  root
>>> gid:                    0
>>> sge_o_home:             /root
>>> sge_o_log_name:         root
>>> sge_o_path:             /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>> sge_o_shell:            /bin/bash
>>> sge_o_workdir:          /data/test
>>> sge_o_host:             master
>>> account:                sge
>>> stderr_path_list:       NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>> hard resource_list:     h_vmem=12000M
>>> mail_list:              root@master
>>> notify:                 FALSE
>>> job_name:               SAMPLE.bin_aln-chr1
>>> stdout_path_list:       NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>> jobshare:               0
>>> hard_queue_list:        all.q
>>> env_list:
>>> job_args:               -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>> script_file:            /bin/sh
>>> verify_suitable_queues: 2
>>> scheduling info:        (Collecting of scheduler job information is turned off)
>>>
>>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
>>> think have about 23GB of memory. The issue I see is that too many of the
>>> jobs are submitted. I guess I need to set mem_free too? (The problem is
>>> that the tool I'm using does not seem to have a way to set that...)
>>>
>>> Many thanks,
>>> Amir
>>>
>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>
>>>> Hi Amirhossein,
>>>>
>>>> Did you specify the memory usage in your job script or on the command
>>>> line, and what parameters did you use exactly?
>>>>
>>>> Doing a quick search, I believe the following will solve the problem,
>>>> although I haven't tested it myself:
>>>>
>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>
>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
>>>> job's memory requirements.
>>>>
>>>> HTH,
>>>>
>>>> ~Justin
>>>>
>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>> Dear StarCluster users,
>>>>>
>>>>> I'm using StarCluster to set up an SGE cluster, and when I ran my job
>>>>> list, although I had specified the memory usage for each job, it
>>>>> submitted too many jobs to my instance, and the instance started
>>>>> running out of memory and swapping.
>>>>> I wonder if anyone knows how I could tell SGE the maximum memory to
>>>>> consider when submitting jobs to each node, so that it doesn't run
>>>>> jobs if there is not enough memory available on a node.
>>>>>
>>>>> I'm using the Cluster GPU Quadruple Extra Large instances.
>>>>>
>>>>> Many thanks,
>>>>> Amirhossein Kiani
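
To put concrete (illustrative) numbers on Justin's suggestion above, for 12GB jobs that would be:

    $ qsub -l mem_free=12G,h_vmem=12G yourjob.sh

One caveat: mem_free on its own is a load value, compared against the memory the host reports free at scheduling time, so two large jobs dispatched in the same scheduling cycle can still land on one node; that is what the consumable h_vmem setup sketched near the top of this thread prevents.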

_______________________________________________
StarCluster mailing list
StarCluster@mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster