<html><body><div style="color:#000; background-color:#fff; font-family:times new roman, new york, times, serif;font-size:12pt"><div>Amir,</div><div><br></div><div>You can use qhost to list all the nodes and the resources that each node has.</div><div><br></div><div>I have an answer to the memory issue, but I have not had time to properly type up a response and test it.<br></div><div><br></div><div>Rayson</div><div><br></div><div><br></div><div><br></div>  <div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"> <div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"> <font size="2" face="Arial"> <hr size="1">  <b><span style="font-weight: bold;">From:</span></b> Amirhossein Kiani &lt;amirhkiani@gmail.com&gt;<br> <b><span style="font-weight: bold;">To:</span></b> Justin Riley &lt;justin.t.riley@gmail.com&gt; <br><b><span style="font-weight: bold;">Cc:</span></b> Rayson Ho &lt;rayrayson@gmail.com&gt;; "starcluster@mit.edu" &lt;starcluster@mit.edu&gt; <br> <b><span style="font-weight: bold;">Sent:</span></b> Monday, November 21, 2011 1:26 PM<br> <b><span style="font-weight: bold;">Subject:</span></b> Re: [StarCluster] AWS instance runs out of memory and swaps<br> </font> <br>

Hi Justin,<br><br>Many thanks for your reply.<br>I don't have any issue with multiple jobs running per node if there is enough memory for them. But since I know the nature of my jobs, I can predict that only one per node should be running.<br>How can I see how much memory SGE thinks each node has? Is there a way to list that?<br><br>Regards,<br>Amir<br><br><br>On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:<br><br>&gt; Hi Amir,<br>&gt; <br>&gt; Sorry to hear you're still having issues. This is really more of an SGE<br>&gt; issue than anything, but perhaps Rayson can give better insight as<br>&gt; to what's going on. It seems you're using 23GB nodes and 12GB jobs. Just<br>&gt; for drill, does 'qhost' show each node having 23GB? It definitely seems like<br>&gt; there's a boundary issue here given that two of your jobs together<br>&gt; approach the total memory of the machine (23GB). Is it your goal only<br>&gt; to have one job per

 node?<br>&gt; <br>&gt; ~Justin<br>&gt; <br>&gt; On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:<br>&gt;&gt; Dear all, <br>&gt;&gt; <br>&gt;&gt; I even wrote the queue submission script myself, adding<br>&gt;&gt; the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameter but sometimes two jobs<br>&gt;&gt; are randomly sent to one node that does not have enough memory for two<br>&gt;&gt; jobs and they start running. I think the SGE should check on the<br>&gt;&gt; instance memory and not run multiple jobs on a machine when the memory<br>&gt;&gt; requirement for the jobs in total is above the memory available in the<br>&gt;&gt; node (or maybe there is a bug in the current check)<br>&gt;&gt; <br>&gt;&gt; Amir<br>&gt;&gt; <br>&gt;&gt; On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:<br>&gt;&gt; <br>&gt;&gt;&gt; Hi Justin,<br>&gt;&gt;&gt; <br>&gt;&gt;&gt; I'm using a third-party tool to submit the jobs but I am setting the<br>&gt;&gt;&gt; hard

 limit.<br>&gt;&gt;&gt; For all my jobs I have something like this for the job description:<br>&gt;&gt;&gt; <br>&gt;&gt;&gt; [root@master test]# qstat -j 1<br>&gt;&gt;&gt; ==============================================================<br>&gt;&gt;&gt; job_number:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  1<br>&gt;&gt;&gt; exec_file:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; job_scripts/1<br>&gt;&gt;&gt; submission_time:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Tue Nov&nbsp; 8 17:31:39 2011<br>&gt;&gt;&gt; owner:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; root<br>&gt;&gt;&gt; uid:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0<br>&gt;&gt;&gt; group:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; root<br>&gt;&gt;&gt; gid:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0<br>&gt;&gt;&gt;

 sge_o_home:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  /root<br>&gt;&gt;&gt; sge_o_log_name:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  root<br>&gt;&gt;&gt; sge_o_path:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br>&gt;&gt;&gt; /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin<br>&gt;&gt;&gt; sge_o_shell:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; /bin/bash<br>&gt;&gt;&gt; sge_o_workdir:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

 /data/test<br>&gt;&gt;&gt; sge_o_host:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  master<br>&gt;&gt;&gt; account:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sge<br>&gt;&gt;&gt; stderr_path_list:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br>&gt;&gt;&gt; NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt<br>&gt;&gt;&gt; *hard resource_list:&nbsp; &nbsp; &nbsp; &nbsp;  h_vmem=12000M*<br>&gt;&gt;&gt; mail_list:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; root@master<br>&gt;&gt;&gt; notify:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  FALSE<br>&gt;&gt;&gt; job_name:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  SAMPLE.bin_aln-chr1<br>&gt;&gt;&gt; stdout_path_list:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br>&gt;&gt;&gt; NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt<br>&gt;&gt;&gt; jobshare:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;

  0<br>&gt;&gt;&gt; hard_queue_list:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; all.q<br>&gt;&gt;&gt; env_list:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  <br>&gt;&gt;&gt; job_args:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  -c,/home/apps/hugeseq/bin/hugeseq_mod.sh<br>&gt;&gt;&gt; bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &amp;&amp;<br>&gt;&gt;&gt; /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam <br>&gt;&gt;&gt; script_file:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; /bin/sh<br>&gt;&gt;&gt; verify_suitable_queues:&nbsp; &nbsp;  2<br>&gt;&gt;&gt; scheduling info:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Collecting of scheduler job information<br>&gt;&gt;&gt; is turned off)<br>&gt;&gt;&gt; <br>&gt;&gt;&gt; And I'm using the Cluster GPU Quadruple Extra Large instances which

 I<br>&gt;&gt;&gt; think have about 23GB of memory. The issue I see is that too many of the<br>&gt;&gt;&gt; jobs are submitted. I guess I need to set mem_free too? (The<br>&gt;&gt;&gt; problem is that the tool I'm using does not seem to have a way to set that...)<br>&gt;&gt;&gt; <br>&gt;&gt;&gt; Many thanks,<br>&gt;&gt;&gt; Amir<br>&gt;&gt;&gt; <br>&gt;&gt;&gt; On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:<br>&gt;&gt;&gt; <br>&gt;&gt;&gt;&gt; <br>&gt;&gt; Hi Amirhossein,<br>&gt;&gt; <br>&gt;&gt; Did you specify the memory usage in your job script or on the command<br>&gt;&gt; line, and what parameters did you use exactly?<br>&gt;&gt; <br>&gt;&gt; From a quick search, I believe that the following will solve the<br>&gt;&gt; problem, although I haven't tested it myself:<br>&gt;&gt; <br>&gt;&gt; $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh<br>&gt;&gt; <br>&gt;&gt; Here, MEM_NEEDED and MEM_MAX are the lower and

 upper bounds for your<br>&gt;&gt; job's memory requirements.<br>&gt;&gt; <br>&gt;&gt; HTH,<br>&gt;&gt; <br>&gt;&gt; ~Justin<br>&gt;&gt; <br>&gt;&gt; On 7/22/64 2:59 PM, Amirhossein Kiani wrote:<br>&gt;&gt;&gt; Dear StarCluster users,<br>&gt;&gt; <br>&gt;&gt;&gt; I'm using StarCluster to set up an SGE cluster, and when I ran my job list,<br>&gt;&gt; although I had specified the memory usage for each job, it submitted<br>&gt;&gt; too many jobs to my instance and the instance started running out of<br>&gt;&gt; memory and swapping.<br>&gt;&gt;&gt; I wonder if anyone knows how I could tell SGE the maximum memory to<br>&gt;&gt; consider when submitting jobs to each node, so that it doesn't run the<br>&gt;&gt; jobs if there is not enough memory available on a node.<br>&gt;&gt; <br>&gt;&gt;&gt; I'm using the Cluster GPU Quadruple Extra Large instances.<br>&gt;&gt; <br>&gt;&gt;&gt; Many thanks,<br>&gt;&gt;&gt; Amirhossein Kiani<br>&gt;&gt; <br>&gt;&gt;&gt;&gt;

 <br>&gt;&gt;&gt; <br>&gt;&gt; <br>&gt;&gt; _______________________________________________<br>&gt;&gt; StarCluster mailing list<br>&gt;&gt; <a ymailto="mailto:StarCluster@mit.edu" href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>&gt;&gt; http://mailman.mit.edu/mailman/listinfo/starcluster<br>&gt; <br><br><br> </div> </div>  </div></body></html>