[StarCluster] AWS instance runs out of memory and swaps

Don MacMillen macd at nimbic.com
Mon Nov 21 15:49:45 EST 2011


If you just want to restrict each node to a single job,
you can write a plugin that sets the slots to 1, using
a command something like:

 def run(self, nodes, master, user, user_shell, volumes):
     for node in nodes:
         # Limit this exec host to one slot so SGE schedules one job on it.
         cmd_strg = ('qconf -mattr exechost complex_values slots=1 %s'
                     % node.alias)
         output = master.ssh.execute(cmd_strg)

You will need to look at the StarCluster plugin documentation
to set everything up correctly. HTH.
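
For reference, here is a slightly fuller sketch of how the method above
might sit inside a complete plugin module. The module name (oneslot.py),
the class name, the slots_command helper and the "slots" setting are all
made up for illustration; only the ClusterSetup base class, the run()
signature and the qconf command come from StarCluster and SGE. I have not
run this against a live cluster, so treat it as a starting point.

```python
# oneslot.py: hypothetical StarCluster plugin that limits SGE slots per node.
try:
    from starcluster.clustersetup import ClusterSetup
except ImportError:
    # Fallback so this sketch can be imported without StarCluster installed.
    ClusterSetup = object


def slots_command(alias, slots=1):
    """Build the qconf command that sets the slot count for one exec host."""
    return 'qconf -mattr exechost complex_values slots=%d %s' % (slots, alias)


class OneSlotPerNode(ClusterSetup):
    def __init__(self, slots=1):
        # Settings read from the config file may arrive as strings,
        # so convert explicitly.
        self.slots = int(slots)

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            # qconf is executed on the master, as in the snippet above.
            master.ssh.execute(slots_command(node.alias, self.slots))
```

Assuming the file is somewhere StarCluster can import it, the config
would look something like this (section and template names are again
placeholders):

 [plugin oneslot]
 setup_class = oneslot.OneSlotPerNode

 [cluster mycluster]
 plugins = oneslot

Afterwards, 'qconf -se <hostname>' on the master should list slots=1
under complex_values for that host.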

Don

On Mon, Nov 21, 2011 at 10:29 AM, Rayson Ho <raysonlogin at yahoo.com> wrote:

> Amir,
>
> You can use qhost to list all the nodes and the resources that each node has.
>
> I have an answer to the memory issue, but I have not had time to properly
> type up a response and test it.
>
> Rayson
>
>
>
>   ------------------------------
> *From:* Amirhossein Kiani <amirhkiani at gmail.com>
> *To:* Justin Riley <justin.t.riley at gmail.com>
> *Cc:* Rayson Ho <rayrayson at gmail.com>; "starcluster at mit.edu" <
> starcluster at mit.edu>
> *Sent:* Monday, November 21, 2011 1:26 PM
> *Subject:* Re: [StarCluster] AWS instance runs out of memory and swaps
>
> Hi Justin,
>
> Many thanks for your reply.
> I don't have any issue with multiple jobs running per node if there is
> enough memory for them. But since I know about the nature of my jobs, I can
> predict that only one per node should be running.
> How can I see how much memory SGE thinks each node has? Is there a
> way to list that?
>
> Regards,
> Amir
>
>
> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>
> > Hi Amir,
> >
> > Sorry to hear you're still having issues. This is really more of an SGE
> > issue than anything, but perhaps Rayson can give better insight into
> > what's going on. It seems you're using 23GB nodes and 12GB jobs. Just
> > for drill, does 'qhost' show each node having 23GB? It definitely seems
> > like there's a boundary issue here, given that two of your jobs together
> > approach the total memory of the machine (23GB). Is it your goal to have
> > only one job per node?
> >
> > ~Justin
> >
> > On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
> >> Dear all,
> >>
> >> I even wrote the queue submission script myself, adding
> >> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameter but sometimes two jobs
> >> are randomly sent to one node that does not have enough memory for two
> >> jobs and they start running. I think the SGE should check on the
> >> instance memory and not run multiple jobs on a machine when the memory
> >> requirement for the jobs in total is above the memory available in the
> >> node (or maybe there is a bug in the current check)
> >>
> >> Amir
> >>
> >> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
> >>
> >>> Hi Justin,
> >>>
> >>> I'm using a third-party tool to submit the jobs but I am setting the
> >>> hard limit.
> >>> For all my jobs I have something like this for the job description:
> >>>
> >>> [root at master test]# qstat -j 1
> >>> ==============================================================
> >>> job_number:                1
> >>> exec_file:                  job_scripts/1
> >>> submission_time:            Tue Nov  8 17:31:39 2011
> >>> owner:                      root
> >>> uid:                        0
> >>> group:                      root
> >>> gid:                        0
> >>> sge_o_home:                /root
> >>> sge_o_log_name:            root
> >>> sge_o_path:                /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
> >>> sge_o_shell:                /bin/bash
> >>> sge_o_workdir:              /data/test
> >>> sge_o_host:                master
> >>> account:                    sge
> >>> stderr_path_list:
> >>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
> >>> *hard resource_list:        h_vmem=12000M*
> >>> mail_list:                  root at master
> >>> notify:                    FALSE
> >>> job_name:                  SAMPLE.bin_aln-chr1
> >>> stdout_path_list:
> >>> NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
> >>> jobshare:                  0
> >>> hard_queue_list:            all.q
> >>> env_list:
> >>> job_args:                  -c,/home/apps/hugeseq/bin/hugeseq_mod.sh
> >>> bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam &&
> >>> /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
> >>> script_file:                /bin/sh
> >>> verify_suitable_queues:    2
> >>> scheduling info:            (Collecting of scheduler job information
> >>> is turned off)
> >>>
> >>> And I'm using the Cluster GPU Quadruple Extra Large instances, which I
> >>> think have about 23G of memory. The issue I see is that too many of the
> >>> jobs are submitted. I guess I need to set mem_free too? (The problem is
> >>> that the tool I'm using does not seem to have a way to set that...)
> >>>
> >>> Many thanks,
> >>> Amir
> >>>
> >>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
> >>>
> >>>>
> >> Hi Amirhossein,
> >>
> >> Did you specify the memory usage in your job script or at command
> >> line and what parameters did you use exactly?
> >>
> >> Doing a quick search I believe that the following will solve the
> >> problem although I haven't tested myself:
> >>
> >> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
> >>
> >> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds for your
> >> job's memory requirements.
> >>
> >> HTH,
> >>
> >> ~Justin
> >>
> >> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
> >>> Dear Star Cluster users,
> >>
> >>> I'm using Star Cluster to set up an SGE and when I ran my job list,
> >> although I had specified the memory usage for each job, it submitted
> >> too many jobs on my instance and my instance started going out of
> >> memory and swapping.
> >>> I wonder if anyone knows how I could tell the SGE the max memory to
> >> consider when submitting jobs to each node so that it doesn't run the
> >> jobs if there is not enough memory available on a node.
> >>
> >>> I'm using the Cluster GPU Quadruple Extra Large instances.
> >>
> >>> Many thanks,
> >>> Amirhossein Kiani
> >>
> >>>>
> >>>
> >>
> >>
> >>
> >> _______________________________________________
> >> StarCluster mailing list
> >> StarCluster at mit.edu
> >> http://mailman.mit.edu/mailman/listinfo/starcluster
> >
>