[StarCluster] AWS instance runs out of memory and swaps
Justin Riley
jtriley at MIT.EDU
Mon Dec 5 18:13:00 EST 2011
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Amir,
qconf is included in the StarCluster AMIs, so there must be some other
issue you're facing. Also, I wouldn't recommend installing the
gridengine packages from Ubuntu, as they're most likely not compatible
with StarCluster's bundled version in /opt/sge6, as you're seeing.
With that said, which AMI are you using, and what does "echo $PATH"
look like when you log in as root (via sshmaster)?
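
Something like the following rough, untested sketch should show whether
the bundled SGE directory is on your PATH and which qconf the shell
picks up first. The /opt/sge6/bin/lx24-amd64 path is an assumption taken
from the sge_o_path in your earlier qstat output (adjust it if your AMI
differs), and path_has_dir is just a helper name made up for this
example:

```shell
#!/bin/sh
# Assumed location of StarCluster's bundled SGE binaries (from the
# sge_o_path quoted later in this thread); override via SGE_BIN.
SGE_BIN="${SGE_BIN:-/opt/sge6/bin/lx24-amd64}"

# path_has_dir DIR [PATHSTRING]
# Hypothetical helper: succeeds if DIR is an exact entry of the
# colon-separated PATHSTRING (defaults to the current $PATH).
path_has_dir() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

echo "PATH=$PATH"

if path_has_dir "$SGE_BIN"; then
    echo "$SGE_BIN is on PATH"
else
    echo "$SGE_BIN is NOT on PATH"
fi

# Print the qconf the shell would run first. One living outside
# /opt/sge6 most likely came from the Ubuntu gridengine packages.
command -v qconf || echo "no qconf found on PATH"
```

If the directory is missing from PATH, or qconf resolves to somewhere
outside /opt/sge6, that would point to an environment problem rather
than a missing install.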
~Justin
On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
> So I tried this and couldn't run qconf because it was not
> installed. I then tried installing it using apt-get and specified
> default for the cell name and "master" for the master name which
> is the default for the SGE created using StarCluster.
>
> However now when I want to use qconf, it says:
>
> root at master:/data/stanford/aligned# qconf -msconf
> error: commlib error: got select error (Connection refused)
> unable to send message to qmaster using port 6444 on host "master": got send error
>
>
> Any idea how I could configure it to work?
>
>
> Many thanks, Amir
>
> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
>
>> Hi Amirhossein,
>>
>> I was working on a few other things, and I just saw your message
>> -- I have to spend less time on mailing list discussions these
>> days due to the number of things that I needed to develop and/or
>> fix, and I am also working on a new patch release of OGS/Grid
>> Engine 2011.11. Luckily, I just found the mail that exactly
>> solves the issue you are encountering:
>>
>> http://markmail.org/message/zdj5ebfrzhnadglf
>>
>>
>> For more info, see the "job_load_adjustments" and
>> "load_adjustment_decay_time" parameters in the Grid Engine
>> manpage:
>>
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>>
>>
>>
>>
>> Rayson
>>
>> =================================
>> Grid Engine / Open Grid Scheduler
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>>
>>
>> ________________________________
>> From: Amirhossein Kiani <amirhkiani at gmail.com>
>> To: Rayson Ho <raysonlogin at yahoo.com>
>> Cc: Justin Riley <justin.t.riley at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu>
>> Sent: Friday, December 2, 2011 6:36 PM
>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>
>>
>> Dear Rayson,
>>
>> Did you have a chance to test your solution on this? Basically,
>> all I want is to prevent a job from running on an instance if it
>> does not have the memory required for the job.
>>
>> I would very much appreciate your help!
>>
>> Many thanks, Amir
>>
>>
>>
>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
>>
>>> Amir,
>>>
>>>
>>> You can use qhost to list all the nodes and the resources that
>>> each node has.
>>>
>>>
>>> I have an answer to the memory issue, but I have not had time
>>> to properly type up a response and test it.
>>>
>>>
>>>
>>> Rayson
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Amirhossein Kiani <amirhkiani at gmail.com>
>>> To: Justin Riley <justin.t.riley at gmail.com>
>>> Cc: Rayson Ho <rayrayson at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu>
>>> Sent: Monday, November 21, 2011 1:26 PM
>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>
>>> Hi Justin,
>>>
>>> Many thanks for your reply. I don't have any issue with
>>> multiple jobs running per node if there is enough memory for
>>> them. But since I know the nature of my jobs, I can predict
>>> that only one per node should be running. How can I see how
>>> much memory SGE thinks each node has? Is there a way to list
>>> that?
>>>
>>> Regards, Amir
>>>
>>>
>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>>>
>>>> Hi Amir,
>>>>
>>>> Sorry to hear you're still having issues. This is really
>>>> more of an SGE issue than anything, but perhaps Rayson can
>>>> give better insight into what's going on. It seems you're
>>>> using 23GB nodes and 12GB jobs. Just for drill, does 'qhost'
>>>> show each node having 23GB? It definitely seems like there's
>>>> a boundary issue here, given that two of your jobs together
>>>> approach the total memory of the machine (23GB). Is it your
>>>> goal to have only one job per node?
>>>>
>>>> ~Justin
>>>>
>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>>> Dear all,
>>>>>
>>>>> I even wrote the queue submission script myself, adding
>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but
>>>>> sometimes two jobs are randomly sent to one node that does
>>>>> not have enough memory for both, and they start running.
>>>>> I think SGE should check the instance's memory and not
>>>>> run multiple jobs on a machine when their total memory
>>>>> requirement exceeds the memory available on the node (or
>>>>> maybe there is a bug in the current check).
>>>>>
>>>>> Amir
>>>>>
>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>>>
>>>>>> Hi Justin,
>>>>>>
>>>>>> I'm using a third-party tool to submit the jobs but I am
>>>>>> setting the hard limit.
>>>>>> For all my jobs I have something like this for the job
>>>>>> description:
>>>>>>
>>>>>> [root at master test]# qstat -j 1
>>>>>> ==============================================================
>>>>>> job_number:                 1
>>>>>> exec_file:                  job_scripts/1
>>>>>> submission_time:            Tue Nov 8 17:31:39 2011
>>>>>> owner:                      root
>>>>>> uid:                        0
>>>>>> group:                      root
>>>>>> gid:                        0
>>>>>> sge_o_home:                 /root
>>>>>> sge_o_log_name:             root
>>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>>> sge_o_shell:                /bin/bash
>>>>>> sge_o_workdir:              /data/test
>>>>>> sge_o_host:                 master
>>>>>> account:                    sge
>>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>>> *hard resource_list:        h_vmem=12000M*
>>>>>> mail_list:                  root at master
>>>>>> notify:                     FALSE
>>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>>> jobshare:                   0
>>>>>> hard_queue_list:            all.q
>>>>>> env_list:
>>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>>>> script_file:                /bin/sh
>>>>>> verify_suitable_queues:     2
>>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>>>
>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
>>>>>> instances, which I think have about 23GB of memory. The
>>>>>> issue I see is that too many jobs are submitted. I guess
>>>>>> I need to set mem_free too? (The problem is that the tool
>>>>>> I'm using does not seem to have a way to set that...)
>>>>>>
>>>>>> Many thanks, Amir
>>>>>>
>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>>>
>>>>>>>
>>>>> Hi Amirhossein,
>>>>>
>>>>> Did you specify the memory usage in your job script or at
>>>>> command line and what parameters did you use exactly?
>>>>>
>>>>> Doing a quick search I believe that the following will
>>>>> solve the problem although I haven't tested myself:
>>>>>
>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>>
>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper bounds
>>>>> for your job's memory requirements.
>>>>>
>>>>> HTH,
>>>>>
>>>>> ~Justin
>>>>>
>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>>> Dear Star Cluster users,
>>>>>
>>>>>> I'm using Star Cluster to set up an SGE and when I ran
>>>>>> my job list, although I had specified the memory usage
>>>>>> for each job, it submitted too many jobs on my instance
>>>>>> and my instance started going out of memory and swapping.
>>>>>
>>>>>> I wonder if anyone knows how I could tell the SGE the
>>>>>> max memory to consider when submitting jobs to each node
>>>>>> so that it doesn't run the jobs if there is not enough
>>>>>> memory available on a node.
>>>>>
>>>>>> I'm using the Cluster GPU Quadruple Extra Large
>>>>>> instances.
>>>>>
>>>>>> Many thanks, Amirhossein Kiani
>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> StarCluster mailing list StarCluster at mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>>
>>>
>>>
>>>
>>>
>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk7dT/wACgkQ4llAkMfDcrkhtwCeI6G0tPeUnXsfZs5uXbdj6IR4
rE8An1UzMLiKVWOFLXdaVvMKdkw/RPO7
=O30r
-----END PGP SIGNATURE-----