[StarCluster] AWS instance runs out of memory and swaps

Amirhossein Kiani amirhkiani at gmail.com
Mon Dec 5 19:07:10 EST 2011


This finally solved my problem. Thank you Rayson and Justin!

Amir

On Dec 5, 2011, at 3:14 PM, Justin Riley wrote:

> 
> Also, just for drill, try running:
> 
> $ source /etc/profile
> 
> before trying qconf in case your environment was altered at some point
> in your SSH session.
> 
> ~Justin
> 
> On 12/05/2011 06:13 PM, Justin Riley wrote:
>> Amir,
>> 
>> qconf is included in the StarCluster AMIs, so there must be some
>> other issue you're facing. Also, I wouldn't recommend installing
>> the gridengine packages from Ubuntu, as they're most likely not
>> compatible with StarCluster's bundled version in /opt/sge6, as
>> you're seeing.
>> 
>> With that said, which AMI are you using and what does "echo $PATH"
>> look like when you log in as root (via sshmaster)?
>> 
>> ~Justin
>> 
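A quick way to confirm that the bundled SGE binaries are the ones on the PATH (the cluster name below is illustrative; /opt/sge6/bin/lx24-amd64 is the directory that appears in the sge_o_path of the qstat output later in this thread):

$ starcluster sshmaster mycluster
$ echo $PATH | tr ':' '\n' | grep sge
/opt/sge6/bin/lx24-amd64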
>> 
>> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
>>> So I tried this and couldn't run qconf because it was not
>>> installed. I then tried installing it using apt-get, specifying
>>> "default" for the cell name and "master" for the master host name,
>>> which are the defaults for the SGE cluster created by StarCluster.
>> 
>>> However, now when I try to use qconf, it says:
>>> 
>>> root at master:/data/stanford/aligned# qconf -msconf
>>> error: commlib error: got select error (Connection refused)
>>> unable to send message to qmaster using port 6444 on host "master": got send error
>> 
>> 
>>> Any idea how I could configure it to work?
>> 
>> 
>>> Many thanks, Amir
>> 
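The "commlib error ... Connection refused" above generally means the qconf client could not reach a running sge_qmaster, which fits the PATH/second-install explanation in the replies. A few checks that may help narrow it down (paths assume StarCluster's stock layout of /opt/sge6 with the "default" cell):

$ ps -ef | grep sge_qmaster                  # is the qmaster daemon running on master?
$ echo $SGE_ROOT                             # should be /opt/sge6 on a StarCluster node
$ cat /opt/sge6/default/common/act_qmaster   # the host SGE believes is running the qmaster
$ which qconf                                # should resolve under /opt/sge6, not the apt-get copy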
>>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
>> 
>>>> Hi Amirhossein,
>>>> 
>>>> I was working on a few other things and just saw your
>>>> message -- I have to spend less time on mailing-list
>>>> discussions these days due to the number of things that I
>>>> need to develop and/or fix, and I am also working on a new
>>>> patch release of OGS/Grid Engine 2011.11. Luckily, I just found
>>>> the mail that exactly solves the issue you are encountering:
>>>> 
>>>> http://markmail.org/message/zdj5ebfrzhnadglf
>>>> 
>>>> 
>>>> For more info, see the "job_load_adjustments" and 
>>>> "load_adjustment_decay_time" parameters in the Grid Engine 
>>>> manpage:
>>>> 
>>>> 
>>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>>>> 
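For reference, those two parameters live in the scheduler configuration and can be inspected or edited with qconf; the values shown below are the stock defaults of a typical install, not the ones recommended in the linked message:

$ qconf -ssconf | grep -E 'job_load_adjustments|load_adjustment_decay_time'
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
$ qconf -msconf        # opens the scheduler configuration in $EDITOR for editing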
>>>> 
>>>> 
>>>> 
>>>> Rayson
>>>> 
>>>> =================================
>>>> Grid Engine / Open Grid Scheduler
>>>> http://gridscheduler.sourceforge.net/
>>>> 
>>>> Scalable Grid Engine Support Program
>>>> http://www.scalablelogic.com/
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Amirhossein Kiani <amirhkiani at gmail.com>
>>>> To: Rayson Ho <raysonlogin at yahoo.com>
>>>> Cc: Justin Riley <justin.t.riley at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu>
>>>> Sent: Friday, December 2, 2011 6:36 PM
>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>> 
>>>> 
>>>> Dear Rayson,
>>>> 
>>>> Did you have a chance to test your solution on this?
>>>> Basically, all I want is to prevent a job from running on an
>>>> instance if it does not have the memory required for the job.
>>>> 
>>>> I would very much appreciate your help!
>>>> 
>>>> Many thanks, Amir
>>>> 
>>>> 
>>>> 
>>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
>>>> 
>>>>> Amir,
>>>>> 
>>>>> 
>>>>> You can use qhost to list all the node and resources that
>>>>> each node has.
>>>>> 
>>>>> 
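A sketch of what qhost reports (host names and figures below are illustrative, not taken from this cluster):

$ qhost
HOSTNAME    ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
master      lx24-amd64    16  0.10   22.5G    1.9G    0.0M    0.0M
node001     lx24-amd64    16  0.05   22.5G    1.2G    0.0M    0.0M
$ qhost -F mem_free    # shows each host's current value of the mem_free resource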
>>>>> I have an answer to the memory issue, but I have not had
>>>>> time to properly type up a response and test it.
>>>>> 
>>>>> 
>>>>> 
>>>>> Rayson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Amirhossein Kiani <amirhkiani at gmail.com>
>>>>> To: Justin Riley <justin.t.riley at gmail.com>
>>>>> Cc: Rayson Ho <rayrayson at gmail.com>; "starcluster at mit.edu" <starcluster at mit.edu>
>>>>> Sent: Monday, November 21, 2011 1:26 PM
>>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps
>>>>> 
>>>>> Hi Justin,
>>>>> 
>>>>> Many thanks for your reply. I don't have any issue with
>>>>> multiple jobs running per node if there is enough memory for
>>>>> them. But since I know the nature of my jobs, I can
>>>>> predict that only one per node should be running. How can I
>>>>> see how much memory SGE thinks each node has? Is there
>>>>> a way to list that?
>>>>> 
>>>>> Regards, Amir
>>>>> 
>>>>> 
>>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>>>>> 
>>>>>> Hi Amir,
>>>>>> 
>>>>>> Sorry to hear you're still having issues. This is really
>>>>>> more of an SGE issue than anything, but perhaps Rayson
>>>>>> can give better insight as to what's going on. It seems
>>>>>> you're using 23GB nodes and 12GB jobs. Just for drill, does
>>>>>> 'qhost' show each node having 23GB? It definitely seems like
>>>>>> there's a boundary issue here, given that two of your jobs
>>>>>> together approach the total memory of the machine (23GB).
>>>>>> Is your goal to have only one job per node?
>>>>>> 
>>>>>> ~Justin
>>>>>> 
>>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>>>>> Dear all,
>>>>>>> 
>>>>>>> I even wrote the queue submission script myself, adding
>>>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but
>>>>>>> sometimes two jobs are randomly sent to one node that
>>>>>>> does not have enough memory for both, and they start
>>>>>>> running. I think SGE should check the instance memory
>>>>>>> and not run multiple jobs on a machine when the total
>>>>>>> memory requirement of the jobs is above the memory
>>>>>>> available on the node (or maybe there is a bug in the
>>>>>>> current check).
>>>>>>> 
>>>>>>> Amir
>>>>>>> 
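One standard (not StarCluster-specific, and untested here) way to make SGE enforce this is to turn the memory resource into a consumable and give each host a capacity, so that two 12G requests can no longer be dispatched to the same ~23G node; a sketch:

$ qconf -mc            # in the complex configuration, set the consumable column for h_vmem
#name     shortcut  type    relop  requestable  consumable  default  urgency
h_vmem    h_vmem    MEMORY  <=     YES          YES         0        0

$ qconf -me node001    # then give each execution host a capacity (value illustrative)
complex_values        h_vmem=22G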
>>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>>>>> 
>>>>>>>> Hi Justin,
>>>>>>>> 
>>>>>>>> I'm using a third-party tool to submit the jobs but I
>>>>>>>> am setting the hard limit.
>>>>>>>> For all my jobs I have something like this for the job
>>>>>>>> description:
>>>>>>>> 
>>>>>>>> [root at master test]# qstat -j 1
>>>>>>>> ==============================================================
>>>>>>>> job_number:                 1
>>>>>>>> exec_file:                  job_scripts/1
>>>>>>>> submission_time:            Tue Nov  8 17:31:39 2011
>>>>>>>> owner:                      root
>>>>>>>> uid:                        0
>>>>>>>> group:                      root
>>>>>>>> gid:                        0
>>>>>>>> sge_o_home:                 /root
>>>>>>>> sge_o_log_name:             root
>>>>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>>>>> sge_o_shell:                /bin/bash
>>>>>>>> sge_o_workdir:              /data/test
>>>>>>>> sge_o_host:                 master
>>>>>>>> account:                    sge
>>>>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>>>>> *hard resource_list:        h_vmem=12000M*
>>>>>>>> mail_list:                  root at master
>>>>>>>> notify:                     FALSE
>>>>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>>>>> jobshare:                   0
>>>>>>>> hard_queue_list:            all.q
>>>>>>>> env_list:
>>>>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>>>>>> script_file:                /bin/sh
>>>>>>>> verify_suitable_queues:     2
>>>>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>>>>> 
>>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
>>>>>>>> instances, which I think have about 23GB of memory. The
>>>>>>>> issue that I see is that too many of the jobs are
>>>>>>>> submitted. I guess I need to set mem_free too? (The
>>>>>>>> problem is that the tool I'm using does not seem to have
>>>>>>>> a way to set that...)
>>>>>>>> 
>>>>>>>> Many thanks, Amir
>>>>>>>> 
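If the submitting tool does not expose qsub options, SGE also reads default request files (see sge_request(5)), so the requests can be injected there; a minimal sketch, with the 12G figure taken from the h_vmem limit above:

# ~/.sge_request -- every option in this file is added to each qsub from this account
-l mem_free=12G,h_vmem=12G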
>>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>> Hi Amirhossein,
>>>>>>> 
>>>>>>> Did you specify the memory usage in your job script or
>>>>>>> at command line and what parameters did you use exactly?
>>>>>>> 
>>>>>>> Doing a quick search I believe that the following will 
>>>>>>> solve the problem although I haven't tested myself:
>>>>>>> 
>>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>>>> 
>>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper
>>>>>>> bounds for your job's memory requirements.
>>>>>>> 
>>>>>>> HTH,
>>>>>>> 
>>>>>>> ~Justin
>>>>>>> 
>>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>>>> 
>>>>>>>> Dear Star Cluster users,
>>>>>>>> 
>>>>>>>> I'm using Star Cluster to set up an SGE and when I ran
>>>>>>>> my job list, although I had specified the memory usage
>>>>>>>> for each job, it submitted too many jobs on my instance
>>>>>>>> and my instance started going out of memory and swapping.
>>>>>>>> 
>>>>>>>> I wonder if anyone knows how I could tell the SGE the
>>>>>>>> max memory to consider when submitting jobs to each node
>>>>>>>> so that it doesn't run the jobs if there is not enough
>>>>>>>> memory available on a node.
>>>>>>>> 
>>>>>>>> I'm using the Cluster GPU Quadruple Extra Large
>>>>>>>> instances.
>>>>>>>> 
>>>>>>>> Many thanks, Amirhossein Kiani
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>> 
>> 
>> 
> 
