[StarCluster] AWS instance runs out of memory and swaps

Amirhossein Kiani amirhkiani at gmail.com
Mon Dec 5 18:16:13 EST 2011


Thanks Justin... I think the issue was I had "sudo su"'ed on the instance and qconf was not on root's path...
I tore down my cluster and am creating a new one...

On Dec 5, 2011, at 3:13 PM, Justin Riley wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Amir,
> 
> qconf is included in the StarCluster AMIs so there must be some other
> issue you're facing. Also, I wouldn't recommend installing the
> gridengine packages from ubuntu as they're most likely not compatible
> with StarCluster's bundled version in /opt/sge6 as you're seeing.
> 
> With that said which AMI are you using and what does "echo $PATH" look
> like when you login as root (via sshmaster)?
> 
> ~Justin
> 
> 
> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:
>> So I tried this and couldn't run qconf because it was not
>> installed. I then tried installing it using apt-get, specifying
>> "default" for the cell name and "master" for the master host,
>> which are the defaults for the SGE cluster created by StarCluster.
>> 
>> However now when I want to use qconf, it says:
>> 
>> root at master:/data/stanford/aligned# qconf -msconf
>> error: commlib error: got select error (Connection refused)
>> unable to send message to qmaster using port 6444 on host
>> "master": got send error
>> 
>> 
>> Any idea how I could configure it to work?
>> 
>> 
>> Many thanks, Amir
>> 
>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:
>> 
>>> Hi Amirhossein,
>>> 
>>> I was working on a few other things, and I just saw your message 
>>> -- I have to spend less time on mailing list discussions these 
>>> days due to the number of things that I needed to develop and/or 
>>> fix, and I am also working on a new patch release of OGS/Grid 
>>> Engine 2011.11. Luckily, I just found the mail that exactly 
>>> solves the issue you are encountering:
>>> 
>>> http://markmail.org/message/zdj5ebfrzhnadglf
>>> 
>>> 
>>> For more info, see the "job_load_adjustments" and 
>>> "load_adjustment_decay_time" parameters in the Grid Engine 
>>> manpage:
>>> 
>>> 
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
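[Editor's note: as a companion to the sched_conf(5) parameters Rayson mentions, here is a minimal command sketch for inspecting and editing them on the master node. The parameter names are from the man page; the values shown in the comments are SGE's documented defaults, used only for illustration. This is a config-editing sketch, not a tested recommendation.]

```shell
# Show the current scheduler configuration, including the
# job_load_adjustments and load_adjustment_decay_time lines:
qconf -ssconf

# Open the scheduler configuration in $EDITOR to change them.
# The defaults look like:
#   job_load_adjustments       np_load_avg=0.50
#   load_adjustment_decay_time 0:7:30
# job_load_adjustments temporarily inflates a host's reported load
# after each dispatch; load_adjustment_decay_time controls how long
# that artificial load takes to decay back to the measured value.
qconf -msconf
```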
>>> 
>>> 
>>> 
>>> 
>>> Rayson
>>> 
>>> ================================= Grid Engine / Open Grid 
>>> Scheduler http://gridscheduler.sourceforge.net/
>>> 
>>> Scalable Grid Engine Support Program 
>>> http://www.scalablelogic.com/
>>> 
>>> 
>>> 
>>> 
>>> ________________________________ From: Amirhossein Kiani 
>>> <amirhkiani at gmail.com> To: Rayson Ho <raysonlogin at yahoo.com> Cc: 
>>> Justin Riley <justin.t.riley at gmail.com>; "starcluster at mit.edu" 
>>> <starcluster at mit.edu> Sent: Friday, December 2, 2011 6:36 PM 
>>> Subject: Re: [StarCluster] AWS instance runs out of memory and 
>>> swaps
>>> 
>>> 
>>> Dear Rayson,
>>> 
>>> Did you have a chance to test your solution on this? Basically, 
>>> all I want is to prevent a job from running on an instance if it 
>>> does not have the memory required for the job.
>>> 
>>> I would very much appreciate your help!
>>> 
>>> Many thanks, Amir
>>> 
>>> 
>>> 
>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:
>>> 
>>> Amir,
>>>> 
>>>> 
>>>> You can use qhost to list all the nodes and the resources that
>>>> each node has.
>>>> 
>>>> 
>>>> I have an answer to the memory issue, but I have not had time
>>>> to properly type up and test a response.
>>>> 
>>>> 
>>>> 
>>>> Rayson
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________ From: Amirhossein Kiani 
>>>> <amirhkiani at gmail.com> To: Justin Riley 
>>>> <justin.t.riley at gmail.com> Cc: Rayson Ho
>>>> <rayrayson at gmail.com>; "starcluster at mit.edu"
>>>> <starcluster at mit.edu> Sent: Monday, November 21, 2011 1:26 PM
>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and
>>>> swaps
>>>> 
>>>> Hi Justin,
>>>> 
>>>> Many thanks for your reply. I don't have any issue with
>>>> multiple jobs running per node if there is enough memory for
>>>> them. But since I know the nature of my jobs, I can
>>>> predict that only one per node should be running. How can I
>>>> see how much memory SGE thinks each node has? Is there a
>>>> way to list that?
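[Editor's note: for the question above, a minimal sketch of the standard qhost invocations, run on the master node. Untested here; output columns are as documented in qhost(1).]

```shell
# One line per host, including total memory (MEMTOT) and
# currently used memory (MEMUSE) as SGE sees them:
qhost

# Show a specific resource per host, e.g. how much mem_free
# SGE currently believes each node has:
qhost -F mem_free
```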
>>>> 
>>>> Regards, Amir
>>>> 
>>>> 
>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:
>>>> 
>>>>> Hi Amir,
>>>>> 
>>>>> Sorry to hear you're still having issues. This is really
>>>>> more of an SGE issue than anything, but perhaps Rayson
>>>>> can give better insight as to what's going on. It seems
>>>>> you're using 23GB nodes and 12GB jobs. Just to check, does
>>>>> 'qhost' show each node having 23GB? There definitely seems
>>>>> to be a boundary issue here, given that two of your jobs
>>>>> together approach the total memory of the machine (23GB).
>>>>> Is it your goal to have only one job per node?
>>>>> 
>>>>> ~Justin
>>>>> 
>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:
>>>>>> Dear all,
>>>>>> 
>>>>>> I even wrote the queue submission script myself, adding
>>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but
>>>>>> sometimes two jobs are randomly sent to one node that does
>>>>>> not have enough memory for both, and they start running.
>>>>>> I think SGE should check the instance memory and not
>>>>>> run multiple jobs on a machine when the total memory
>>>>>> requirement of the jobs exceeds the memory available on
>>>>>> the node (or maybe there is a bug in the current check).
>>>>>> 
>>>>>> Amir
>>>>>> 
>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:
>>>>>> 
>>>>>>> Hi Justin,
>>>>>>> 
>>>>>>> I'm using a third-party tool to submit the jobs, but I am
>>>>>>> setting the hard limit.
>>>>>>> For all my jobs I have something like this for the job
>>>>>>> description:
>>>>>>> 
>>>>>>> [root at master test]# qstat -j 1
>>>>>>> ==============================================================
>>>>>>> job_number:                 1
>>>>>>> exec_file:                  job_scripts/1
>>>>>>> submission_time:            Tue Nov  8 17:31:39 2011
>>>>>>> owner:                      root
>>>>>>> uid:                        0
>>>>>>> group:                      root
>>>>>>> gid:                        0
>>>>>>> sge_o_home:                 /root
>>>>>>> sge_o_log_name:             root
>>>>>>> sge_o_path:                 /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin
>>>>>>> sge_o_shell:                /bin/bash
>>>>>>> sge_o_workdir:              /data/test
>>>>>>> sge_o_host:                 master
>>>>>>> account:                    sge
>>>>>>> stderr_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt
>>>>>>> *hard resource_list:        h_vmem=12000M*
>>>>>>> mail_list:                  root at master
>>>>>>> notify:                     FALSE
>>>>>>> job_name:                   SAMPLE.bin_aln-chr1
>>>>>>> stdout_path_list:           NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt
>>>>>>> jobshare:                   0
>>>>>>> hard_queue_list:            all.q
>>>>>>> env_list:
>>>>>>> job_args:                   -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam
>>>>>>> script_file:                /bin/sh
>>>>>>> verify_suitable_queues:     2
>>>>>>> scheduling info:            (Collecting of scheduler job information is turned off)
>>>>>>> 
>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large
>>>>>>> instances, which I think have about 23GB of memory. The
>>>>>>> issue I see is that too many jobs are submitted. I guess I
>>>>>>> need to set mem_free too? (The problem is the tool I'm
>>>>>>> using does not seem to have a way to set that...)
>>>>>>> 
>>>>>>> Many thanks, Amir
>>>>>>> 
>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:
>>>>>>> 
>>>>>>>> 
>>>>>> Hi Amirhossein,
>>>>>> 
>>>>>> Did you specify the memory usage in your job script or at 
>>>>>> command line and what parameters did you use exactly?
>>>>>> 
>>>>>> Doing a quick search I believe that the following will 
>>>>>> solve the problem although I haven't tested myself:
>>>>>> 
>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh
>>>>>> 
>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and upper
>>>>>> bounds for your job's memory requirements.
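[Editor's note: as later messages in the thread show, `-l mem_free=...` alone did not stop oversubscription. That symptom is consistent with mem_free being, by default, only a measured load value rather than a consumable, so jobs dispatched close together don't account for each other's reservations. A sketch of making it consumable, untested here and assuming SGE's default complex names; the host name "node001" and the 22G value are illustrative.]

```shell
# 1) Edit the complex definitions (opens $EDITOR) and change the
#    mem_free line so the resource is consumable and requestable:
#      mem_free   mf   MEMORY   <=   YES   YES   0   0
qconf -mc

# 2) Give each execution host a pool to consume from, e.g. ~22G on a
#    23GB node, leaving headroom for the OS:
qconf -aattr exechost complex_values mem_free=22G node001

# 3) Now a submission like this causes the scheduler to subtract 12G
#    from the host's remaining mem_free for the job's lifetime, so a
#    second 12G job cannot land on the same 23GB node:
qsub -l mem_free=12G,h_vmem=12G yourjob.sh
```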
>>>>>> 
>>>>>> HTH,
>>>>>> 
>>>>>> ~Justin
>>>>>> 
>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:
>>>>>>> Dear Star Cluster users,
>>>>>> 
>>>>>>> I'm using StarCluster to set up an SGE cluster, and when I
>>>>>>> ran my job list, although I had specified the memory usage
>>>>>>> for each job, it submitted too many jobs to my instance,
>>>>>>> and the instance started running out of memory and
>>>>>>> swapping.
>>>>>>> I wonder if anyone knows how I could tell SGE the maximum
>>>>>>> memory to consider when submitting jobs to each node, so
>>>>>>> that it doesn't run jobs if there is not enough memory
>>>>>>> available on a node.
>>>>>>> 
>>>>>>> I'm using the Cluster GPU Quadruple Extra Large
>>>>>>> instances.
>>>>>>> 
>>>>>>> Many thanks, Amirhossein Kiani
>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> StarCluster mailing list StarCluster at mit.edu 
>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.17 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAk7dT/wACgkQ4llAkMfDcrkhtwCeI6G0tPeUnXsfZs5uXbdj6IR4
> rE8An1UzMLiKVWOFLXdaVvMKdkw/RPO7
> =O30r
> -----END PGP SIGNATURE-----




