<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<br>
OK, in this case running 'source /etc/profile' should fix the issue
if it ever happens again - no need to terminate the cluster.<br>
<br>
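For reference, a quick way to check that after becoming root (this assumes the stock /opt/sge6 layout that StarCluster ships):<br>
<br>
$ source /etc/profile<br>
$ which qconf &nbsp;&nbsp;# should resolve to something under /opt/sge6/bin/lx24-amd64<br>
<br>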
In any event, glad you got things working. Would you mind sharing
the exact settings/procedures you used to fix the issue? This should
probably be tunable from StarCluster...<br>
<br>
~Justin<br>
<br>
<br>
On 12/5/11 6:16 PM, Amirhossein Kiani wrote:<br>
<span style="white-space: pre;">> Thanks Justin... I think the issue was that I had "sudo su"'ed on the instance and qconf was not on root's path...<br>
> I tore down my cluster and am creating a new one...<br>
><br>
> On Dec 5, 2011, at 3:13 PM, Justin Riley wrote:<br>
><br>
> Amir,<br>
><br>
> qconf is included in the StarCluster AMIs so there must be some other<br>
> issue you're facing. Also, I wouldn't recommend installing the<br>
> gridengine packages from Ubuntu as they're most likely not compatible<br>
> with StarCluster's bundled version in /opt/sge6, as you're seeing.<br>
><br>
> With that said, which AMI are you using and what does "echo $PATH" look<br>
> like when you log in as root (via sshmaster)?<br>
><br>
> ~Justin<br>
><br>
><br>
> On 12/05/2011 06:07 PM, Amirhossein Kiani wrote:<br>
> >>> So I tried this and couldn't run qconf because it was not<br>
> >>> installed. I then tried installing it using apt-get and specified<br>
> >>> "default" for the cell name and "master" for the master name, which<br>
> >>> is the default for the SGE created using StarCluster.<br>
> >>><br>
> >>> However, now when I want to use qconf, it says:<br>
> >>><br>
> >>> root@master:/data/stanford/aligned# qconf -msconf<br>
> >>> error: commlib error: got select error (Connection refused)<br>
> >>> unable to send message to qmaster using port 6444 on host "master": got send error<br>
> >>><br>
> >>> Any idea how I could configure it to work?<br>
> >>><br>
> >>><br>
> >>> Many thanks, Amir<br>
> >>><br>
> >>> On Dec 5, 2011, at 1:52 PM, Rayson Ho wrote:<br>
> >>><br>
> >>>> Hi Amirhossein,<br>
> >>>><br>
> >>>> I was working on a few other things, and I just saw your message<br>
> >>>> -- I have to spend less time on mailing list discussions these<br>
> >>>> days due to the number of things that I need to develop and/or<br>
> >>>> fix, and I am also working on a new patch release of OGS/Grid<br>
> >>>> Engine 2011.11. Luckily, I just found the mail that exactly<br>
> >>>> solves the issue you are encountering:<br>
> >>>><br>
> >>>> <a class="moz-txt-link-freetext" href="http://markmail.org/message/zdj5ebfrzhnadglf">http://markmail.org/message/zdj5ebfrzhnadglf</a><br>
> >>>><br>
> >>>><br>
> >>>> For more info, see the "job_load_adjustments" and<br>
> >>>> "load_adjustment_decay_time" parameters in the Grid Engine<br>
> >>>> manpage:<br>
> >>>><br>
> >>>> <a class="moz-txt-link-freetext" href="http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html">http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html</a><br>
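> >>>><br>
> >>>> For illustration, those two parameters appear in the "qconf -msconf"<br>
> >>>> output roughly like this (the values shown are just the stock<br>
> >>>> defaults, not a tuned recommendation):<br>
> >>>><br>
> >>>> job_load_adjustments              np_load_avg=0.50<br>
> >>>> load_adjustment_decay_time        0:7:30<br>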
> >>>><br>
> >>>><br>
> >>>><br>
> >>>><br>
> >>>> Rayson<br>
> >>>><br>
> >>>> =================================<br>
> >>>> Grid Engine / Open Grid Scheduler<br>
> >>>> <a class="moz-txt-link-freetext" href="http://gridscheduler.sourceforge.net/">http://gridscheduler.sourceforge.net/</a><br>
> >>>><br>
> >>>> Scalable Grid Engine Support Program<br>
> >>>> <a class="moz-txt-link-freetext" href="http://www.scalablelogic.com/">http://www.scalablelogic.com/</a><br>
> >>>><br>
> >>>><br>
> >>>><br>
> >>>><br>
> >>>> ________________________________<br>
> >>>> From: Amirhossein Kiani <a class="moz-txt-link-rfc2396E" href="mailto:amirhkiani@gmail.com"><amirhkiani@gmail.com></a><br>
> >>>> To: Rayson Ho <a class="moz-txt-link-rfc2396E" href="mailto:raysonlogin@yahoo.com"><raysonlogin@yahoo.com></a><br>
> >>>> Cc: Justin Riley <a class="moz-txt-link-rfc2396E" href="mailto:justin.t.riley@gmail.com"><justin.t.riley@gmail.com></a>; <a class="moz-txt-link-rfc2396E" href="mailto:starcluster@mit.edu">"starcluster@mit.edu"</a> <a class="moz-txt-link-rfc2396E" href="mailto:starcluster@mit.edu"><starcluster@mit.edu></a><br>
> >>>> Sent: Friday, December 2, 2011 6:36 PM<br>
> >>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps<br>
> >>>><br>
> >>>><br>
> >>>> Dear Rayson,<br>
> >>>><br>
> >>>> Did you have a chance to test your solution on this? Basically,<br>
> >>>> all I want is to prevent a job from running on an instance if it<br>
> >>>> does not have the memory required for the job.<br>
> >>>><br>
> >>>> I would very much appreciate your help!<br>
> >>>><br>
> >>>> Many thanks, Amir<br>
> >>>><br>
> >>>><br>
> >>>><br>
> >>>> On Nov 21, 2011, at 10:29 AM, Rayson Ho wrote:<br>
> >>>><br>
> >>>>> Amir,<br>
> >>>>><br>
> >>>>><br>
> >>>>> You can use qhost to list all the nodes and the resources that each<br>
> >>>>> node has.<br>
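> >>>>><br>
> >>>>> For example (the output below is illustrative, not taken from your cluster):<br>
> >>>>><br>
> >>>>> $ qhost<br>
> >>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS<br>
> >>>>> -------------------------------------------------------------------------------<br>
> >>>>> global                  -               -     -       -       -       -       -<br>
> >>>>> master                  lx24-amd64      8  0.01   22.5G    1.2G     0.0     0.0<br>
> >>>>> node001                 lx24-amd64      8  0.00   22.5G    0.9G     0.0     0.0<br>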
> >>>>><br>
> >>>>><br>
> >>>>> I have an answer to the memory issue, but I have not had time<br>
> >>>>> to properly type up a response and test it.<br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>> Rayson<br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>> ________________________________<br>
> >>>>> From: Amirhossein Kiani <a class="moz-txt-link-rfc2396E" href="mailto:amirhkiani@gmail.com"><amirhkiani@gmail.com></a><br>
> >>>>> To: Justin Riley <a class="moz-txt-link-rfc2396E" href="mailto:justin.t.riley@gmail.com"><justin.t.riley@gmail.com></a><br>
> >>>>> Cc: Rayson Ho <a class="moz-txt-link-rfc2396E" href="mailto:rayrayson@gmail.com"><rayrayson@gmail.com></a>; <a class="moz-txt-link-rfc2396E" href="mailto:starcluster@mit.edu">"starcluster@mit.edu"</a> <a class="moz-txt-link-rfc2396E" href="mailto:starcluster@mit.edu"><starcluster@mit.edu></a><br>
> >>>>> Sent: Monday, November 21, 2011 1:26 PM<br>
> >>>>> Subject: Re: [StarCluster] AWS instance runs out of memory and swaps<br>
> >>>>><br>
> >>>>> Hi Justin,<br>
> >>>>><br>
> >>>>> Many thanks for your reply. I don't have any issue with<br>
> >>>>> multiple jobs running per node if there is enough memory for<br>
> >>>>> them. But since I know the nature of my jobs, I can<br>
> >>>>> predict that only one per node should be running. How can I<br>
> >>>>> see how much memory SGE thinks each node has? Is there a<br>
> >>>>> way to list that?<br>
> >>>>><br>
> >>>>> Regards, Amir<br>
> >>>>><br>
> >>>>><br>
> >>>>> On Nov 21, 2011, at 8:18 AM, Justin Riley wrote:<br>
> >>>>><br>
> >>>>>> Hi Amir,<br>
> >>>>>><br>
> >>>>>> Sorry to hear you're still having issues. This is really<br>
> >>>>>> more of an SGE issue than anything, but perhaps Rayson<br>
> >>>>>> can give better insight as to what's going on. It seems<br>
> >>>>>> you're using 23GB nodes and 12GB jobs. Just as a sanity check, does<br>
> >>>>>> 'qhost' show each node having 23GB? It definitely seems like<br>
> >>>>>> there's a boundary issue here given that two of your jobs<br>
> >>>>>> together approach the total memory of the machine (23GB).<br>
> >>>>>> Is it your goal to have only one job per node?<br>
> >>>>>><br>
> >>>>>> ~Justin<br>
> >>>>>><br>
> >>>>>> On 11/16/2011 09:00 PM, Amirhossein Kiani wrote:<br>
> >>>>>>> Dear all,<br>
> >>>>>>><br>
> >>>>>>> I even wrote the queue submission script myself, adding<br>
> >>>>>>> the mem_free=MEM_NEEDED,h_vmem=MEM_MAX parameters, but<br>
> >>>>>>> sometimes two jobs are randomly sent to one node that does<br>
> >>>>>>> not have enough memory for two jobs, and they start running.<br>
> >>>>>>> I think SGE should check the instance memory and not<br>
> >>>>>>> run multiple jobs on a machine when the total memory requirement<br>
> >>>>>>> of the jobs is above the memory available on the<br>
> >>>>>>> node (or maybe there is a bug in the current check).<br>
> >>>>>>><br>
> >>>>>>> Amir<br>
> >>>>>>><br>
> >>>>>>> On Nov 8, 2011, at 5:37 PM, Amirhossein Kiani wrote:<br>
> >>>>>>><br>
> >>>>>>>> Hi Justin,<br>
> >>>>>>>><br>
> >>>>>>>> I'm using a third-party tool to submit the jobs but I am<br>
> >>>>>>>> setting the hard limit.<br>
> >>>>>>>> For all my jobs I have something like this for the job<br>
> >>>>>>>> description:<br>
> >>>>>>>><br>
> >>>>>>>> [root@master test]# qstat -j 1<br>
> >>>>>>>> ==============================================================<br>
> >>>>>>>> job_number:               1<br>
> >>>>>>>> exec_file:                job_scripts/1<br>
> >>>>>>>> submission_time:          Tue Nov 8 17:31:39 2011<br>
> >>>>>>>> owner:                    root<br>
> >>>>>>>> uid:                      0<br>
> >>>>>>>> group:                    root<br>
> >>>>>>>> gid:                      0<br>
> >>>>>>>> sge_o_home:               /root<br>
> >>>>>>>> sge_o_log_name:           root<br>
> >>>>>>>> sge_o_path:               /home/apps/bin:/home/apps/vcftools_0.1.7/bin:/home/apps/tabix-0.2.5:/home/apps/BEDTools-Version-2.14.2/bin:/home/apps/samtools/bcftools:/home/apps/samtools:/home/apps/bwa-0.5.9:/home/apps/Python-2.7.2:/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/bin:/home/apps/sjm-1.0/bin:/home/apps/hugeseq/bin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/usr/local/cuda/computeprof/bin:/usr/local/cuda/open64/bin:/opt/sge6/bin/lx24-amd64:/root/bin<br>
> >>>>>>>> sge_o_shell:              /bin/bash<br>
> >>>>>>>> sge_o_workdir:            /data/test<br>
> >>>>>>>> sge_o_host:               master<br>
> >>>>>>>> account:                  sge<br>
> >>>>>>>> stderr_path_list:         NONE:master:/data/log/SAMPLE.bin_aln-chr1_e111108173139.txt<br>
> >>>>>>>> *hard resource_list:      h_vmem=12000M*<br>
> >>>>>>>> mail_list:                root@master<br>
> >>>>>>>> notify:                   FALSE<br>
> >>>>>>>> job_name:                 SAMPLE.bin_aln-chr1<br>
> >>>>>>>> stdout_path_list:         NONE:master:/data/log/SAMPLE.bin_aln-chr1_o111108173139.txt<br>
> >>>>>>>> jobshare:                 0<br>
> >>>>>>>> hard_queue_list:          all.q<br>
> >>>>>>>> env_list:<br>
> >>>>>>>> job_args:                 -c,/home/apps/hugeseq/bin/hugeseq_mod.sh bin_sam.sh chr1 /data/chr1.bam /data/bwa_small.bam && /home/apps/hugeseq/bin/hugeseq_mod.sh sam_index.sh /data/chr1.bam<br>
> >>>>>>>> script_file:              /bin/sh<br>
> >>>>>>>> verify_suitable_queues:   2<br>
> >>>>>>>> scheduling info:          (Collecting of scheduler job information is turned off)<br>
> >>>>>>>><br>
> >>>>>>>> And I'm using the Cluster GPU Quadruple Extra Large<br>
> >>>>>>>> instances, which I think have about 23GB of memory.<br>
> >>>>>>>> The issue that I see is that too many of the jobs are<br>
> >>>>>>>> submitted. I guess I need to set the mem_free too? (The<br>
> >>>>>>>> problem is that the tool I'm using does not seem to have<br>
> >>>>>>>> a way to set that...)<br>
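> >>>>>>>><br>
> >>>>>>>> (Maybe I could put a default request like "-l mem_free=12G" into<br>
> >>>>>>>> ~/.sge_request or $SGE_ROOT/default/common/sge_request, since SGE<br>
> >>>>>>>> reads default submit options from those files even when a<br>
> >>>>>>>> third-party tool does the actual qsub? I have not tried that yet.)<br>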
> >>>>>>>><br>
> >>>>>>>> Many thanks, Amir<br>
> >>>>>>>><br>
> >>>>>>>> On Nov 8, 2011, at 5:47 AM, Justin Riley wrote:<br>
> >>>>>>>><br>
> >>>>>>>>><br>
> >>>>>>> Hi Amirhossein,<br>
> >>>>>>><br>
> >>>>>>> Did you specify the memory usage in your job script or at the<br>
> >>>>>>> command line, and what parameters did you use exactly?<br>
> >>>>>>><br>
> >>>>>>> Doing a quick search, I believe that the following will<br>
> >>>>>>> solve the problem, although I haven't tested it myself:<br>
> >>>>>>><br>
> >>>>>>> $ qsub -l mem_free=MEM_NEEDED,h_vmem=MEM_MAX yourjob.sh<br>
> >>>>>>><br>
> >>>>>>> Here, MEM_NEEDED and MEM_MAX are the lower and<br>
> >>>>>>> upper bounds for your job's memory requirements.<br>
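> >>>>>>><br>
> >>>>>>> Note that for the scheduler to actually keep jobs off a node that<br>
> >>>>>>> is already "full", the requested resource usually has to be defined<br>
> >>>>>>> as a consumable with a per-host capacity. One common (untested here)<br>
> >>>>>>> way to do that is to mark h_vmem consumable in "qconf -mc" and then<br>
> >>>>>>> set its capacity on each exec host, e.g.:<br>
> >>>>>>><br>
> >>>>>>> $ qconf -mattr exechost complex_values h_vmem=22G node001<br>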
> >>>>>>><br>
> >>>>>>> HTH,<br>
> >>>>>>><br>
> >>>>>>> ~Justin<br>
> >>>>>>><br>
> >>>>>>> On 7/22/64 2:59 PM, Amirhossein Kiani wrote:<br>
> >>>>>>>> Dear Star Cluster users,<br>
> >>>>>>><br>
> >>>>>>>> I'm using StarCluster to set up an SGE cluster, and when I ran<br>
> >>>>>>>> my job list, although I had specified the memory usage for each<br>
> >>>>>>>> job, it submitted too many jobs on my instance and my instance<br>
> >>>>>>>> started running out of memory and swapping.<br>
> >>>>>>>> I wonder if anyone knows how I could tell SGE the maximum<br>
> >>>>>>>> memory to consider when submitting jobs to each node, so that<br>
> >>>>>>>> it doesn't run the jobs if there is not enough memory<br>
> >>>>>>>> available on a node.<br>
> >>>>>>><br>
> >>>>>>>> I'm using the Cluster GPU Quadruple Extra Large<br>
> >>>>>>>> instances.<br>
> >>>>>>><br>
> >>>>>>>> Many thanks, Amirhossein Kiani<br>
> >>>>>>><br>
> >>>>>>>>><br>
> >>>>>>>><br>
> >>>>>>><br>
> >>>>>>><br>
> >>>>>>><br>
> >>>>>>> _______________________________________________<br>
> >>>>>>> StarCluster mailing list <a class="moz-txt-link-abbreviated" href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
> >>>>>>> <a class="moz-txt-link-freetext" href="http://mailman.mit.edu/mailman/listinfo/starcluster">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
> >>>>>><br>
> >>>>><br>
> >>>>><br>
> >>>>> _______________________________________________<br>
> >>>>> StarCluster mailing list <a class="moz-txt-link-abbreviated" href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
> >>>>> <a class="moz-txt-link-freetext" href="http://mailman.mit.edu/mailman/listinfo/starcluster">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
> >>>>><br>
> >>>>><br>
> >>>>><br>
> >>><br>
> >>><br>
> >>> _______________________________________________<br>
> >>> StarCluster mailing list <a class="moz-txt-link-abbreviated" href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
> >>> <a class="moz-txt-link-freetext" href="http://mailman.mit.edu/mailman/listinfo/starcluster">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
><br>
></span><br>
<br>
<br>
</body>
</html>