Hey Rajat,<div><br></div><div>I tested the load balancer today and we ran into two problems. First, we submitted an array of jobs to the queue, but the balancer treats the array as a single job and does not recognize that it needs to start more nodes. Second, the logic for shutting down nodes does not account for the hourly billing boundaries beyond the first 45 minutes. For example, if an instance has been up for 61 minutes, we have already paid for the second hour and don&#39;t want to shut that instance down yet. I have attached the XML output.</div>
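<div><br></div><div>To make both points concrete, here is a rough sketch of the behavior I would expect. It is not taken from the balancer code. The function names, the 45-minute cutoff parameter, and the assumption that the array range arrives as a string like 7-1000:1 (as qstat prints it) are all mine, for illustration only.</div>

```python
def count_array_tasks(tasks_field):
    """Count the tasks described by an SGE-style task string.

    Accepts comma-separated pieces, each either a single id ("5")
    or a range with an optional step ("7-1000:1"). Hypothetical helper.
    """
    total = 0
    for part in tasks_field.split(","):
        part = part.strip()
        if not part:
            continue
        rng, _, step = part.partition(":")
        step = int(step) if step else 1
        first, _, last = rng.partition("-")
        first = int(first)
        last = int(last) if last else first
        # Number of ids in first, first+step, ... that do not exceed last.
        total += (last - first) // step + 1
    return total


def ok_to_terminate(uptime_minutes, cutoff_minute=45):
    """Return True only when the instance is late in its current paid hour.

    Assumes billing in whole hours, so only the minutes elapsed within
    the current hour matter, not the total uptime.
    """
    return (uptime_minutes % 60) >= cutoff_minute


if __name__ == "__main__":
    print(count_array_tasks("7-1000:1"))  # pending tasks in our array job
    print(ok_to_terminate(61))            # one minute into the second hour
    print(ok_to_terminate(110))           # 50 minutes into the second hour
```

<div>With this, our pending array entry would count as 994 queued tasks instead of one job, and an instance at 61 minutes would be kept until it gets close to the end of its second hour.</div>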


<div><br></div><div>Best<br clear="all">Amaro Taylor<br>RES Group, Inc.<br>1 Broadway • Cambridge, MA 02142 • U.S.A.<br>Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email: <a href="mailto:amaro.taylor@resgroupinc.com">amaro.taylor@resgroupinc.com</a><br>


<br>Disclaimer: The information contained in this email message may be confidential. Please be careful if you forward, copy or print this message. If you have received this email in error, please immediately notify the sender and delete the message. <br>


<br><br><div class="gmail_quote">On Sun, Aug 1, 2010 at 4:12 PM, Rajat Banerjee <span dir="ltr">&lt;<a href="mailto:rbanerj@fas.harvard.edu">rbanerj@fas.harvard.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


Hi,<br>

I made a fix and committed it.<br>

<br>

<a href="http://github.com/rqbanerjee/StarCluster/commit/bace3075d9ab2f891f1b50981f5ef657e7bb0cfb" target="_blank">http://github.com/rqbanerjee/StarCluster/commit/bace3075d9ab2f891f1b50981f5ef657e7bb0cfb</a><br>

<br>

You can pull from GitHub to get the latest changes. I switched my basic<br>

&quot;qstat -xml&quot; to a broader query, &#39;qstat -q all.q -u \&quot;*\&quot; -xml&#39;, which<br>

seems to return the entire job queue on my cluster. Please let me know if<br>

it returns the right job queue on your cluster.<br>
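<br>

As a quick way to check the result, here is a minimal sketch of splitting that XML into running and pending jobs. The element names (job_list, JB_job_number, state, tasks) and the sample XML are assumptions about the shape of the qstat output, not copied from the balancer code, and the function name is made up for illustration.<br>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like what qstat -xml is assumed to return.
SAMPLE_XML = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>1</JB_job_number>
      <state>r</state>
      <tasks>5</tasks>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>1</JB_job_number>
      <state>qw</state>
      <tasks>7-1000:1</tasks>
    </job_list>
  </job_info>
</job_info>"""


def split_jobs_by_state(qstat_xml):
    """Return (running, pending) lists of job dicts parsed from the XML."""
    root = ET.fromstring(qstat_xml)
    running, pending = [], []
    # iter() walks the whole tree, so nested job_list elements are found too.
    for job in root.iter("job_list"):
        entry = {
            "number": job.findtext("JB_job_number"),
            "state": job.findtext("state"),
            "tasks": job.findtext("tasks"),  # None when the element is absent
        }
        if job.get("state") == "pending":
            pending.append(entry)
        else:
            running.append(entry)
    return running, pending


if __name__ == "__main__":
    running, pending = split_jobs_by_state(SAMPLE_XML)
    print(len(running), "running;", len(pending), "pending")
```

If the pending list comes back empty while qstat on the master shows jobs in qw, the query is probably still missing part of the queue.<br>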

<br>

Thanks,<br>

<font color="#888888">Rajat<br>

</font><div><div></div><div class="h5"><br>

On Fri, Jul 30, 2010 at 4:48 PM, Rajat Banerjee &lt;<a href="mailto:rbanerj@fas.harvard.edu">rbanerj@fas.harvard.edu</a>&gt; wrote:<br>

&gt; Hey Amaro,<br>

&gt; Thanks for the feedback. It looks like your SGE queue is much more<br>

&gt; sophisticated than mine. If I run &quot;qstat -xml&quot; it outputs a ton of<br>

&gt; info, but I&#39;m guessing that yours would not.<br>

&gt;<br>

&gt; I assume you&#39;re using the latest code, in &quot;develop&quot; mode? (Did you run<br>

&gt; &quot;python setup.py develop&quot; when you started working?)<br>

&gt;<br>

&gt; If so, open the Python file starcluster/balancers/sge/__init__.py<br>

&gt; and change line 342:<br>

&gt;<br>

&gt; qstatXml = &#39;\n&#39;.join(master.ssh.execute(&#39;source /etc/profile &amp;&amp; qstat -xml&#39;, \<br>

&gt;                                                    log_output=False))<br>

&gt;<br>

&gt; to the following:<br>

&gt;<br>

&gt; qstatXml = &#39;\n&#39;.join(master.ssh.execute(&#39;source /etc/profile &amp;&amp; qstat -xml -q all.q -f -u &quot;*&quot;&#39;, \<br>

&gt;                                                    log_output=False))<br>

&gt;<br>

&gt; I modified the args to qstat. If that works for you, I can test it and<br>

&gt; check it into the branch.<br>

&gt; Thanks,<br>

&gt; Rajat<br>

&gt;<br>

&gt; On Fri, Jul 30, 2010 at 4:40 PM, Amaro Taylor<br>

&gt; &lt;<a href="mailto:amaro.taylor@resgroupinc.com">amaro.taylor@resgroupinc.com</a>&gt; wrote:<br>

&gt;&gt; Hey,<br>

&gt;&gt;<br>

&gt;&gt; So I was testing out the load balancer today and it doesn&#39;t appear to be<br>

&gt;&gt; working. Here is the output I was getting and the output from the job on<br>

&gt;&gt; StarCluster.<br>

&gt;&gt;<br>

&gt;&gt; ssh.py:248 - ERROR - command source /etc/profile &amp;&amp; qacct -j -b 201007301725<br>

&gt;&gt; failed with status 1<br>

&gt;&gt;&gt;&gt;&gt; Oldest job is from None. # queued jobs = 0. # hosts = 2.<br>

&gt;&gt;&gt;&gt;&gt; Avg job duration = 0 sec, Avg wait time = 0 sec.<br>

&gt;&gt;&gt;&gt;&gt; Cluster change was made less than 180 seconds ago (2010-07-30<br>

&gt;&gt;&gt;&gt;&gt; 20:24:13.398974).<br>

&gt;&gt;&gt;&gt;&gt; Not changing cluster size until cluster stabilizes.<br>

&gt;&gt;&gt;&gt;&gt; Sleeping, looping again in 60 seconds.<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; It says 0 queued jobs, but that&#39;s not accurate.<br>

&gt;&gt; This is what qstat says on the master node:<br>

&gt;&gt;<br>

&gt;&gt; #########################################################################<br>

&gt;&gt;       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1<br>

&gt;&gt; 7-1000:1<br>

&gt;&gt; sgeadmin@domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q<br>

&gt;&gt; all.q -f -u &quot;*&quot;<br>

&gt;&gt; queuename                      qtype resv/used/tot. load_avg arch<br>

&gt;&gt; states<br>

&gt;&gt; ---------------------------------------------------------------------------------<br>

&gt;&gt; all.q@domU-12-31-39-01-5C-97.c BIP   0/1/1          0.52     lx24-x86<br>

&gt;&gt;       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:29:03     1 6<br>

&gt;&gt; ---------------------------------------------------------------------------------<br>

&gt;&gt; all.q@domU-12-31-39-01-5D-67.c BIP   0/1/1          1.22     lx24-x86<br>

&gt;&gt;       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5<br>

&gt;&gt;<br>

&gt;&gt; ############################################################################<br>

&gt;&gt;  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS<br>

&gt;&gt; ############################################################################<br>

&gt;&gt;       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1<br>

&gt;&gt; 7-1000:1<br>

&gt;&gt; sgeadmin@domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q<br>

&gt;&gt; all.q -f -u &quot;*&quot;<br>

&gt;&gt; queuename                      qtype resv/used/tot. load_avg arch<br>

&gt;&gt; states<br>

&gt;&gt; ---------------------------------------------------------------------------------<br>

&gt;&gt; all.q@domU-12-31-39-01-5C-97.c BIP   0/1/1          0.63     lx24-x86<br>

&gt;&gt;       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:31:03     1 8<br>

&gt;&gt; ---------------------------------------------------------------------------------<br>

&gt;&gt; all.q@domU-12-31-39-01-5D-67.c BIP   0/1/1          1.38     lx24-x86<br>

&gt;&gt;       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5<br>

&gt;&gt;<br>

&gt;&gt; Any suggestions?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; Best,<br>

&gt;&gt; Amaro Taylor<br>


&gt;&gt;<br>

&gt;&gt; _______________________________________________<br>

&gt;&gt; Starcluster mailing list<br>

&gt;&gt; <a href="mailto:Starcluster@mit.edu">Starcluster@mit.edu</a><br>

&gt;&gt; <a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;<br>

</div></div></blockquote></div><br></div>