[Starcluster] Load Balancer Problems

Rajat Banerjee rbanerj at fas.harvard.edu
Mon Aug 2 13:10:06 EDT 2010


Hey Amaro,

On the first point, would it be possible for me to give you a call about
array jobs? I'd like to understand how jobs are submitted and reported as
arrays so that I can support that scenario in the balancer.
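
To make sure I go after the right thing, here is a rough sketch (mine, not
the code currently in the balancer) of how the queued-task count could
expand an array job's task range instead of counting the array as a single
job. The XML element names are an assumption based on your qstat output, so
please correct me if they differ on your SGE version:

# Sketch only -- not the current balancer code. Assumes pending
# <job_list> entries from "qstat -xml" carry a <tasks> element such as
# "7-1000:1" for array jobs; names may differ on other SGE versions.
from xml.etree import ElementTree

def count_pending_tasks(qstat_xml):
    """Count queued tasks, expanding array ranges like '7-1000:1'."""
    root = ElementTree.fromstring(qstat_xml)
    total = 0
    for job in root.iter('job_list'):
        if job.get('state') != 'pending':
            continue
        tasks = job.findtext('tasks')
        if tasks and '-' in tasks:
            # Array job: "start-end:step" -> number of remaining tasks.
            span, _, step = tasks.partition(':')
            start, end = (int(x) for x in span.split('-'))
            total += (end - start) // int(step or 1) + 1
        else:
            total += 1  # a plain single job counts as one task
    return total

With your pending entry ("7-1000:1") this would report 994 queued tasks
rather than 1, which is the number the balancer would need in order to
decide to add nodes.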

Regarding the other point, a host being killed at 61 minutes seems
improbable. The code that calculates how many minutes the host has been up
in its current hour looks like this:

mins_up = self._minutes_uptime(node) % 60

(balancers/sge/__init__.py:546), so mins_up can never be 61. Could this
check be running late because the balancer was busy with other operations?
If so, could you please send me /tmp/starcluster-debug.log? Where did you
see the 61-minute age?
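
As a sanity check on my end (a toy illustration of the modulo above, not
the balancer's actual shutdown decision), a node that has been up 61
minutes would show up as 1 minute into its current instance-hour:

# Toy illustration of the modulo above, not the balancer's kill logic.
for uptime in (45, 59, 61, 119):
    mins_up = uptime % 60
    print("uptime=%3d min -> %2d min into the current hour" % (uptime, mins_up))
# uptime=61 min -> 1 min into the current hour, i.e. the node has just
# started its second (already-paid) hour, so it should not look like a
# 61-minute-old candidate for shutdown.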

Thank you,
Rajat
On Mon, Aug 2, 2010 at 12:51 PM, Amaro Taylor
<amaro.taylor at resgroupinc.com> wrote:
> Hey Rajat,
> So I tested out the load balancer today and we ran into two problems. The
> first is that we submitted an array of jobs to the queue; the balancer is
> treating the array as one job and not recognizing that it needs to open up
> more nodes. The second problem is that the logic for closing down nodes isn't
> taking the hour boundaries into account beyond the first 45 minutes. For
> example, if our instance has been up for 61 minutes we've already bought the
> second hour and don't want to just close that instance. I have attached the
> xml output.
> Best
> Amaro Taylor
> RES Group, Inc.
> 1 Broadway • Cambridge, MA 02142 • U.S.A.
> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
> amaro.taylor at resgroupinc.com
>
> Disclaimer: The information contained in this email message may be
> confidential. Please be careful if you forward, copy or print this message.
> If you have received this email in error, please immediately notify the
> sender and delete the message.
>
>
> On Sun, Aug 1, 2010 at 4:12 PM, Rajat Banerjee <rbanerj at fas.harvard.edu>
> wrote:
>>
>> Hi,
>> I made a fix and committed it.
>>
>>
>> http://github.com/rqbanerjee/StarCluster/commit/bace3075d9ab2f891f1b50981f5ef657e7bb0cfb
>>
>> You can pull from github to get the latest changes. I switched the basic
>> "qstat -xml" call to a broader query, 'qstat -q all.q -u \"*\" -xml', which
>> seems to return the entire job queue on my cluster. Please let me know if
>> it returns the right job queue on your cluster.
>>
>> Thanks,
>> Rajat
>>
>> On Fri, Jul 30, 2010 at 4:48 PM, Rajat Banerjee <rbanerj at fas.harvard.edu>
>> wrote:
>> > Hey Amaro,
>> > Thanks for the feedback. It looks like your SGE queue is much more
>> > sophisticated than mine. If I run "qstat -xml" on my cluster it outputs a
>> > ton of info, but I'm guessing that on yours it does not.
>> >
>> > I assume you're using the latest code, in "develop" mode? (Did you run
>> > "python setup.py develop" when you started working?)
>> >
>> > If so, open up the Python file starcluster/balancers/sge/__init__.py
>> > and change line #342:
>> >
>> > qstatXml = '\n'.join(master.ssh.execute('source /etc/profile && qstat -xml', \
>> >                                         log_output=False))
>> >
>> > to the following:
>> >
>> > qstatXml = '\n'.join(master.ssh.execute('source /etc/profile && qstat -xml -q all.q -f -u "*"', \
>> >                                         log_output=False))
>> >
>> > I modified the args to qstat: "-q all.q" restricts the query to the all.q
>> > queue, "-f" gives the full per-queue-instance view, and -u "*" includes
>> > every user's jobs. If that works for you, I can test it and check it into
>> > the branch.
>> > Thanks,
>> > Rajat
>> >
>> > On Fri, Jul 30, 2010 at 4:40 PM, Amaro Taylor
>> > <amaro.taylor at resgroupinc.com> wrote:
>> >> Hey,
>> >>
>> >> So I was testing out the load balancer today and it doesn't appear to be
>> >> working. Here is the output I was getting and the output from the job on
>> >> starcluster.
>> >>
>> >> ssh.py:248 - ERROR - command source /etc/profile && qacct -j -b 201007301725 failed with status 1
>> >>>>> Oldest job is from None. # queued jobs = 0. # hosts = 2.
>> >>>>> Avg job duration = 0 sec, Avg wait time = 0 sec.
>> >>>>> Cluster change was made less than 180 seconds ago (2010-07-30
>> >>>>> 20:24:13.398974).
>> >>>>> Not changing cluster size until cluster stabilizes.
>> >>>>> Sleeping, looping again in 60 seconds.
>> >>
>> >>
>> >> It says 0 queued jobs, but that's not accurate.
>> >> This is what qstat says on the master node:
>> >>
>> >>
>> >> #########################################################################
>> >>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
>> >> sgeadmin at domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
>> >> queuename                      qtype resv/used/tot. load_avg arch          states
>> >> ---------------------------------------------------------------------------------
>> >> all.q at domU-12-31-39-01-5C-97.c BIP   0/1/1          0.52     lx24-x86
>> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:29:03     1 6
>> >> ---------------------------------------------------------------------------------
>> >> all.q at domU-12-31-39-01-5D-67.c BIP   0/1/1          1.22     lx24-x86
>> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
>> >>
>> >> ############################################################################
>> >>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> >> ############################################################################
>> >>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
>> >> sgeadmin at domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
>> >> queuename                      qtype resv/used/tot. load_avg arch          states
>> >> ---------------------------------------------------------------------------------
>> >> all.q at domU-12-31-39-01-5C-97.c BIP   0/1/1          0.63     lx24-x86
>> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:31:03     1 8
>> >> ---------------------------------------------------------------------------------
>> >> all.q at domU-12-31-39-01-5D-67.c BIP   0/1/1          1.38     lx24-x86
>> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
>> >>
>> >> Any suggestions?
>> >>
>> >>
>> >>
>> >> Best,
>> >> Amaro Taylor
>> >> RES Group, Inc.
>> >> 1 Broadway • Cambridge, MA 02142 • U.S.A.
>> >> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
>> >> amaro.taylor at resgroupinc.com
>> >>
>> >> Disclaimer: The information contained in this email message may be
>> >> confidential. Please be careful if you forward, copy or print this
>> >> message.
>> >> If you have received this email in error, please immediately notify the
>> >> sender and delete the message.
>> >>
>> >> _______________________________________________
>> >> Starcluster mailing list
>> >> Starcluster at mit.edu
>> >> http://mailman.mit.edu/mailman/listinfo/starcluster
>> >>
>> >>
>> >
>
>



