[Starcluster] Load Balancer Problems

Amaro Taylor amaro.taylor at resgroupinc.com
Mon Aug 2 12:51:43 EDT 2010


Hey Rajat,

So I tested out the load balancer some today, and we ran into two problems.
The first is that we submitted an array of jobs to the queue, and the
balancer treats the whole array as one job, so it doesn't recognize that it
needs to open up more nodes. The second is that the logic for closing down
nodes doesn't take the hourly limits into account beyond the first 45
minutes. For example, if our instance has been up for 61 minutes, we've
already bought the second hour and don't want to just close that instance.
I have attached the XML output.
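
A rough sketch of both points, in Python. The helper names and the
comma-separated task format are assumptions for illustration and are not
taken from the StarCluster balancer; the 45-minute threshold is simply the
figure mentioned above.

```python
import re


def count_array_tasks(task_spec):
    """Count the tasks in an SGE-style task field such as '7-1000:1' or '5'.

    Assumption: the field is a comma-separated list of 'first-last:step'
    ranges or single task IDs.
    """
    total = 0
    for part in task_spec.split(","):
        m = re.fullmatch(r"(\d+)(?:-(\d+)(?::(\d+))?)?", part.strip())
        if m is None:
            raise ValueError("unrecognized task spec: %r" % part)
        first = int(m.group(1))
        last = int(m.group(2)) if m.group(2) else first
        step = int(m.group(3)) if m.group(3) else 1
        total += (last - first) // step + 1
    return total


def ok_to_terminate(uptime_minutes, threshold=45):
    """Return True only if the instance is late in its current paid hour.

    Uses minutes into the current hour (uptime modulo 60), so an instance
    that has been up 61 minutes counts as 1 minute into its second hour.
    """
    return uptime_minutes % 60 >= threshold
```

With the pending array from the qstat output quoted below ("7-1000:1"),
count_array_tasks gives 994 tasks rather than one job, and
ok_to_terminate(61) is False because only one minute of the second hour has
been used.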

Best
Amaro Taylor
RES Group, Inc.
1 Broadway • Cambridge, MA 02142 • U.S.A.
Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
amaro.taylor at resgroupinc.com

Disclaimer: The information contained in this email message may be
confidential. Please be careful if you forward, copy or print this message.
If you have received this email in error, please immediately notify the
sender and delete the message.


On Sun, Aug 1, 2010 at 4:12 PM, Rajat Banerjee <rbanerj at fas.harvard.edu> wrote:

> Hi,
> I made a fix and committed it.
>
>
> http://github.com/rqbanerjee/StarCluster/commit/bace3075d9ab2f891f1b50981f5ef657e7bb0cfb
>
> You can pull from GitHub to get the latest stuff. I switched my basic
> "qstat -xml" to a broader query: 'qstat -q all.q -u \"*\" -xml'. It
> seems to get the entire job queue on my cluster. Please let me know if
> it gets the right job queue on your cluster.
>
> Thanks,
> Rajat
>
> On Fri, Jul 30, 2010 at 4:48 PM, Rajat Banerjee <rbanerj at fas.harvard.edu>
> wrote:
> > Hey Amaro,
> > Thanks for the feedback. It looks like your SGE queue is much more
> > sophisticated than mine. If I run "qstat -xml" it outputs a ton of
> > info, but I'm guessing that yours would not.
> >
> > I assume you're using the latest code, in "develop" mode? (Did you run
> > "python setup.py develop" when you started working?)
> >
> > If so, open up the Python file starcluster/balancers/sge/__init__.py
> > and change line 342:
> >
> > qstatXml = '\n'.join(master.ssh.execute(
> >     'source /etc/profile && qstat -xml', log_output=False))
> >
> > to the following:
> >
> > qstatXml = '\n'.join(master.ssh.execute(
> >     'source /etc/profile && qstat -xml -q all.q -f -u "*"',
> >     log_output=False))
> >
> > I modified the args to qstat. If that works for you, I can test it and
> > check it into the branch.
> > Thanks,
> > Rajat
> >
> > On Fri, Jul 30, 2010 at 4:40 PM, Amaro Taylor
> > <amaro.taylor at resgroupinc.com> wrote:
> >> Hey,
> >>
> >> So I was testing out the load balancer today and it doesn't appear to be
> >> working. Here is the output I was getting and the output from the job on
> >> StarCluster.
> >>
> >> ssh.py:248 - ERROR - command source /etc/profile && qacct -j -b 201007301725 failed with status 1
> >> >>> Oldest job is from None. # queued jobs = 0. # hosts = 2.
> >> >>> Avg job duration = 0 sec, Avg wait time = 0 sec.
> >> >>> Cluster change was made less than 180 seconds ago (2010-07-30 20:24:13.398974).
> >> >>> Not changing cluster size until cluster stabilizes.
> >> >>> Sleeping, looping again in 60 seconds.
> >>
> >>
> >> It says 0 queued jobs, but that's not accurate.
> >> This is what qstat says on the master node:
> >>
> >>
> >> #########################################################################
> >>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
> >> sgeadmin@domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
> >> queuename                      qtype resv/used/tot. load_avg arch          states
> >> ---------------------------------------------------------------------------------
> >> all.q@domU-12-31-39-01-5C-97.c BIP   0/1/1          0.52     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:29:03     1 6
> >> ---------------------------------------------------------------------------------
> >> all.q@domU-12-31-39-01-5D-67.c BIP   0/1/1          1.22     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
> >>
> >> ############################################################################
> >>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> >> ############################################################################
> >>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
> >> sgeadmin@domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
> >> queuename                      qtype resv/used/tot. load_avg arch          states
> >> ---------------------------------------------------------------------------------
> >> all.q@domU-12-31-39-01-5C-97.c BIP   0/1/1          0.63     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:31:03     1 8
> >> ---------------------------------------------------------------------------------
> >> all.q@domU-12-31-39-01-5D-67.c BIP   0/1/1          1.38     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
> >>
> >> Any suggestions?
> >>
> >>
> >>
> >> Best,
> >> Amaro Taylor
> >>
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qstat.out
Type: application/octet-stream
Size: 1062 bytes
Desc: not available
Url : http://mailman.mit.edu/pipermail/starcluster/attachments/20100802/26b32481/attachment.obj

