[Starcluster] Load Balancer Problems
Rajat Banerjee
rbanerj at fas.harvard.edu
Fri Jul 30 16:48:26 EDT 2010
Hey Amaro,
Thanks for the feedback. It looks like your SGE queue is much more
sophisticated than mine. If I run "qstat -xml" it outputs a ton of
info, but I'm guessing that yours would not.
I assume you're using the latest code, in "develop" mode? (Did you run
"python setup.py develop" when you started working?)
If so, open up the Python file starcluster/balancers/sge/__init__.py
and change line 342 from:

    qstatXml = '\n'.join(master.ssh.execute(
        'source /etc/profile && qstat -xml',
        log_output=False))

to the following:

    qstatXml = '\n'.join(master.ssh.execute(
        'source /etc/profile && qstat -xml -q all.q -f -u "*"',
        log_output=False))
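The only change is the extra qstat arguments (-q all.q -f -u "*"), which
request the full listing of the all.q queue for every user. If that works
for you, I can test it and check it into the branch.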
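
To check what the balancer would see, here is a minimal sketch of counting jobs per state from the XML that qstat returns. It is not the actual StarCluster parsing code: the function name count_jobs_by_state and the trimmed-down sample document are hypothetical, and the sample assumes the usual layout in which each job is a job_list element carrying a state attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample of what
# 'qstat -xml -q all.q -f -u "*"' might return. Real output carries
# many more fields per job; only the parts used below are shown.
SAMPLE_QSTAT_XML = """<job_info>
  <queue_info>
    <Queue-List>
      <name>all.q@node001</name>
      <job_list state="running">
        <JB_job_number>1</JB_job_number>
        <state>r</state>
      </job_list>
    </Queue-List>
    <Queue-List>
      <name>all.q@node002</name>
      <job_list state="running">
        <JB_job_number>1</JB_job_number>
        <state>r</state>
      </job_list>
    </Queue-List>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>1</JB_job_number>
      <state>qw</state>
    </job_list>
  </job_info>
</job_info>
"""


def count_jobs_by_state(qstat_xml):
    """Return a dict mapping each job_list 'state' attribute to a count."""
    root = ET.fromstring(qstat_xml)
    counts = {}
    # iter() walks the whole tree, so jobs nested under a queue entry
    # and jobs in the pending section are both found.
    for job in root.iter("job_list"):
        state = job.get("state", "unknown")
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_jobs_by_state(SAMPLE_QSTAT_XML))
```

If the pending count comes back as 0 when you feed it the real output from your master node, the XML itself is missing the queued jobs and the problem is in the qstat arguments rather than in the parsing.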
Thanks,
Rajat
On Fri, Jul 30, 2010 at 4:40 PM, Amaro Taylor
<amaro.taylor at resgroupinc.com> wrote:
> Hey,
>
> So I was testing out the Load Balancer today and it doesn't appear to be
> working. Here is the output I was getting and the output from the job on
> StarCluster.
>
> ssh.py:248 - ERROR - command source /etc/profile && qacct -j -b 201007301725
> failed with status 1
>>>> Oldest job is from None. # queued jobs = 0. # hosts = 2.
>>>> Avg job duration = 0 sec, Avg wait time = 0 sec.
>>>> Cluster change was made less than 180 seconds ago (2010-07-30
>>>> 20:24:13.398974).
>>>> Not changing cluster size until cluster stabilizes.
>>>> Sleeping, looping again in 60 seconds.
>
>
> It says 0 queued jobs, but that's not accurate.
> This is what qstat says on the master node:
>
> #########################################################################
> 1 0.55500 Bone_Estim sgeadmin qw 07/30/2010 20:26:20 1
> 7-1000:1
> sgeadmin at domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q
> all.q -f -u "*"
> queuename qtype resv/used/tot. load_avg arch
> states
> ---------------------------------------------------------------------------------
> all.q at domU-12-31-39-01-5C-97.c BIP 0/1/1 0.52 lx24-x86
> 1 0.55500 Bone_Estim sgeadmin r 07/30/2010 20:29:03 1 6
> ---------------------------------------------------------------------------------
> all.q at domU-12-31-39-01-5D-67.c BIP 0/1/1 1.22 lx24-x86
> 1 0.55500 Bone_Estim sgeadmin r 07/30/2010 20:28:33 1 5
>
> ############################################################################
> - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
> 1 0.55500 Bone_Estim sgeadmin qw 07/30/2010 20:26:20 1
> 7-1000:1
> sgeadmin at domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q
> all.q -f -u "*"
> queuename qtype resv/used/tot. load_avg arch
> states
> ---------------------------------------------------------------------------------
> all.q at domU-12-31-39-01-5C-97.c BIP 0/1/1 0.63 lx24-x86
> 1 0.55500 Bone_Estim sgeadmin r 07/30/2010 20:31:03 1 8
> ---------------------------------------------------------------------------------
> all.q at domU-12-31-39-01-5D-67.c BIP 0/1/1 1.38 lx24-x86
> 1 0.55500 Bone_Estim sgeadmin r 07/30/2010 20:28:33 1 5
>
> Any suggestions?
>
>
>
> Best,
> Amaro Taylor
> RES Group, Inc.
> 1 Broadway • Cambridge, MA 02142 • U.S.A.
> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
> amaro.taylor at resgroupinc.com