[Starcluster] Load Balancer Problems
Rajat Banerjee
rbanerj at fas.harvard.edu
Sun Aug 1 16:12:10 EDT 2010
Hi,
I made a fix and committed it.
http://github.com/rqbanerjee/StarCluster/commit/bace3075d9ab2f891f1b50981f5ef657e7bb0cfb
You can pull from github to get the latest code. I switched the basic
"qstat -xml" call to a broader query, 'qstat -q all.q -u \"*\" -xml', which
seems to get the entire job queue on my cluster. Please let me know if it
gets the right job queue on yours.
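
For anyone who wants to sanity-check what the balancer sees, here is a
rough sketch of counting running and pending jobs in that XML. This is
only an illustration of what the balancer pulls out of qstat, not the
actual StarCluster parsing code, and the function name is made up:

import xml.etree.ElementTree as ET

def count_jobs(qstat_xml):
    # In 'qstat -xml' output, each <job_list> element carries a state
    # attribute: 'running' for jobs listed under <queue_info> and
    # 'pending' for jobs under the nested <job_info> section.
    root = ET.fromstring(qstat_xml)
    jobs = root.findall('.//job_list')
    running = sum(1 for j in jobs if j.get('state') == 'running')
    pending = sum(1 for j in jobs if j.get('state') == 'pending')
    return running, pending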
Thanks,
Rajat
On Fri, Jul 30, 2010 at 4:48 PM, Rajat Banerjee <rbanerj at fas.harvard.edu> wrote:
> Hey Amaro,
> Thanks for the feedback. It looks like your SGE queue is much more
> sophisticated than mine. When I run "qstat -xml" on my cluster it outputs
> a ton of info, but I'm guessing that yours does not.
>
> I assume you're using the latest code, in "develop" mode? (Did you run
> "python setup.py develop" when you started working?)
>
> If so, open the Python file starcluster/balancers/sge/__init__.py
> and change line 342:
>
> qstatXml = '\n'.join(master.ssh.execute('source /etc/profile && qstat -xml', \
>                                         log_output=False))
>
> to the following:
>
> qstatXml = '\n'.join(master.ssh.execute('source /etc/profile && qstat -xml -q all.q -f -u "*"', \
>                                         log_output=False))
>
> I modified the args to qstat. If that works for you, I can test it and
> check it into the branch.
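>
> If you want to eyeball the raw output first, you can run the same
> command by hand on the master node; "mycluster" below is just a
> placeholder for your cluster name:
>
> $ starcluster sshmaster mycluster
> # source /etc/profile && qstat -xml -q all.q -f -u "*"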
> Thanks,
> Rajat
>
> On Fri, Jul 30, 2010 at 4:40 PM, Amaro Taylor
> <amaro.taylor at resgroupinc.com> wrote:
>> Hey,
>>
>> So I was testing out the load balancer today and it doesn't appear to be
>> working. Here is the output I was getting, along with the output from the
>> job on StarCluster.
>>
>> ssh.py:248 - ERROR - command source /etc/profile && qacct -j -b 201007301725 failed with status 1
>>>>> Oldest job is from None. # queued jobs = 0. # hosts = 2.
>>>>> Avg job duration = 0 sec, Avg wait time = 0 sec.
>>>>> Cluster change was made less than 180 seconds ago (2010-07-30 20:24:13.398974).
>>>>> Not changing cluster size until cluster stabilizes.
>>>>> Sleeping, looping again in 60 seconds.
>>
>>
>> It says 0 queued jobs, but that's not accurate. This is what qstat says
>> on the master node:
>>
>> #########################################################################
>>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
>> sgeadmin at domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q at domU-12-31-39-01-5C-97.c BIP   0/1/1          0.52     lx24-x86
>>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:29:03     1 6
>> ---------------------------------------------------------------------------------
>> all.q at domU-12-31-39-01-5D-67.c BIP   0/1/1          1.22     lx24-x86
>>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
>> sgeadmin at domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q at domU-12-31-39-01-5C-97.c BIP   0/1/1          0.63     lx24-x86
>>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:31:03     1 8
>> ---------------------------------------------------------------------------------
>> all.q at domU-12-31-39-01-5D-67.c BIP   0/1/1          1.38     lx24-x86
>>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
>>
>> Any suggestions?
>>
>>
>>
>> Best,
>> Amaro Taylor
>> RES Group, Inc.
>> 1 Broadway • Cambridge, MA 02142 • U.S.A.
>> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
>> amaro.taylor at resgroupinc.com
>>