[StarCluster] Possible bug in loadbalancer?

Justin Riley jtriley at MIT.EDU
Wed May 18 10:59:42 EDT 2011


Hey Raj/Don,

Looking at the log, the load balancer eventually throws an exception while parsing the XML from qhost/qstat, which halts Don's ELB session completely and forces a restart. Apparently there are cases where qstat/qhost return invalid XML, most likely when they print an error message instead. This is difficult to fix without seeing what is actually being returned, and the qhost and qstat commands are currently run with log_output=False. Since this is hard to reproduce, I'll remove log_output=False from those commands; if it happens again, the logs will show what SGE returned in this case.

Don, I will send another email when this change has been made. You'll need to pull the latest code for it to take effect. Please send us the relevant logs if this happens again and we should be able to come up with a fix for it.
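
For reference, here is a minimal sketch of the kind of defensive parsing that could keep a bad qhost/qstat response from killing the session. It is not the actual StarCluster code: the function name safe_parse_xml, the logger name, and the sample inputs in the usage note are all hypothetical, and the sketch assumes the caller can simply skip a polling cycle when parsing fails.

```python
import logging
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("elb-sketch")


def safe_parse_xml(raw, source="qhost"):
    """Return a parsed DOM, or None if `raw` is not well-formed XML.

    On failure the raw command output is logged so that the offending
    response from SGE can be inspected afterwards.
    """
    try:
        return xml.dom.minidom.parseString(raw)
    except ExpatError as exc:
        # Log the raw text instead of letting the exception propagate
        # and halt the load balancer loop.
        log.error("%s returned unparsable XML (%s); raw output: %r",
                  source, exc, raw)
        return None
```

With something like this, a call such as safe_parse_xml("qhost: some error text") returns None rather than raising ExpatError, and the caller can wait for the next polling interval before trying again.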

~Justin


On May 17, 2011, at 9:37 AM, Rajat Banerjee wrote:

> Hi Don,
> Did this halt your ELB and force you to restart it? If not, I wouldn't
> worry about it. When SGE is making internal changes (adding a node,
> removing a node, or just starting up), the calls to qhost or qstat
> will periodically return a bad exit code, causing the following two
> errors you noticed:
> PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qhost -xml' failed with status 127
> PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qstat -q all.q -u "*" -xml' failed with status 127
> 
> 
> If the ELB was able to continue working, I would not consider it a
> bug. qstat and qhost will be called again in the next polling interval
> and the data store will be populated with job data then.
> 
> Thanks for your feedback, and let us know if you have more issues.
> Raj
> 
> 
> On Tue, May 17, 2011 at 12:15 AM, Don MacMillen <macd at physware.com> wrote:
>> Hi,
>> 
>> Trolling through the log files, I found several errors, all of the
>> following form. I believe I have been using ELB correctly, but anything
>> is possible and I do not yet have much experience with it. This version
>> is from the git repo, which I cloned last Friday.
>> 
>> I doubt that I can do much to reproduce this error, so if this trace
>> helps, that's great. I will keep an eye out for other misbehavior as well.
>> 
>> Thanks for the great work you guys have put into starcluster and the elb.
>> 
>> Regards,
>> 
>> Don MacMillen
>> PhysWare
>> 
>> 
>> PID: 1822 __init__.py:481 - INFO - Jobstats cache is not full. Pulling full job history.
>> PID: 1822 __init__.py:486 - DEBUG - getting past 10800 seconds worth of job history.
>> PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qhost -xml' failed with status 127
>> PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qstat -q all.q -u "*" -xml' failed with status 127
>> PID: 1822 ssh.py:400 - DEBUG - command source /etc/profile && qacct -j -b 201105162111 failed with status 127
>> PID: 1822 __init__.py:524 - DEBUG - sizes: qhost: 30, qstat: 30, qacct: 30.
>> PID: 1822 cli.py:184 - DEBUG - Traceback (most recent call last):
>>   File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
>>     sc.execute(args)
>>   File "build/bdist.linux-i686/egg/starcluster/commands/loadbalance.py", line 91, in execute
>>     lb.run(cluster)
>>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 570, in run
>>     if self.get_stats() == -1:
>>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 525, in get_stats
>>     self.stat.parse_qhost(qhostxml)
>>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 50, in parse_qhost
>>     doc = xml.dom.minidom.parseString(string)
>>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>>     return expatbuilder.parseString(string)
>>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>>     return builder.parseString(string)
>>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>>     parser.Parse(string, True)
>> ExpatError: syntax error: line 1, column 0
>> 
>> PID: 1822 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
>> PID: 1822 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
>> PID: 1822 cli.py:131 - ERROR - Look for lines starting with PID: 1822
>> PID: 1822 cli.py:132 - ERROR - Please submit this file, minus any private information,
>> PID: 1822 cli.py:133 - ERROR - to starcluster at mit.edu
>> PID: 1822 ssh.py:534 - DEBUG - __del__ called
>> PID: 1822 ssh.py:534 - DEBUG - __del__ called
>> 
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>> 
>> 
> 




