[StarCluster] loadbalance error
Rayson Ho
raysonlogin at yahoo.com
Tue Jan 17 21:28:19 EST 2012
Hi Wei,
I've identified the issue - and thanks for sending us the log (Justin's crash log tip is helpful!).
So in your log file:
2012-01-17 22:45:35,668 PID: 9177 ssh.py:531 - ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
2012-01-17 22:45:35,669 PID: 9177 ssh.py:536 - DEBUG - error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
When Grid Engine (Open Grid Scheduler) clients try to talk to the master but get no response, the
communication library prints the "failed receiving gdi request response for mid=1" error
message and the client exits with a non-zero status.
I think a few things could have caused this, such as lost packets or the master being too busy.
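To connect the log above with the ExpatError in the original traceback: one plausible reading (an assumption on my part, not confirmed against the StarCluster source) is that the load balancer handed whatever the failed qhost printed to the XML parser. A minimal sketch, using the error text from the log as a stand-in for the command output; try_parse is a hypothetical helper, not a StarCluster function:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Stand-in for what a failed "qhost -xml" might hand back: the error text
# from the log rather than an XML document (this is an assumption).
failed_output = ("error: failed receiving gdi request response for mid=1 "
                 "(got syncron message receive timeout error).")


def try_parse(text):
    """Return (True, document) if text parses as XML, else (False, message)."""
    try:
        return True, xml.dom.minidom.parseString(text)
    except ExpatError as exc:
        return False, str(exc)


ok, detail = try_parse(failed_output)
print("parsed=%s detail=%s" % (ok, detail))
```

If this reading is right, checking the exit status before parsing avoids the parser error altogether.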
To Wei: The quickest fix before the next StarCluster release:
Use shell script wrappers for qstat, qhost, qacct:
% cd /opt/sge6/bin/linux-x64/
% mv qhost qhost.real
... etc
And then write a shell script that calls qhost.real and, if the exit status is non-zero, sleeps for
10 seconds and tries again. Note that you may want to break out of the loop after a number of
attempts, so that the wrapper does not loop forever if qhost exits with non-zero every single time.
(Google for shell wrappers and you will see many wrapper examples - should be around 10 lines of
bash.)
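For illustration, here is the same retry idea sketched in Python rather than bash. The path follows the "mv qhost qhost.real" step; MAX_TRIES, DELAY_SECONDS, run_with_retry and main are hypothetical names and values chosen for this sketch, not anything shipped with StarCluster or Grid Engine:

```python
import subprocess
import sys
import time

# Hypothetical values: the path follows the "mv qhost qhost.real" step;
# the retry count and delay are illustrative, not StarCluster settings.
REAL_QHOST = "/opt/sge6/bin/linux-x64/qhost.real"
MAX_TRIES = 5
DELAY_SECONDS = 10


def run_with_retry(argv, max_tries=MAX_TRIES, delay=DELAY_SECONDS):
    """Run argv until it exits 0 or max_tries attempts have been made.

    Returns (exit_status, stdout_bytes) from the last attempt.
    """
    status, out = 1, b""
    for attempt in range(1, max_tries + 1):
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        out, _ = proc.communicate()
        status = proc.returncode
        if status == 0:
            break
        if attempt < max_tries:
            time.sleep(delay)  # give the master some time before retrying
    return status, out


def main():
    """Wrapper entry point: forward the arguments to the real qhost."""
    status, out = run_with_retry([REAL_QHOST] + sys.argv[1:])
    # Write only the final attempt's output, so the caller never sees
    # output from failed tries mixed in.
    stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(out)
    stream.flush()
    return status
```

Installed in place of qhost, the script would finish by calling sys.exit(main()), so the caller still sees the exit status of the last attempt. The same pattern would apply to qstat and qacct.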
To Justin & Rajat: IMO, the quick fix is to handle the error in the load balancer, as communication
errors can occur at any time and the load balancer should handle failures transparently.
I think we should add some code in get_stats() and run() so that an error
in qhost, qstat, or qacct is detected and handled.
In get_stats():

    qhostxml = '\n'.join(master.ssh.execute(...))
    # a non-zero exit status means the command failed; tell run() to retry
    if master.ssh.get_last_status() != 0:
        return 1
    qstatxml = '\n'.join(master.ssh.execute(...))
    if master.ssh.get_last_status() != 0:
        return 1
    qacct = '\n'.join(master.ssh.execute(...))
    if master.ssh.get_last_status() != 0:
        return 1

In run():

    ret = self.get_stats()
    # 1: a command failed, so wait one polling interval and try again
    if ret == 1:
        time.sleep(self.polling_interval)
        continue
    # -1: stats could not be obtained, so stop the load balancer
    if ret == -1:
        log.error("Failed to get stats. LoadBalancer is terminating")
        return
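
The control flow above can be tried outside StarCluster with a scripted stand-in for get_stats(). This is only a sketch of the return-code convention (0 for success, 1 for retry, -1 for stop); the constant names, the sleep parameter and the success counter are hypothetical and exist only for illustration:

```python
import time

TRANSIENT_ERROR = 1   # e.g. qhost/qstat/qacct exited non-zero; try again later
FATAL_ERROR = -1      # stats could not be obtained; stop the balancer


def run(get_stats, polling_interval=60, sleep=time.sleep):
    """Poll get_stats() until it reports a fatal error.

    get_stats is any callable returning 0 on success, TRANSIENT_ERROR or
    FATAL_ERROR. Returns the number of successful polls, for illustration.
    """
    successes = 0
    while True:
        ret = get_stats()
        if ret == TRANSIENT_ERROR:
            sleep(polling_interval)  # skip this round and poll again later
            continue
        if ret == FATAL_ERROR:
            return successes
        successes += 1  # a real balancer would add or remove nodes here
        sleep(polling_interval)


# Scripted stand-in for get_stats(): two transient failures, one success,
# then a fatal error that ends the loop.
results = iter([1, 1, 0, -1])
polls = run(lambda: next(results), polling_interval=0)
print(polls)
```

Note that, as written, a transient error is retried indefinitely; whether the balancer should give up after repeated failures is a separate design decision.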
Rayson
=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/
Scalable Grid Engine Support Program
http://www.scalablelogic.com/
________________________________
From: Wei Tao <wei.tao at tsibiocomputing.com>
To: Rayson Ho <raysonlogin at yahoo.com>; Justin Riley <justin.t.riley at gmail.com>
Cc: starcluster at mit.edu
Sent: Tuesday, January 17, 2012 6:21 PM
Subject: Re: [StarCluster] loadbalance error
Hi Rayson & Justin,
Attached please find the crash report generated by loadbalance, along with the output of qhost -xml run on the master node. Hopefully these provide a clue as to what went wrong.
Thanks for the help!
-Wei
On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <raysonlogin at yahoo.com> wrote:
>The XML parser does not like the output of "qhost -xml". (We changed some minor XML output code in Grid Scheduler recently,
>but as you have encountered this before in earlier versions, looks like our changes are not the cause of this issue.)
>
>
>I just started a 1-node cluster and let the loadbalancer add another node, and it all seemed to work fine. From the error message
>in your email, qhost exited with 1, and a number of things can cause qhost to exit with code 1.
>
>
>Can you run the following command from an interactive shell on one of the EC2 nodes when you encounter this problem
>again?
>
>% qhost -xml
>
>And then send us the output. It can be an issue related to how the XML is generated in Grid Engine/Grid Scheduler, or it can be
>something else in the XML parser.
>
>Rayson
>
>________________________________
>From: Wei Tao <wei.tao at tsibiocomputing.com>
>To: starcluster at mit.edu
>Sent: Wednesday, January 11, 2012 10:01 AM
>Subject: [StarCluster] loadbalance error
>
>
>
>Hi all,
>
>I was running loadbalance. After a while, I got the following error. Can someone shed some light on this? This happened before with earlier versions of StarCluster as well.
>
>>>> Loading full job history
>!!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
>Traceback (most recent call last):
> File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
> sc.execute(args)
> File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
> lb.run(cluster)
> File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
> if self.get_stats() == -1:
> File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
> self.stat.parse_qhost(qhostxml)
> File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
> doc = xml.dom.minidom.parseString(string)
> File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
> return expatbuilder.parseString(string)
> File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
> return builder.parseString(string)
> File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
> parser.Parse(string, True)
>ExpatError: syntax error: line 1, column 0
>
>---------------------------------------------------------------------------
>MemoryError Traceback (most recent call last)
>
>/usr/local/bin/starcluster in <module>()
> 7 if __name__ == '__main__':
> 8 sys.exit(
>----> 9 load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
> 10 )
> 11
>
>/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
> 306 logger.configure_sc_logging()
> 307 warn_debug_file_moved()
>--> 308 StarClusterCLI().main()
> 309
> 310 if __name__ == '__main__':
>
>/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
> 283 log.debug(traceback.format_exc())
> 284 print
>--> 285 self.bug_found()
> 286
> 287
>
>/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
> 150 crashfile = open(static.CRASH_FILE, 'w')
> 151 crashfile.write(header % "CRASH DETAILS")
>--> 152 crashfile.write(session.stream.getvalue())
> 153 crashfile.write(header % "SYSTEM INFO")
> 154 crashfile.write("StarCluster: %s\n" % __version__)
>
>/usr/lib/python2.6/StringIO.pyc in getvalue(self)
> 268 """
> 269 if self.buflist:
>--> 270 self.buf += ''.join(self.buflist)
> 271 self.buflist = []
> 272 return self.buf
>
>MemoryError:
>
>Thanks!
>
>-Wei
>_______________________________________________
>StarCluster mailing list
>StarCluster at mit.edu
>http://mailman.mit.edu/mailman/listinfo/starcluster
>
--
Wei Tao, Ph.D.
TSI Biocomputing LLC
617-564-0934