Hi Rayson & Justin,<div><br></div><div>Attached are the crash report generated by loadbalance and the output of qhost -xml run on the master node. Hopefully these provide a clue as to what went wrong.<div>
<br></div><div>Thanks for the help!</div><div><br></div><div>-Wei<br><br><div class="gmail_quote">On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <span dir="ltr"><<a href="mailto:raysonlogin@yahoo.com">raysonlogin@yahoo.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The XML parser is failing on the output of "qhost -xml". (We recently made some minor changes to the XML output code in Grid Scheduler,<br>
but since you have encountered this before in earlier versions, it looks like our changes are not the cause of this issue.)<br>
<br>
<br>
I just started a 1-node cluster and let the loadbalancer add another node, and it all seemed to work fine. From the error message<br>
in your email, qhost exited with status 1, and a number of things can cause qhost to exit with that code.<br>
<br>
<br>
Can you run the following command from an interactive shell on one of the nodes on EC2 when you encounter this problem<br>
again?<br>
<br>
% qhost -xml<br>
<br>
Then send us the output. It could be an issue related to how the XML is generated in Grid Engine/Grid Scheduler, or it could be<br>
something else in the XML parser.<br>
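<br>
To illustrate the two failure points above, here is a minimal sketch of how a caller could check qhost's exit status before parsing, and treat unparseable output as a failed poll instead of crashing. This is not StarCluster's actual code: the function name parse_qhost_hosts is hypothetical, and the layout of "host" elements carrying a "name" attribute is an assumption about the qhost -xml output.<br>
<br>
```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_qhost_hosts(exit_status, output):
    """Return host names found in `qhost -xml` output, or None on failure.

    Hypothetical helper for illustration only; not part of StarCluster.
    """
    # A non-zero exit status suggests the text is an error message rather
    # than XML, so it is not handed to the parser at all.
    if exit_status != 0:
        return None
    try:
        doc = xml.dom.minidom.parseString(output)
    except ExpatError:
        # Empty or non-XML text fails at line 1, column 0, as in the
        # traceback reported in this thread.
        return None
    # Assumption: each host is a "host" element with a "name" attribute.
    hosts = doc.getElementsByTagName("host")
    return [host.getAttribute("name") for host in hosts]
```
<br>
A caller could treat a None result as "no stats this round" and retry on the next polling interval.<br>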
<br>
Rayson<br>
<br>
=================================<br>
Open Grid Scheduler / Grid Engine<br>
<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
<br>
Scalable Grid Engine Support Program<br>
<a href="http://www.scalablelogic.com/" target="_blank">http://www.scalablelogic.com/</a><br>
<br>
<br>
<br>
________________________________<br>
From: Wei Tao <<a href="mailto:wei.tao@tsibiocomputing.com">wei.tao@tsibiocomputing.com</a>><br>
To: <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a><br>
Sent: Wednesday, January 11, 2012 10:01 AM<br>
Subject: [StarCluster] loadbalance error<br>
<div><div class="h5"><br>
<br>
Hi all,<br>
<br>
I was running loadbalance. After a while, I got the following error. Can someone shed some light on this? It has also happened with earlier versions of StarCluster.<br>
<br>
>>> Loading full job history<br>
!!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1<br>
Traceback (most recent call last):<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main<br>
sc.execute(args)<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute<br>
lb.run(cluster)<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run<br>
if self.get_stats() == -1:<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats<br>
self.stat.parse_qhost(qhostxml)<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost<br>
doc = xml.dom.minidom.parseString(string)<br>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString<br>
return expatbuilder.parseString(string)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString<br>
return builder.parseString(string)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString<br>
parser.Parse(string, True)<br>
ExpatError: syntax error: line 1, column 0<br>
<br>
---------------------------------------------------------------------------<br>
MemoryError Traceback (most recent call last)<br>
<br>
/usr/local/bin/starcluster in <module>()<br>
7 if __name__ == '__main__':<br>
8 sys.exit(<br>
----> 9 load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()<br>
10 )<br>
11 <br>
<br>
/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()<br>
306 logger.configure_sc_logging()<br>
307 warn_debug_file_moved()<br>
--> 308 StarClusterCLI().main()<br>
309 <br>
310 if __name__ == '__main__':<br>
<br>
/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)<br>
283 log.debug(traceback.format_exc())<br>
284 print<br>
--> 285 self.bug_found()<br>
286 <br>
287 <br>
<br>
/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)<br>
150 crashfile = open(static.CRASH_FILE, 'w')<br>
151 crashfile.write(header % "CRASH DETAILS")<br>
--> 152 crashfile.write(session.stream.getvalue())<br>
153 crashfile.write(header % "SYSTEM INFO")<br>
154 crashfile.write("StarCluster: %s\n" % __version__)<br>
<br>
/usr/lib/python2.6/StringIO.pyc in getvalue(self)<br>
268 """<br>
269 if self.buflist:<br>
--> 270 self.buf += ''.join(self.buflist)<br>
271 self.buflist = []<br>
272 return self.buf<br>
<br>
MemoryError: <br>
<br>
Thanks!<br>
<br>
-Wei<br>
</div></div>_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Wei Tao, Ph.D.<br>TSI Biocomputing LLC<br>617-564-0934<br>
</div></div>