Thank you, Rayson. I will watch for the error and, if it happens again, run the command you suggested.<div><br></div><div>On another note, I ran into another odd behavior in loadbalance. It seems that when loadbalance removes a node, there is a timing gap between when the node is marked for removal and when it actually becomes unavailable for job submission. In most cases this works fine, but today a job was submitted to a node *after* loadbalance had marked it for removal. The node was terminated by loadbalance, but the job had been submitted to it before it was killed, and that job now shows up on that node in qstat in the "auo" state. When I tried to remove the node again by explicitly using the removenode command, it failed because the node is no longer there. I understand that loadbalance is still experimental, but it seems like a good idea to tighten the timing so that a node is off limits to further job submission from the moment loadbalance marks it for removal. Any gap may have unintended side effects.</div>
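<div><br></div><div>To make the suggestion concrete, here is a minimal sketch of the ordering I have in mind. It is not how loadbalance is actually implemented. It assumes the node's queue instance is named all.q@ALIAS and that it can be disabled with Grid Engine's "qmod -d" before the instance is terminated. The function name and the jobs_on_node and terminate callables are hypothetical placeholders for whatever StarCluster uses internally.</div>

```python
import subprocess


def queue_instance(node_alias, queue="all.q"):
    """Name of the queue instance on one node (assumes the queue is all.q)."""
    return "%s@%s" % (queue, node_alias)


def drain_then_remove(node_alias, jobs_on_node, terminate,
                      run=subprocess.check_call, queue="all.q"):
    """Close the node to new jobs first, then remove it only if it is idle.

    jobs_on_node and terminate are hypothetical callables supplied by the
    caller; run executes a command given as a list of arguments.
    Returns True if the node was terminated, False if it was kept.
    """
    target = queue_instance(node_alias, queue)
    # Step 1: disable the queue instance so the scheduler stops
    # dispatching new jobs to this node.
    run(["qmod", "-d", target])
    # Step 2: re-check for jobs that slipped in before the disable.
    if jobs_on_node(node_alias):
        # A job arrived in the gap: re-enable the queue and keep the node.
        run(["qmod", "-e", target])
        return False
    # Step 3: the node is closed and idle, so it is safe to terminate.
    terminate(node_alias)
    return True
```

<div>The point is only the order of events: disable the queue, check again for jobs, and terminate last, so that a job arriving in the gap keeps the node alive instead of being orphaned.</div>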
<div><br></div><div>Thanks!</div><div><br></div><div>-Wei </div><div><br><br><div class="gmail_quote">On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <span dir="ltr"><<a href="mailto:raysonlogin@yahoo.com">raysonlogin@yahoo.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The XML parser does not like the output of "qhost -xml". (We changed some minor XML output code in Grid Scheduler recently,<br>
but as you have encountered this before in earlier versions, looks like our changes are not the cause of this issue.)<br>
<br>
<br>
I just started a 1 node cluster and let the loadbalancer add another node, and it all seemed to work fine...from the error message<br>
in your email, qhost exited with 1, and a number of things can cause qhost to exit with code 1.<br>
<br>
<br>
Can you run the following command from an interactive shell on one of the nodes on EC2 when you encounter this problem<br>
again?<br>
<br>
% qhost -xml<br>
<br>
And then send us the output. It can be an issue related to how the XML is generated in Grid Engine/Grid Scheduler, or it can be<br>
something else in the XML parser.<br>
<br>
Rayson<br>
<br>
=================================<br>
Open Grid Scheduler / Grid Engine<br>
<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
<br>
Scalable Grid Engine Support Program<br>
<a href="http://www.scalablelogic.com/" target="_blank">http://www.scalablelogic.com/</a><br>
<br>
<br>
<br>
________________________________<br>
From: Wei Tao <<a href="mailto:wei.tao@tsibiocomputing.com">wei.tao@tsibiocomputing.com</a>><br>
To: <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a><br>
Sent: Wednesday, January 11, 2012 10:01 AM<br>
Subject: [StarCluster] loadbalance error<br>
<div><div class="h5"><br>
<br>
Hi all,<br>
<br>
I was running loadbalance. After a while, I got the following error. Can someone shed some light on this? This happened before with earlier versions of StarCluster as well.<br>
<br>
>>> Loading full job history<br>
!!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1<br>
Traceback (most recent call last):<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main<br>
sc.execute(args)<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute<br>
lb.run(cluster)<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run<br>
if self.get_stats() == -1:<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats<br>
self.stat.parse_qhost(qhostxml)<br>
File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost<br>
doc = xml.dom.minidom.parseString(string)<br>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString<br>
return expatbuilder.parseString(string)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString<br>
return builder.parseString(string)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString<br>
parser.Parse(string, True)<br>
ExpatError: syntax error: line 1, column 0<br>
<br>
---------------------------------------------------------------------------<br>
MemoryError Traceback (most recent call last)<br>
<br>
/usr/local/bin/starcluster in <module>()<br>
7 if __name__ == '__main__':<br>
8 sys.exit(<br>
----> 9 load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()<br>
10 )<br>
11 <br>
<br>
/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()<br>
306 logger.configure_sc_logging()<br>
307 warn_debug_file_moved()<br>
--> 308 StarClusterCLI().main()<br>
309 <br>
310 if __name__ == '__main__':<br>
<br>
/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)<br>
283 log.debug(traceback.format_exc())<br>
284 print<br>
--> 285 self.bug_found()<br>
286 <br>
287 <br>
<br>
/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)<br>
150 crashfile = open(static.CRASH_FILE, 'w')<br>
151 crashfile.write(header % "CRASH DETAILS")<br>
--> 152 crashfile.write(session.stream.getvalue())<br>
153 crashfile.write(header % "SYSTEM INFO")<br>
154 crashfile.write("StarCluster: %s\n" % __version__)<br>
<br>
/usr/lib/python2.6/StringIO.pyc in getvalue(self)<br>
268 """<br>
269 if self.buflist:<br>
--> 270 self.buf += ''.join(self.buflist)<br>
271 self.buflist = []<br>
272 return self.buf<br>
<br>
MemoryError: <br>
<br>
Thanks!<br>
<br>
-Wei<br>
</div></div>_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>Wei Tao, Ph.D.<br>TSI Biocomputing LLC<br>617-564-0934<br>
</div>