[StarCluster] loadbalance error

Wei Tao wei.tao at tsibiocomputing.com
Tue Jan 17 18:21:33 EST 2012


Hi Rayson & Justin,

Attached please find the crash report generated by loadbalance, along with
the output of qhost -xml run on the master node. Hopefully these provide a
clue as to what went wrong.
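
In case it helps with debugging: the "syntax error: line 1, column 0" in the
traceback quoted below suggests the parser was handed text that does not begin
with XML at all, which would fit qhost exiting with status 1. Below is a
minimal sketch of guarding the parse step. It is not StarCluster's actual code:
parse_qhost_output is a hypothetical helper, the sample XML is a simplified
stand-in for real qhost -xml output, and it is written for Python 3 rather than
the Python 2.6 shown in the traceback.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_qhost_output(output, exit_status):
    """Return a DOM for `qhost -xml` output, or None if it is unusable.

    Hypothetical helper for illustration only; not part of StarCluster.
    """
    # A non-zero exit status means qhost itself failed, so whatever it
    # printed is probably an error message rather than XML.
    if exit_status != 0:
        return None
    text = output.strip()
    # Well-formed XML must start with '<' (declaration or root element).
    if not text.startswith("<"):
        return None
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError:
        # Malformed XML: report "no data" instead of crashing the caller.
        return None


# Simplified stand-in for real `qhost -xml` output (assumed shape).
sample = "<?xml version='1.0'?><qhost><host name='master'/></qhost>"
doc = parse_qhost_output(sample, 0)
hosts = [h.getAttribute("name") for h in doc.getElementsByTagName("host")]

# An error message with exit status 1 is rejected before parsing.
failed = parse_qhost_output("error: could not contact qmaster", 1)
```

With a guard like this the balancer could skip one polling round and retry
later instead of raising out of the main loop, though whether that is the
right behaviour for loadbalance is for the maintainers to judge.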

Thanks for the help!

-Wei

On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <raysonlogin at yahoo.com> wrote:

> The XML parser does not like the output of "qhost -xml". (We changed some
> minor XML output code in Grid Scheduler recently,
> but as you have encountered this before in earlier versions, looks like
> our changes are not the cause of this issue.)
>
>
> I just started a 1 node cluster and let the loadbalancer add another node,
> and it all seemed to work fine...from the error message
> in your email, qhost exited with 1, and a number of things can cause qhost
> to exit with code 1.
>
>
> Can you run the following command from an interactive shell on one of the
> EC2 nodes the next time you encounter this problem?
>
> % qhost -xml
>
> And then send us the output. It can be an issue related to how the XML is
> generated in Grid Engine/Grid Scheduler, or it can be
> something else in the XML parser.
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> ________________________________
> From: Wei Tao <wei.tao at tsibiocomputing.com>
> To: starcluster at mit.edu
> Sent: Wednesday, January 11, 2012 10:01 AM
> Subject: [StarCluster] loadbalance error
>
>
> Hi all,
>
> I was running loadbalance. After a while, I got the following error. Can
> someone shed some light on this? This also happened with earlier versions
> of StarCluster.
>
> >>> Loading full job history
> !!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
>     sc.execute(args)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
>     lb.run(cluster)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
>     if self.get_stats() == -1:
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
>     self.stat.parse_qhost(qhostxml)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
>     doc = xml.dom.minidom.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>     parser.Parse(string, True)
> ExpatError: syntax error: line 1, column 0
>
> ---------------------------------------------------------------------------
> MemoryError                               Traceback (most recent call last)
>
> /usr/local/bin/starcluster in <module>()
>       7 if __name__ == '__main__':
>       8     sys.exit(
> ----> 9         load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
>      10     )
>      11
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
>     306     logger.configure_sc_logging()
>     307     warn_debug_file_moved()
> --> 308     StarClusterCLI().main()
>     309
>     310 if __name__ == '__main__':
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
>     283             log.debug(traceback.format_exc())
>     284             print
> --> 285             self.bug_found()
>     286
>     287
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
>     150         crashfile = open(static.CRASH_FILE, 'w')
>     151         crashfile.write(header % "CRASH DETAILS")
> --> 152         crashfile.write(session.stream.getvalue())
>     153         crashfile.write(header % "SYSTEM INFO")
>     154         crashfile.write("StarCluster: %s\n" % __version__)
>
> /usr/lib/python2.6/StringIO.pyc in getvalue(self)
>     268         """
>     269         if self.buflist:
> --> 270             self.buf += ''.join(self.buflist)
>     271             self.buflist = []
>     272         return self.buf
>
> MemoryError:
>
> Thanks!
>
> -Wei
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>



-- 
Wei Tao, Ph.D.
TSI Biocomputing LLC
617-564-0934
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crash-report-9177.txt.gz
Type: application/x-gzip
Size: 161505 bytes
Desc: not available
Url : http://mailman.mit.edu/pipermail/starcluster/attachments/20120117/ea02f198/attachment-0002.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crash_report_from_qhost_xml.gz
Type: application/x-gzip
Size: 507 bytes
Desc: not available
Url : http://mailman.mit.edu/pipermail/starcluster/attachments/20120117/ea02f198/attachment-0003.bin