[StarCluster] loadbalance error

Rajat Banerjee rajatb at post.harvard.edu
Wed Jan 11 18:27:17 EST 2012


I don't think it is a memory leak within the load balancer. ELB does not
endlessly accumulate hosts; see the first few lines of parse_qhost:

def parse_qhost(self, string):
    """
    this function parses qhost -xml output and makes a neat array
    takes in a string, so we can pipe in output from ssh.exec('qhost -xml')
    """
    self.hosts = []  # clear the old hosts

https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py

Looks to be an XML parser issue. Maybe ELB is using the parser incorrectly.
How large is the qhost -xml output, really? I would expect roughly one XML
record per host in the cluster, though each record does carry many
parameters about the status of the host.
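
Incidentally, the "ExpatError: syntax error: line 1, column 0" in the
traceback below is what minidom raises when the input is not XML at all,
for example an empty string or an error message captured from a failed
qhost run. A minimal sketch (illustrative only; the bogus strings are
made up, not real qhost output):

import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Non-XML input fails immediately at line 1, column 0, which matches
# the traceback: qhost exited with status 1, so whatever was captured
# on stdout was not a qhost XML document.
for bogus in ("", "error: unable to contact qmaster\n"):
    try:
        xml.dom.minidom.parseString(bogus)
    except ExpatError as e:
        print("parse failed: %s" % e)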

On Wed, Jan 11, 2012 at 5:09 PM, <starcluster-request at mit.edu> wrote:

> Today's Topics:
>
>   1. Re: loadbalance error (Rayson Ho)
>   2. Re: loadbalance error (Wei Tao)
>   3. Re: loadbalance error (Rayson Ho)
>
>
> ---------- Forwarded message ----------
> From: Rayson Ho <raysonlogin at yahoo.com>
> To: Wei Tao <wei.tao at tsibiocomputing.com>, "starcluster at mit.edu" <
> starcluster at mit.edu>
> Cc:
> Date: Wed, 11 Jan 2012 10:39:18 -0800 (PST)
> Subject: Re: [StarCluster] loadbalance error
> The XML parser does not like the output of "qhost -xml". (We changed some
> minor XML output code in Grid Scheduler recently, but since you have
> encountered this before in earlier versions, it looks like our changes are
> not the cause of this issue.)
>
>
> I just started a 1-node cluster and let the load balancer add another node,
> and it all seemed to work fine. From the error message in your email, qhost
> exited with status 1, and a number of things can cause qhost to exit with
> code 1.
>
>
> Can you run the following command from an interactive shell on one of the
> EC2 nodes when you encounter this problem again?
>
> % qhost -xml
>
> And then send us the output. The issue may be in how the XML is generated
> by Grid Engine/Grid Scheduler, or it may be something else in the XML
> parser.
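
A guard on the exit status would also make this failure mode clearer. A
sketch, assuming the output is fetched over SSH (run_remote() is a
hypothetical stand-in for whatever exec call ELB actually uses, returning
an (exit_status, output) pair):

import xml.dom.minidom

def fetch_qhost_xml(node):
    # run_remote() is hypothetical, not the actual ELB helper.
    status, output = run_remote(node, "source /etc/profile && qhost -xml")
    if status != 0:
        # Fail loudly instead of handing garbage to the XML parser.
        raise RuntimeError("qhost -xml failed (%d): %s" % (status, output))
    return xml.dom.minidom.parseString(output)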
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> ________________________________
> From: Wei Tao <wei.tao at tsibiocomputing.com>
> To: starcluster at mit.edu
> Sent: Wednesday, January 11, 2012 10:01 AM
> Subject: [StarCluster] loadbalance error
>
>
> Hi all,
>
> I was running loadbalance. After a while, I got the following error. Can
> someone shed some light on this? This happened with earlier versions of
> StarCluster as well.
>
> >>> Loading full job history
> !!! ERROR - command 'source /etc/profile && qhost -xml' failed with status
> 1
> Traceback (most recent call last):
>   File
> "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py",
> line 251, in main
>     sc.execute(args)
>   File
> "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py",
> line 89, in execute
>     lb.run(cluster)
>   File
> "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py",
> line 583, in run
>     if self.get_stats() == -1:
>   File
> "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py",
> line 529, in get_stats
>     self.stat.parse_qhost(qhostxml)
>   File
> "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py",
> line 49, in parse_qhost
>     doc = xml.dom.minidom.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in
> parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in
> parseString
>     parser.Parse(string, True)
> ExpatError: syntax error: line 1, column 0
>
> ---------------------------------------------------------------------------
> MemoryError                               Traceback (most recent call last)
>
> /usr/local/bin/starcluster in <module>()
>       7 if __name__ == '__main__':
>       8     sys.exit(
> ----> 9         load_entry_point('StarCluster==0.93', 'console_scripts',
> 'starcluster')()
>      10     )
>      11
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc
> in main()
>     306     logger.configure_sc_logging()
>     307     warn_debug_file_moved()
> --> 308     StarClusterCLI().main()
>     309
>     310 if __name__ == '__main__':
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc
> in main(self)
>     283             log.debug(traceback.format_exc())
>     284             print
> --> 285             self.bug_found()
>     286
>     287
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc
> in bug_found(self)
>     150         crashfile = open(static.CRASH_FILE, 'w')
>     151         crashfile.write(header % "CRASH DETAILS")
> --> 152         crashfile.write(session.stream.getvalue())
>     153         crashfile.write(header % "SYSTEM INFO")
>     154         crashfile.write("StarCluster: %s\n" % __version__)
>
> /usr/lib/python2.6/StringIO.pyc in getvalue(self)
>     268         """
>     269         if self.buflist:
> --> 270             self.buf += ''.join(self.buflist)
>     271             self.buflist = []
>     272         return self.buf
>
> MemoryError:
>
> Thanks!
>
> -Wei
>
>
>
>
> ---------- Forwarded message ----------
> From: Wei Tao <wei.tao at tsibiocomputing.com>
> To: Rayson Ho <raysonlogin at yahoo.com>
> Cc: "starcluster at mit.edu" <starcluster at mit.edu>
> Date: Wed, 11 Jan 2012 15:54:52 -0500
> Subject: Re: [StarCluster] loadbalance error
> Thank you, Rayson. I will watch out if the error happens again and run the
> command that you suggested.
>
> On the other hand, I ran into some odd behavior in loadbalance. When
> loadbalance attempts to remove nodes, there seems to be a timing gap
> between when a node is marked for removal and when it actually becomes
> inaccessible to job submission. In most cases this works fine, but today a
> job was submitted to a node *after* it was marked for removal by
> loadbalance. The node was then terminated by loadbalance, but because the
> job reached it before it was killed, the job now shows up on that node in
> qstat in the "auo" state. When I tried to remove that node again with an
> explicit removenode command, it failed because the node was no longer
> there. I understand that loadbalance is still experimental, but it seems a
> good idea to tighten the timing of events so that a node is off limits to
> further job submission at the exact moment it is marked for removal. Any
> gap may have unintended side effects.
>
> Thanks!
>
> -Wei
>
>
> --
> Wei Tao, Ph.D.
> TSI Biocomputing LLC
> 617-564-0934
>
>
> ---------- Forwarded message ----------
> From: Rayson Ho <raysonlogin at yahoo.com>
> To: Wei Tao <wei.tao at tsibiocomputing.com>
> Cc: "starcluster at mit.edu" <starcluster at mit.edu>
> Date: Wed, 11 Jan 2012 14:09:36 -0800 (PST)
> Subject: Re: [StarCluster] loadbalance error
> I think it is a synchronization issue that could be handled by having the
> load balancer disable nodes before removing them, but there are other ways
> to handle this as well, for example configuring Grid Engine to rerun jobs
> automatically.
>
> Currently, the load balancer code checks whether a node is running jobs
> (in the is_node_working() function), and if nothing is running on the
> node, the load balancer removes the node by calling remove_node().
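
For reference, a minimal sketch of the disable-first idea, assuming
standard SGE tooling (qmod -d disables queue instances, so the scheduler
stops placing jobs on the host; node.ssh.execute(), is_node_working(), and
remove_node() are stand-ins here, not the actual ELB code):

def remove_node_safely(node):
    # Disable every queue instance on the host first, so SGE stops
    # scheduling new jobs there before the instance is terminated.
    node.ssh.execute("qmod -d '*@%s'" % node.alias)
    # Only terminate once nothing is running on the node.
    if not is_node_working(node):
        remove_node(node)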
>
> This morning was the first time I tried the load balancer, and I've only
> spent a quick 10 minutes looking at the balancer code... others may have
> things to add and/or other suggestions.
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>