[StarCluster] loadbalance error

Tue Jan 17 21:28:19 EST 2012

Hi Wei,

I've identified the issue - and thanks for sending us the log (Justin's crash log tip is helpful!).

So in your log file:

2012-01-17 22:45:35,668 PID: 9177 ssh.py:531 - ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
2012-01-17 22:45:35,669 PID: 9177 ssh.py:536 - DEBUG - error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).

When Grid Engine (Open Grid Scheduler) clients try to talk to the master but get no response, the
communication library prints the "failed receiving gdi request response for mid=1" error
message and the client exits with a non-zero status.

I think a few things could have caused this, such as lost packets, the master being too busy, etc.

To Wei: Quickest fix before the next StarCluster release:

Use shell script wrappers for qstat, qhost, qacct:

% cd /opt/sge6/bin/linux-x64/
% mv qhost qhost.real
... etc

And then write a shell script (installed in place of the original qhost) that calls qhost.real and,
if the exit status is non-zero, sleeps for 10 seconds and tries again. Note that you may want to cap
the number of retries and break out of the loop, so the wrapper does not spin forever if qhost exits
with non-zero every single time.

(Google for shell wrappers and you will see many wrapper examples - should be around 10 lines of 
bash.)
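
A minimal sketch of the retry logic such a wrapper needs, written here in Python rather than bash so that it matches the load balancer code below. The function name run_with_retries and its max_attempts and delay parameters are made up for illustration, and the sketch assumes Python 3.7 or later is available on the master. It captures the output of each attempt and forwards only the last one, so error text from a failed attempt is not mixed into the XML the caller parses.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=5, delay=10.0):
    """Run cmd, retrying while it exits non-zero, up to max_attempts times.

    Only the final attempt's stdout/stderr is forwarded to the caller.
    Returns the exit status of the final attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    result = None
    for attempt in range(1, max_attempts + 1):
        # Capture output instead of letting it go straight to the terminal.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            break
        if attempt < max_attempts:
            time.sleep(delay)  # give the master time to recover

    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    return result.returncode
```

Used as the qhost wrapper, the script would finish by calling run_with_retries(["/opt/sge6/bin/linux-x64/qhost.real"] + sys.argv[1:]) and passing the returned status to sys.exit(). The cap of 5 attempts is an arbitrary choice; the point is only that the loop is bounded.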

To Justin & Rajat: IMO, the quick fix is to handle the error in the load balancer, as communication
errors can occur at any time and the load balancer should handle failures transparently.

I think we should add some code in get_stats() & run(), such that when an error
occurs in qhost, qstat, or qacct, then the error is handled.

In get_stats():

    # Check the exit status of each command before its output is parsed;
    # return 1 so that run() can wait and try again.
    qhostxml = '\n'.join(master.ssh.execute(...))
    if master.ssh.get_last_status() != 0:
        return 1

    qstatxml = '\n'.join(master.ssh.execute(...))
    if master.ssh.get_last_status() != 0:
        return 1

    qacct = '\n'.join(master.ssh.execute(...))
    if master.ssh.get_last_status() != 0:
        return 1

In run():

    # Inside the polling loop:
    ret = self.get_stats()
    if ret == 1:
        # Transient failure (e.g. qhost timed out): wait and poll again.
        time.sleep(self.polling_interval)
        continue

    if ret == -1:
        log.error("Failed to get stats. LoadBalancer is terminating")
        return
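
To make the intended control flow concrete, here is a self-contained sketch of a polling loop that follows the same return-code convention (0 for success, 1 for a transient failure, -1 for a fatal one). It is not StarCluster code: poll, max_consecutive_failures, and max_polls are hypothetical names, and get_stats is any callable standing in for the real method. The cap on consecutive failures is an addition of this sketch, in the same spirit as breaking out of the wrapper loop above.

```python
import time

OK, RETRY, FATAL = 0, 1, -1  # return codes used in the snippets above


def poll(get_stats, polling_interval=60, max_consecutive_failures=10,
         max_polls=None):
    """Call get_stats() repeatedly, tolerating transient failures.

    Stops on a fatal error, after too many consecutive transient failures,
    or after max_polls calls (if given). Returns the number of successful
    polls.
    """
    failures = 0
    successes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        ret = get_stats()
        if ret == FATAL:
            break
        if ret == RETRY:
            failures += 1
            if failures >= max_consecutive_failures:
                break  # the master has been unreachable for too long
        else:
            failures = 0  # a good poll resets the failure streak
            successes += 1
        time.sleep(polling_interval)
    return successes
```

For example, with a stub that returns 1, 1, 0, 1, 0, -1 on successive calls, poll(stub, polling_interval=0, max_consecutive_failures=3) makes six calls and reports two successful polls.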

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

________________________________
From: Wei Tao <wei.tao at tsibiocomputing.com>
To: Rayson Ho <raysonlogin at yahoo.com>; Justin Riley <justin.t.riley at gmail.com> 
Cc: starcluster at mit.edu 
Sent: Tuesday, January 17, 2012 6:21 PM
Subject: Re: [StarCluster] loadbalance error

Hi Rayson & Justin,

Attached please find the crash report generated by loadbalance, along with the output of qhost -xml run on the master node. Hopefully these provide a clue as to what went wrong.

Thanks for the help!

-Wei

On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <raysonlogin at yahoo.com> wrote:

>The XML parser does not like the output of "qhost -xml". (We changed some minor XML output code in Grid Scheduler recently,
>but as you have encountered this before in earlier versions, it looks like our changes are not the cause of this issue.)
>
>
>I just started a 1-node cluster and let the load balancer add another node, and it all seemed to work fine. From the error message
>in your email, qhost exited with status 1, and a number of things can cause qhost to exit with code 1.
>
>
>Can you run the following command from an interactive shell on one of the EC2 nodes when you encounter this problem
>again?
>
>% qhost -xml
>
>And then send us the output. It can be an issue related to how the XML is generated in Grid Engine/Grid Scheduler, or it can be
>something else in the XML parser.
>
>Rayson
>
>________________________________
>From: Wei Tao <wei.tao at tsibiocomputing.com>
>To: starcluster at mit.edu
>Sent: Wednesday, January 11, 2012 10:01 AM
>Subject: [StarCluster] loadbalance error
>
>
>
>Hi all,
>
>I was running loadbalance. After a while, I got the following error. Can someone shed some light on this? This happened before with earlier versions of StarCluster as well.
>
>>>> Loading full job history
>!!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
>Traceback (most recent call last):
>  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
>    sc.execute(args)
>  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
>    lb.run(cluster)
>  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
>    if self.get_stats() == -1:
>  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
>    self.stat.parse_qhost(qhostxml)
>  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
>    doc = xml.dom.minidom.parseString(string)
>  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>    return expatbuilder.parseString(string)
>  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>    return builder.parseString(string)
>  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>    parser.Parse(string, True)
>ExpatError: syntax error: line 1, column 0
>
>---------------------------------------------------------------------------
>MemoryError                               Traceback (most recent call last)
>
>/usr/local/bin/starcluster in <module>()
>      7 if __name__ == '__main__':
>      8     sys.exit(
>----> 9         load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
>     10     )
>     11 
>
>/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
>    306     logger.configure_sc_logging()
>    307     warn_debug_file_moved()
>--> 308     StarClusterCLI().main()
>    309 
>    310 if __name__ == '__main__':
>
>/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
>    283             log.debug(traceback.format_exc())
>    284             print
>--> 285             self.bug_found()
>    286 
>    287 
>
>/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
>    150         crashfile = open(static.CRASH_FILE, 'w')
>    151         crashfile.write(header % "CRASH DETAILS")
>--> 152         crashfile.write(session.stream.getvalue())
>    153         crashfile.write(header % "SYSTEM INFO")
>    154         crashfile.write("StarCluster: %s\n" % __version__)
>
>/usr/lib/python2.6/StringIO.pyc in getvalue(self)
>    268         """
>    269         if self.buflist:
>--> 270             self.buf += ''.join(self.buflist)
>    271             self.buflist = []
>    272         return self.buf
>
>MemoryError: 
>
>Thanks!
>
>-Wei

-- 
Wei Tao, Ph.D.
TSI Biocomputing LLC
617-564-0934