<div dir="ltr"><div><div><div>Hi Amanda,<br>It looks like StarCluster can no longer communicate with the master node. The error appears because StarCluster failed to run the command 'source /etc/profile && qhost -xml', which returned a 'Connection refused' error. <br><br>Can you paste the output of the following two commands:<br><br></div>> starcluster listclusters (should list the status of all your active clusters and running nodes)<br><br></div>> starcluster sshmaster &lt;your cluster name&gt; (I'm expecting this to fail)<br><br></div>Raj <br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle <span dir="ltr"><<a href="mailto:mandyjoy@mit.edu" target="_blank">mandyjoy@mit.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div style="direction:ltr;font-family:Tahoma;color:#000000;font-size:10pt">Hi,
<br>
<br>
I am trying to run starcluster's loadbalancer to keep only one node running until jobs are submitted to the cluster. I know it's an experimental feature, but I'm wondering if anyone has run into this error before, or has any suggestions. The cluster has been
whittled down to 1 node after a weekend of inactivity, and now it seems that when jobs are submitted to the queue, instead of adding nodes, SGE fails.<br>
<br>
>>> Loading full job history<br>
*** WARNING - Failed to retrieve stats (1/5):<br>
Traceback (most recent call last):<br>
File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats<br>
return self._get_stats()<br>
File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats<br>
qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))<br>
File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute<br>
msg, command, exit_status, out_str)<br>
RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:<br>
error: commlib error: got select error (Connection refused)<br>
error: unable to send message to qmaster using port 63231 on host "master": got send error<br>
<br>
Thanks for any help!<br>
Amanda<br>
</div>
</div>
<br>_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
<br></blockquote></div><br></div>