[StarCluster] Loadbalancer error - node does not exist

Amanda Joy Kedaigle mandyjoy at mit.edu
Sat Oct 4 14:07:46 EDT 2014


Update: It seems the failure might start whenever the cluster reaches its maximum capacity of 16 nodes. Any ideas about what to look for would be appreciated; this is getting expensive.

Amanda

________________________________
From: Amanda Joy Kedaigle
Sent: Thursday, October 02, 2014 12:36 PM
To: starcluster at mit.edu
Subject: Loadbalancer error - node does not exist

Hi, I'm running the Elastic LoadBalancer to keep our cluster down to one node when we're not using it and to ramp up as needed. Generally (i.e. when I run tests and watch it), it works just fine. But twice now it has failed to remove nodes overnight and given the following error, leaving the cluster at full blast with no jobs to run. It says the nodes don't exist, but they show up both in the AWS EC2 console and when I run qhost on the cluster. Any ideas as to the cause? Thanks!


>>> Removing node013 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node013
Traceback (most recent call last):
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 754, in _eval_remove_node
    self._cluster.remove_node(node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1050, in remove_node
    force=force)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1076, in remove_nodes
    reverse=True)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin
    func(*args)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 204, in on_remove_node
    self._remove_from_sge(node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 166, in _remove_from_sge
    master.ssh.execute('qconf -de %s' % node.alias)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 579, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node013' failed with status 1:
denied: execution host "node013" does not exist
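
Since the failing command is `qconf -de node013`, one thing worth checking is whether SGE actually has an execution host registered under exactly that name. The sketch below is not part of StarCluster: `parse_exec_hosts` and `find_unregistered` are hypothetical helper names, and it assumes you have captured the output of `qconf -sel` (the execution host list, one name per line) on the master and pasted it in as a string. It only reports which aliases have no exact match and which match only after ignoring case or a domain suffix; whether a name mismatch is really the cause here is an open question.

```python
def parse_exec_hosts(qconf_sel_output):
    """Return the set of host names from `qconf -sel` output (one per line)."""
    return {line.strip() for line in qconf_sel_output.splitlines() if line.strip()}


def find_unregistered(aliases, qconf_sel_output):
    """Classify node aliases against the registered execution hosts.

    Returns (missing, near_matches):
      missing      - aliases with no match at all
      near_matches - dict of alias -> registered name that matches only
                     after lowercasing and dropping any domain suffix
    """
    registered = parse_exec_hosts(qconf_sel_output)
    # Map normalized short name -> name as SGE has it registered
    normalized = {h.split(".")[0].lower(): h for h in registered}

    missing, near_matches = [], {}
    for alias in aliases:
        if alias in registered:
            continue  # exact match, `qconf -de <alias>` should find it
        key = alias.split(".")[0].lower()
        if key in normalized:
            near_matches[alias] = normalized[key]
        else:
            missing.append(alias)
    return sorted(missing), near_matches


if __name__ == "__main__":
    # Hypothetical sample output, not taken from the cluster in this thread
    sample = "master\nnode001\nNODE012.example.internal\n"
    print(find_unregistered(["node001", "node012", "node013"], sample))
```

If an alias ends up in `missing` while qhost still lists the node, that would suggest the execution host entry was already deleted (for example by an earlier removal attempt) and only the stale queue or host-group entries remain.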