[StarCluster] Loadbalancer error - node does not exist
Jacob Barhak
jacob.barhak at gmail.com
Sat Oct 4 23:40:30 EDT 2014
Hi Amanda,
Did you check the I/O load while your application is running?
In the past, I had too much traffic to the NFS share because of a large number of files, and it slowed things down considerably.
This may not be your issue, but it is worth checking anyway.
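A minimal sketch of one way to check this, assuming a Linux node where /proc/stat is readable. It samples the aggregate CPU counters twice and reports the share of time spent in iowait. The function names are hypothetical, and a high iowait share only hints at an I/O bottleneck; it does not prove that NFS is the cause.

```python
import time


def cpu_times(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two /proc/stat samples."""
    a, b = cpu_times(before), cpu_times(after)
    # Use the first eight fields only: user nice system idle iowait irq
    # softirq steal. Guest time is already counted inside user time.
    deltas = [y - x for x, y in zip(a, b)][:8]
    total = sum(deltas)
    return float(deltas[4]) / total if total else 0.0


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart (assumed default)."""
    with open(path) as f:
        first = f.read()
    time.sleep(interval)
    with open(path) as f:
        second = f.read()
    return iowait_fraction(first, second)
```

Calling sample_iowait() on a compute node while jobs are running gives a rough number to compare against an idle node.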
I hope you resolve your issue regardless.
Jacob
On Oct 4, 2014, at 1:07 PM, Amanda Joy Kedaigle <mandyjoy at mit.edu> wrote:
> Update: It seems the failure may start whenever the cluster reaches its maximum capacity of 16 nodes. Any ideas about what to look for would be appreciated; this is getting expensive.
>
> Amanda
>
> From: Amanda Joy Kedaigle
> Sent: Thursday, October 02, 2014 12:36 PM
> To: starcluster at mit.edu
> Subject: Loadbalancer error - node does not exist
>
> Hi, I'm running the Elastic Load Balancer to keep our cluster down to one node when we're not using it and to ramp it up as needed. Generally (i.e., when I run tests and watch it), it works just fine. But twice now, it has failed to remove nodes overnight and given the following error, leaving the cluster running at full capacity with no jobs to run. The error says the nodes don't exist, but they appear both in the AWS EC2 console and in the qhost output on the cluster. Any ideas as to the cause? Thanks!
>
> >>> Removing node013 from SGE
> !!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
> !!! ERROR - Failed to remove node node013
> Traceback (most recent call last):
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 754, in _eval_remove_node
>     self._cluster.remove_node(node)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1050, in remove_node
>     force=force)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1076, in remove_nodes
>     reverse=True)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins
>     self.run_plugin(plug, method_name=method_name, node=node)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin
>     func(*args)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 204, in on_remove_node
>     self._remove_from_sge(node)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 166, in _remove_from_sge
>     master.ssh.execute('qconf -de %s' % node.alias)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 579, in execute
>     msg, command, exit_status, out_str)
> RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node013' failed with status 1:
> denied: execution host "node013" does not exist
>
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
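The traceback in the quoted message shows `qconf -de node013` failing because SGE does not list node013 as an execution host. As a rough diagnostic sketch, the following compares a list of node aliases against the output of `qconf -sel` (which prints the execution host list) to show which aliases SGE does not know about. The function names are hypothetical, the name normalization (dropping the domain suffix, lowercasing) is an assumption about how the hosts are registered, and the script assumes it runs on the master with `qconf` on the PATH. It does not explain the cause of the error.

```python
import subprocess


def short_name(host):
    """Drop any domain suffix and lowercase the host name (assumed convention)."""
    return host.strip().split(".")[0].lower()


def parse_qconf_sel(text):
    """Parse `qconf -sel` output, which lists one execution host per line."""
    return {short_name(line) for line in text.splitlines() if line.strip()}


def unknown_to_sge(aliases, qconf_sel_text):
    """Return the aliases that do not appear in the execution host list."""
    known = parse_qconf_sel(qconf_sel_text)
    return sorted(a for a in aliases if short_name(a) not in known)


def run_check(aliases):
    """Query SGE on this machine and report aliases it does not list."""
    out = subprocess.check_output(["qconf", "-sel"], universal_newlines=True)
    return unknown_to_sge(aliases, out)
```

For example, run_check(["node012", "node013"]) on the master would return the aliases missing from the execution host list at that moment, which can be logged just before a removal is attempted.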