[StarCluster] starcluster Failed to add new host but continues adding hosts when using loadbalance

Ryan Golhar ngsbioinformatics at gmail.com
Thu Jul 18 10:13:21 EDT 2013


Hi all - I'm running starcluster loadbalance and noticed that when it runs
into a problem adding a node, it ignores the problem and continues adding
nodes, which also fail; i.e., the nodes get started but are never added to
the SGE grid. Perhaps a check that everything succeeded would be in order
before adding more nodes:

/opt/sge6
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.018 mins
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/balancers/sge/__init__.py", line 685, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 892, in add_nodes
    self.run_plugins(method_name="on_add_node", node=node)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 1527, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 1552, in run_plugin
    func(*args)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 145, in on_add_node
    self._add_to_sge(node)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 30, in _add_to_sge
    self._inst_sge(node, exec_host=True)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 69, in _inst_sge
    node.ssh.execute(inst_sge, silent=True, only_printable=True)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/sshutils/__init__.py", line 538, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && cd /opt/sge6 && TERM=rxvt ./inst_sge -x -noremote -auto ./ec2_sge.conf' failed with status 1:
Reading configuration from file ./ec2_sge.conf
>>> Sleeping...(looping again in 60 secs)

Execution hosts: 20
Queued jobs: 37
Oldest queued job: 2013-07-04 13:46:50
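
To make the suggestion concrete, here is a rough sketch of the kind of
check I have in mind. It is not StarCluster's actual code or API: the
names add_nodes_checked, add_node and is_registered are hypothetical
stand-ins for "launch and configure one node" and "confirm the node
shows up as an SGE execution host". The idea is simply to stop adding
nodes in the current round as soon as one fails.

```python
def add_nodes_checked(count, add_node, is_registered):
    """Add up to `count` nodes one at a time, stopping at the first failure.

    `add_node` and `is_registered` are hypothetical callables supplied by
    the caller (assumptions for illustration, not StarCluster functions):
      add_node()          -> returns the new node's name, or raises on error
      is_registered(name) -> True if the node is visible to the scheduler
    Returns the list of node names that were added and verified.
    """
    added = []
    for _ in range(count):
        try:
            name = add_node()
        except Exception as exc:
            # A failed add most likely means later adds will fail the
            # same way, so give up for this round instead of continuing.
            print("Failed to add node (%s); not adding more this round" % exc)
            break
        if not is_registered(name):
            # The node started but never joined the grid: also stop here.
            print("Node %s started but is not registered; stopping" % name)
            break
        added.append(name)
    return added
```

With something like this, the balancer would only move on to the next
node once the previous one is confirmed to be in the grid, and the
failed node could then be terminated or retried rather than left
running.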