<div dir="ltr"><div style>Hi all - I'm running starcluster loadbalance and noticed that when it runs into a problem adding a node, it ignores the problem and continues adding nodes, which also fail; i.e., the nodes get started but are never added to the SGE grid. Perhaps a check that the previous add succeeded would be in order before adding more nodes:</div>
<div><br></div><div>/opt/sge6</div><div>>>> Mounting all NFS export path(s) on 1 worker node(s)</div><div>1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Setting up NFS took 0.018 mins</div>
<div>!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':</div><div>!!! ERROR - Failed to add new host</div><div>Traceback (most recent call last):</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/balancers/sge/__init__.py", line 685, in _eval_add_node</div>
<div> self._cluster.add_nodes(need_to_add)</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 892, in add_nodes</div><div> self.run_plugins(method_name="on_add_node", node=node)</div>
<div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 1527, in run_plugins</div><div> self.run_plugin(plug, method_name=method_name, node=node)</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 1552, in run_plugin</div>
<div> func(*args)</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 145, in on_add_node</div><div> self._add_to_sge(node)</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 30, in _add_to_sge</div>
<div> self._inst_sge(node, exec_host=True)</div><div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 69, in _inst_sge</div><div> node.ssh.execute(inst_sge, silent=True, only_printable=True)</div>
<div> File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/sshutils/__init__.py", line 538, in execute</div><div> msg, command, exit_status, out_str)</div><div>RemoteCommandFailed: remote command 'source /etc/profile && cd /opt/sge6 && TERM=rxvt ./inst_sge -x -noremote -auto ./ec2_sge.conf' failed with status 1:</div>
<div>Reading configuration from file ./ec2_sge.conf</div><div>[H[2J</div><div>>>> Sleeping...(looping again in 60 secs)</div><div><br></div><div>Execution hosts: 20</div><div>Queued jobs: 37</div><div>Oldest queued job: 2013-07-04 13:46:50</div>
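<div><br></div><div>To make the suggestion concrete, here is a minimal sketch of the kind of check meant above: add nodes one at a time and stop as soon as one fails or does not show up in the grid. This is not StarCluster's actual code; <code>add_nodes_with_check</code>, <code>add_node</code> and <code>is_registered</code> are hypothetical names standing in for whatever starts a node and verifies that it joined SGE.</div>

```python
def add_nodes_with_check(aliases, add_node, is_registered):
    """Add nodes one at a time, stopping at the first failure.

    add_node and is_registered are hypothetical callables (assumptions,
    not StarCluster API): add_node(alias) starts and configures a node,
    is_registered(alias) reports whether it joined the SGE grid.

    Returns (added, failed): the aliases added successfully, and the
    alias that failed (or None if every node was added).
    """
    added = []
    for alias in aliases:
        try:
            add_node(alias)
        except Exception as exc:
            # Stop here instead of launching more nodes that would also fail.
            print("Failed to add %s: %s" % (alias, exc))
            return added, alias
        if not is_registered(alias):
            # The node started but never became an execution host.
            print("%s started but is not registered with SGE" % alias)
            return added, alias
        added.append(alias)
    return added, None


# Illustrative stand-in: the third node fails the way the log above does.
def fake_add_node(alias):
    if alias == "node003":
        raise RuntimeError("Failed to add new host")


added, failed = add_nodes_with_check(
    ["node001", "node002", "node003", "node004"],
    fake_add_node,
    lambda alias: True,
)
print(added, failed)
```

<div>With a check like this, the balancer would stop at the first bad node (node004 is never attempted in the example), so the failure could be reported or cleaned up before the next polling interval rather than repeated for every new node.</div>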
<div><br></div></div>