<div dir="ltr"><div><div><div><div><div>Sorry, I haven't had time to dig deeply into this. Ironically, it just occurred on my cluster too. The load balancer kept rolling through the errors and only lost one node.<br><br></div>From a superficial analysis, StarCluster's add_node code spawns a thread for each new host to set up the /etc/hosts file. If one fails with an exception, the thread is joined and the exception is dumped out somewhere, though not where I'd expect.<br><br></div>My stack trace, unlike yours, included this paramiko error:<br><br>>>> Configuring /etc/hosts on each node<br>No handlers could be found for logger "paramiko.transport" | 0% <br>3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% <br>!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':<br>!!! ERROR - Failed to add new host<br>Traceback (most recent call last):<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/balancers/sge/__init__.py", line 719, in _eval_add_node<br> self._cluster.add_nodes(need_to_add)<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py", line 1042, in add_nodes<br> self.run_plugins(method_name="on_add_node", node=node)<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins<br> self.run_plugin(plug, method_name=method_name, node=node)<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin<br> func(*args)<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/clustersetup.py", line 425, in on_add_node<br> 
self._setup_etc_hosts(nodes)<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/clustersetup.py", line 252, in _setup_etc_hosts<br> self.pool.wait(numtasks=len(nodes))<br> File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/threadpool.py", line 177, in wait<br> "An error occurred in ThreadPool", excs)<br>ThreadPoolException: An error occurred in ThreadPool<br>>>> Sleeping...(looping again in 60 secs)<br><br><br><br>Stack Overflow has a simple suggestion for fixing that:<br></div><a href="http://stackoverflow.com/questions/19152578/no-handlers-could-be-found-for-logger-paramiko">http://stackoverflow.com/questions/19152578/no-handlers-could-be-found-for-logger-paramiko</a><br><br></div>It hasn't recurred.<br><br></div>As for your stack trace logs, I see that you're running Windows, and I don't think I'll be able to solve those problems.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 3, 2015 at 9:19 AM, Avner May <span dir="ltr"><<a href="mailto:avnermay@cs.columbia.edu" target="_blank">avnermay@cs.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Attached are 2 more logs of load balancer crashes.<div><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 2, 2015 at 4:27 PM, Avner May <span dir="ltr"><<a href="mailto:avnermay@cs.columbia.edu" target="_blank">avnermay@cs.columbia.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">1) I am using StarCluster version 0.95.6<div><div>C:\Windows\system32>starcluster addnode mycluster</div><span><div>StarCluster - (<a href="http://star.mit.edu/cluster" target="_blank">http://star.mit.edu/cluster</a>) (<b><u>v. 
0.95.6</u></b>)</div><div>Software Tools for Academics and Researchers (STAR)</div><div>Please submit bug reports to <a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a></div></span></div><div><br></div><div>2) I did not try to override wait_time</div><div><br></div><div>3) SGE plugin is running</div><div><br></div><div>And this particular failure occurred when the load balancer was trying to add nodes.</div><div><br></div><div>Thanks,<br>Avner</div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 2, 2015 at 3:24 PM, Rajat Banerjee <span dir="ltr"><<a href="mailto:rajatb@post.harvard.edu" target="_blank">rajatb@post.harvard.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>The log line you cited:<span><br><div>Traceback (most recent call last):</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py", line 719, in _eval_add_node</div><br><br></span></div>has this, which is puzzling:<br><span><span><a href="http://log.info" target="_blank">log.info</a>("</span>No queued jobs older than <span>%d</span> seconds<span>"</span></span> <span>%</span>
<span>self</span>.longest_allowed_queue_time)<br><a href="https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py" target="_blank">https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py</a><br><br></div>Three questions:<br>1) Are you using an up-to-date version?<br>2) Did you try to override wait_time (aka longest_allowed_queue_time) in your config file or on the load balancer command line? Otherwise it makes very little sense; your stack trace looks like add_node failed, not the load balancer.<br></div>3) Any plugins running? <br></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div>On Tue, Jun 2, 2015 at 3:06 PM, Avner May <span dir="ltr"><<a href="mailto:avnermay@cs.columbia.edu" target="_blank">avnermay@cs.columbia.edu</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div><div dir="ltr">Hi all,<div><br></div><div>I am writing because I have been having a lot of issues with the load balancer. The most common issue is that it fails to remove instances effectively. It goes through the instances it wants to terminate very slowly (this pace is frustrating regardless of whether the operation succeeds), and one by one fails to terminate each one. Then I am forced to kill a subset of the nodes in my cluster manually. But this leaves the scheduler confused about how many nodes are actually in the network, so when I later submit jobs to the cluster again, it thinks it has enough nodes to handle that load and doesn't create new instances. So I am forced to create a ton of dummy jobs (e.g., "<span style="font-family:arial,sans,sans-serif;font-size:13px">qsub -V -b y -cwd hostname"), to trick the scheduler into thinking that it has more queued jobs than "available" machines. 
These issues are quite annoying.</span></div><div><br></div><div>Additionally, just now I had an issue where the load balancer failed to launch a machine:</div><div><br></div><div><div>!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':</div><div>!!! ERROR - Failed to add new host</div><div>Traceback (most recent call last):</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py", line 719, in _eval_add_node</div><div> self._cluster.add_nodes(need_to_add)</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py", line 1042, in add_nodes</div><div> self.run_plugins(method_name="on_add_node", node=node)</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py", line 1690, in run_plugins</div><div> self.run_plugin(plug, method_name=method_name, node=node)</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py", line 1715, in run_plugin</div><div> func(*args)</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py", line 425, in on_add_node</div><div> self._setup_etc_hosts(nodes)</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py", line 252, in _setup_etc_hosts</div><div> self.pool.wait(numtasks=len(nodes))</div><div> File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\threadpool.py", line 177, in wait</div><div> "An error occurred in ThreadPool", excs)</div><div>ThreadPoolException: An error occurred in ThreadPool</div><div>>>> Sleeping...(looping again in 60 secs)</div></div><div><br></div><div>After getting this error, for some reason the load balancer stopped recognizing the existence of the cluster:</div><div><br></div><div><div>C:\Windows\system32>starcluster loadbalance --max_nodes=100 --min_nodes=1 
--add_nodes_per_iter=17 babel2</div><div>StarCluster - (<a href="http://star.mit.edu/cluster" target="_blank">http://star.mit.edu/cluster</a>) (v. 0.95.6)</div><div>Software Tools for Academics and Researchers (STAR)</div><div>Please submit bug reports to <a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a></div><div><br></div><div>!!! ERROR - cluster babel2 is not running</div></div><div><br></div><div>Is anyone else hitting similar issues with the load balancer?</div><div><br></div><div>Thanks,</div><div>Avner</div></div>
<br></div></div>
<br></blockquote></div><br></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div></div></div></div>
</blockquote></div><br></div>
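<div>A minimal sketch of the kind of fix the Stack Overflow link above points to for the "No handlers could be found for logger" message quoted at the top of this thread: attach a handler to the "paramiko" logger before any SSH work starts, so that errors logged by "paramiko.transport" are printed rather than lost. It uses only Python's standard logging module. The helper name attach_paramiko_handler is made up for illustration and is not part of StarCluster or paramiko, and where to call it in your own setup is an assumption.</div>

```python
import logging
import sys


def attach_paramiko_handler(level=logging.WARNING, stream=None):
    """Hypothetical helper: give the "paramiko" logger a stream handler.

    Child loggers such as "paramiko.transport" propagate their records
    to this parent logger, so one handler here covers them as well.
    """
    logger = logging.getLogger("paramiko")
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


if __name__ == "__main__":
    attach_paramiko_handler()
    # Simulated record; in practice paramiko itself would emit these.
    logging.getLogger("paramiko.transport").error("example transport error")
```

<div>Calling the helper once at startup should be enough. This only makes the underlying paramiko error visible; it does not by itself explain why the ThreadPool task failed.</div>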