[StarCluster] Load Balancer issues

Avner May avnermay at cs.columbia.edu
Thu Jun 4 23:21:59 EDT 2015


*>> As for your stack trace logs, I see that you're running Windows and I
think I'll never be able to solve your problems.*
hahahaha yes guilty as charged.

I have also gotten that "paramiko" error.
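
For anyone else hitting it: the "No handlers could be found for logger"
line is the warning Python 2's logging module prints when a logger emits a
record and no handler is configured for it. The warning only says that a
log record had nowhere to go, so attaching a handler should both quiet it
and show what paramiko was trying to report. Below is a minimal sketch
using only the standard logging module. The function name
attach_paramiko_handler is made up for illustration, and where such a call
would have to live so that it runs before StarCluster opens its SSH
connections is an assumption that this thread does not settle.

```python
import logging


def attach_paramiko_handler(level=logging.WARNING):
    """Give the "paramiko" logger a handler (hypothetical helper).

    Records from child loggers such as "paramiko.transport" propagate
    up to "paramiko", so one handler here covers them as well.
    """
    logger = logging.getLogger("paramiko")
    if not logger.handlers:  # avoid stacking duplicate handlers
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


paramiko_logger = attach_paramiko_handler()
```

Calling it a second time only resets the level; the handler is attached
once.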

On Thu, Jun 4, 2015 at 7:05 PM, Rajat Banerjee <rajatb at post.harvard.edu>
wrote:

> Sorry, I haven't had time to dig deeply into this. Ironically, it just
> occurred on my cluster too. The load balancer kept rolling through the
> errors and only lost one node.
>
> From a superficial analysis, StarCluster's add_node code spawns a thread
> for each new host to set up the /etc/hosts file. If one fails with an
> exception, the thread is joined and the exception is dumped out somewhere,
> though not where I'd expect.
>
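
The behaviour described above (one thread per host, failures collected and
reported only once the threads are joined) can be sketched as follows. This
is not StarCluster's threadpool.py: the ThreadPoolException name and its
message are borrowed from the traceback, while run_on_each and everything
inside it are hypothetical stand-ins written for illustration.

```python
import threading


class ThreadPoolException(Exception):
    """Carries every (host, exception) pair collected from the workers."""

    def __init__(self, msg, exceptions):
        super(ThreadPoolException, self).__init__(msg)
        self.exceptions = exceptions


def run_on_each(hosts, task):
    """Run task(host) in one thread per host (hypothetical helper).

    A failure in one thread does not stop the others. Errors are stored
    and raised together only after all threads have been joined.
    """
    errors = []
    lock = threading.Lock()

    def worker(host):
        try:
            task(host)
        except Exception as exc:
            with lock:
                errors.append((host, exc))

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise ThreadPoolException("An error occurred in ThreadPool", errors)
```

With a pattern like this the outer traceback only names the wait/join step.
The per-host causes sit in the exceptions attribute, which would explain
why the traceback itself says so little about what went wrong.
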
> In my stack trace but not yours, it had this paramiko error:
>
> >>> Configuring /etc/hosts on each node
> No handlers could be found for logger "paramiko.transport"             |
> 0%
> 3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> !!! ERROR - Error occured while running plugin
> 'starcluster.clustersetup.DefaultClusterSetup':
> !!! ERROR - Failed to add new host
> Traceback (most recent call last):
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/balancers/sge/__init__.py",
> line 719, in _eval_add_node
>     self._cluster.add_nodes(need_to_add)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py",
> line 1042, in add_nodes
>     self.run_plugins(method_name="on_add_node", node=node)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py",
> line 1690, in run_plugins
>     self.run_plugin(plug, method_name=method_name, node=node)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py",
> line 1715, in run_plugin
>     func(*args)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/clustersetup.py",
> line 425, in on_add_node
>     self._setup_etc_hosts(nodes)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/clustersetup.py",
> line 252, in _setup_etc_hosts
>     self.pool.wait(numtasks=len(nodes))
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/threadpool.py",
> line 177, in wait
>     "An error occurred in ThreadPool", excs)
> ThreadPoolException: An error occurred in ThreadPool
> >>> Sleeping...(looping again in 60 secs)
>
>
>
> Stack Overflow has a simple suggestion for how to solve that:
>
> http://stackoverflow.com/questions/19152578/no-handlers-could-be-found-for-logger-paramiko
>
> It hasn't recurred.
>
> As for your stack trace logs, I see that you're running Windows and I
> think I'll never be able to solve your problems.
>
> On Wed, Jun 3, 2015 at 9:19 AM, Avner May <avnermay at cs.columbia.edu>
> wrote:
>
>> Attached are 2 more logs of load balancer crashes.
>>
>> On Tue, Jun 2, 2015 at 4:27 PM, Avner May <avnermay at cs.columbia.edu>
>> wrote:
>>
>>> 1) I am using StarCluster version 0.95.6
>>> C:\Windows\system32>starcluster addnode mycluster
>>> StarCluster - (http://star.mit.edu/cluster) (*v. 0.95.6*)
>>> Software Tools for Academics and Researchers (STAR)
>>> Please submit bug reports to starcluster at mit.edu
>>>
>>> 2) I did not try to override wait_time
>>>
>>> 3) SGE plugin is running
>>>
>>> And this particular failure occurred when the load balancer was trying
>>> to add nodes.
>>>
>>> Thanks,
>>> Avner
>>>
>>> On Tue, Jun 2, 2015 at 3:24 PM, Rajat Banerjee <rajatb at post.harvard.edu>
>>> wrote:
>>>
>>>> The log line you cited:
>>>> Traceback (most recent call last):
>>>>   File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
>>>> line 719, in _eval_add_node
>>>>
>>>>
>>>> has this, which is puzzling:
>>>> log.info("No queued jobs older than %d seconds" %
>>>>          self.longest_allowed_queue_time)
>>>>
>>>> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py
>>>>
>>>> Three questions -
>>>> 1) Are you using an up-to-date version?
>>>> 2) Did you try to override wait_time (aka longest_allowed_queue_time) in
>>>> your config file or on the load balancer command line? Otherwise it makes
>>>> very little sense: your stack trace looks like add_node failed, not the
>>>> load balancer.
>>>> 3) Are any plugins running?
>>>>
>>>> On Tue, Jun 2, 2015 at 3:06 PM, Avner May <avnermay at cs.columbia.edu>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am writing because I have been having a lot of issues with the load
>>>>> balancer.  The most common issue is that it fails to remove
>>>>> instances effectively.  It goes very slowly through the
>>>>> instances it wants to terminate (the pace is frustrating regardless of
>>>>> whether the operation succeeds), and one by one fails to terminate
>>>>> each one.  Then, I am forced to kill a subset of the nodes in my cluster
>>>>> manually.  But this leaves the scheduler confused about how many
>>>>> nodes are actually in the network, so when I later submit jobs to the
>>>>> cluster again, it thinks it has enough nodes to handle that load, and
>>>>> doesn't create new instances.  So I am forced to create a ton of dummy jobs
>>>>> (e.g., "qsub -V -b y -cwd hostname") to trick the scheduler into
>>>>> thinking that it has more queued jobs than "available" machines.  These
>>>>> issues are quite annoying.
>>>>>
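
The dummy-job workaround described above could be scripted roughly like
this. It is a sketch only: the qsub arguments are the ones quoted in the
message, while the helper names and the injectable run argument are
hypothetical, and it assumes it is run on the master node where qsub is on
the PATH.

```python
import subprocess

# The placeholder command quoted in the message, in argv form.
DUMMY_JOB = ["qsub", "-V", "-b", "y", "-cwd", "hostname"]


def dummy_job_commands(count):
    """Return `count` independent copies of the placeholder qsub command."""
    return [list(DUMMY_JOB) for _ in range(count)]


def submit_all(commands, run=subprocess.check_call):
    """Run each command in turn.

    `run` can be swapped for any callable taking an argv list, which
    allows a dry run without touching the scheduler.
    """
    for argv in commands:
        run(argv)
```

For a dry run, pass a collector instead of the default, for example
planned = [] followed by submit_all(dummy_job_commands(20), run=planned.append),
and inspect planned before submitting for real.
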
>>>>> Additionally, just now I had an issue where the load balancer failed
>>>>> to launch a machine:
>>>>>
>>>>> !!! ERROR - Error occured while running plugin
>>>>> 'starcluster.clustersetup.DefaultClusterSetup':
>>>>> !!! ERROR - Failed to add new host
>>>>> Traceback (most recent call last):
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
>>>>> line 719, in _eval_add_node
>>>>>     self._cluster.add_nodes(need_to_add)
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>>>>> line 1042, in add_nodes
>>>>>     self.run_plugins(method_name="on_add_node", node=node)
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>>>>> line 1690, in run_plugins
>>>>>     self.run_plugin(plug, method_name=method_name, node=node)
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>>>>> line 1715, in run_plugin
>>>>>     func(*args)
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
>>>>> line 425, in on_add_node
>>>>>     self._setup_etc_hosts(nodes)
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
>>>>> line 252, in _setup_etc_hosts
>>>>>     self.pool.wait(numtasks=len(nodes))
>>>>>   File
>>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\threadpool.py",
>>>>> line 177, in wait
>>>>>     "An error occurred in ThreadPool", excs)
>>>>> ThreadPoolException: An error occurred in ThreadPool
>>>>> >>> Sleeping...(looping again in 60 secs)
>>>>>
>>>>> After getting this error, for some reason the load balancer stopped
>>>>> recognizing the existence of the cluster:
>>>>>
>>>>> C:\Windows\system32>starcluster loadbalance --max_nodes=100
>>>>> --min_nodes=1 --add_nodes_per_iter=17 babel2
>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>>>>> Software Tools for Academics and Researchers (STAR)
>>>>> Please submit bug reports to starcluster at mit.edu
>>>>>
>>>>> !!! ERROR - cluster babel2 is not running
>>>>>
>>>>> Is anyone else hitting similar issues with the load balancer?
>>>>>
>>>>> Thanks,
>>>>> Avner
>>>>>