[StarCluster] Load Balancer issues
Avner May
avnermay at cs.columbia.edu
Tue Jun 2 16:27:10 EDT 2015
1) I am using StarCluster version 0.95.6
C:\Windows\system32>starcluster addnode mycluster
StarCluster - (http://star.mit.edu/cluster) (*v. 0.95.6*)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu
2) I did not try to override wait_time
3) SGE plugin is running
And this particular failure occurred when the load balancer was trying to
add nodes.
Thanks,
Avner
On Tue, Jun 2, 2015 at 3:24 PM, Rajat Banerjee <rajatb at post.harvard.edu>
wrote:
> The log line you cited:
> Traceback (most recent call last):
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
> line 719, in _eval_add_node
>
>
> has this, which is puzzling:
> log.info("No queued jobs older than %d seconds" % self
> .longest_allowed_queue_time)
>
> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py
>
> Three questions -
> 1) Are you using an up-to-date version?
> 2) Did you try to override wait_time (aka longest_allowed_queue_time) in
> your config file or on the load balancer command line? Otherwise it makes
> very little sense; your stack trace looks like add_node failed, not the
> load balancer.
> 3) Any plugins running?
>
> On Tue, Jun 2, 2015 at 3:06 PM, Avner May <avnermay at cs.columbia.edu>
> wrote:
>
>> Hi all,
>>
>> I am writing because I have been having a lot of issues with the load
>> balancer. The most common issue I have is that it fails to remove
>> instances effectively. In a super slow fashion, it goes through the
>> instances it wants to terminate (this pace is frustrating independent of
>> the failure/success of the operation), and one by one fails to terminate
>> each one. Then, I am forced to kill a subset of the nodes in my cluster
>> manually. But this results in the scheduler being confused by how many
>> nodes are actually in the network, so when I later submit jobs to the
>> cluster again, it thinks it has enough nodes to handle that load, and
>> doesn't create new instances. So I am forced to create a ton of dummy jobs
>> (e.g., "qsub -V -b y -cwd hostname"), to trick the scheduler into thinking
>> that it has more queued jobs than "available" machines. These issues are
>> quite annoying.
>>
>> Additionally, just now I had an issue where the load balancer failed to
>> launch a machine:
>>
>> !!! ERROR - Error occured while running plugin
>> 'starcluster.clustersetup.DefaultClusterSetup':
>> !!! ERROR - Failed to add new host
>> Traceback (most recent call last):
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
>> line 719, in _eval_add_node
>> self._cluster.add_nodes(need_to_add)
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>> line 1042, in add_nodes
>> self.run_plugins(method_name="on_add_node", node=node)
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>> line 1690, in run_plugins
>> self.run_plugin(plug, method_name=method_name, node=node)
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>> line 1715, in run_plugin
>> func(*args)
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
>> line 425, in on_add_node
>> self._setup_etc_hosts(nodes)
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
>> line 252, in _setup_etc_hosts
>> self.pool.wait(numtasks=len(nodes))
>> File
>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\threadpool.py",
>> line 177, in wait
>> "An error occurred in ThreadPool", excs)
>> ThreadPoolException: An error occurred in ThreadPool
>> >>> Sleeping...(looping again in 60 secs)
>>
>> After getting this error, for some reason the load balancer stopped
>> recognizing the existence of the cluster:
>>
>> C:\Windows\system32>starcluster loadbalance --max_nodes=100 --min_nodes=1
>> --add_nodes_per_iter=17 babel2
>> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>> Software Tools for Academics and Researchers (STAR)
>> Please submit bug reports to starcluster at mit.edu
>>
>> !!! ERROR - cluster babel2 is not running
>>
>> Is anyone else hitting similar issues with the load balancer?
>>
>> Thanks,
>> Avner
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
>