[StarCluster] Multiple node cluster problems

Joydeep Sen Sarma jsensarma at gmail.com
Wed Nov 27 01:59:22 EST 2013


Hi Daniel and StarClusterers,

Qubole uses a fork of StarCluster and our service has been widely affected
by this problem.

What happened is that AWS broke the API that returns the user data file
for nodes launched by StarCluster. I believe we can now get the user data
file only for the first instance in the launch list. As a result, calls to
get a node's alias (which read that node's user data file) broke.
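
To make the failure mode concrete, here is a minimal sketch of an alias lookup that does not depend solely on user data. It is not StarCluster's actual code: the helper name, the one-alias-per-line file format, and the "alias" tag fallback are all assumptions for illustration.

```python
def resolve_alias(instance_id, launch_index, aliases_text, tags):
    """Return the alias for one instance, or raise if none can be found.

    Hypothetical helper. Assumes the aliases file holds one alias per
    line, indexed by the instance's launch index, and that an 'alias'
    tag may already have been stored on the instance.
    """
    # Prefer the tag: reading it does not involve the user data call.
    alias = tags.get("alias")
    if alias:
        return alias
    # Fall back to the aliases file; an empty or garbled file yields no lines.
    lines = [ln.strip() for ln in (aliases_text or "").splitlines() if ln.strip()]
    if 0 <= launch_index < len(lines):
        return lines[launch_index]
    raise ValueError("instance %s has no alias" % instance_id)
```

With empty user data and no tag the helper raises, which matches the "has no alias" error in Daniel's log; with a tag present the lookup succeeds regardless of what the user data call returns.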

We have filed a case with AWS - but they have been unusually sloppy in
fixing it. They haven't even acknowledged the problem on their status page.
Meanwhile - we have been busy coding up workaround hacks.

If the StarCluster community can independently complain to AWS - that might
help (perhaps). The workarounds aren't pleasant.

- Joydeep


On Tue, Nov 26, 2013 at 12:44 AM, Daniel Polhamus <danp at metrumrg.com> wrote:

> Hi all,
> We're seeing issues with using clusters consisting of multiple nodes
> today.  Launch of clusters with >=3 nodes fails, with report of being
> unable to assign aliases to nodes other than "master".  The same problem is
> seen with addnode.  Adding one node is fine, but adding more than one gives
> the alias problem again.  Terminating these clusters fails due to the
> missing alias as well, you have to use the ec2toolkit to shut down the
> offending nodes that were not named.
>
> I'm on the latest developmental version, and I've noticed that there's a
> lot of gibberish in the node user data (as viewed through the web console)
> as of today.
>
> Debug at the end, and thanks for the help.
>
> Dan
>
>
> > starcluster -d start -c testing brokenCluster -s 3
>
> ...
> ...
>
> >>> Waiting for all nodes to be in a 'running' state...
> 2013-11-25 14:13:06,323 cluster.py:734 - DEBUG - existing nodes: {}
> 2013-11-25 14:13:06,323 cluster.py:742 - DEBUG - adding node i-323f504f to self._nodes list
> 2013-11-25 14:13:06,839 cluster.py:742 - DEBUG - adding node i-2c3f5051 to self._nodes list
> 2013-11-25 14:13:07,001 node.py:147 - DEBUG - invalid aliases file in user_data:
>
> 3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> !!! ERROR - instance i-2c3f5051 has no alias
> 2013-11-25 14:13:07,003 cli.py:301 - DEBUG - instance i-2c3f5051 has no alias
> Traceback (most recent call last):
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 274, in main
>     sc.execute(args)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/commands/start.py", line 220, in execute
>     validate_running=validate_running)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1534, in start
>     return self._start(create=create, create_only=create_only)
>   File "<string>", line 2, in _start
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>     res = func(*arg, **kargs)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1557, in _start
>     self.setup_cluster()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1565, in setup_cluster
>     self.wait_for_cluster()
>   File "<string>", line 2, in wait_for_cluster
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>     res = func(*arg, **kargs)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1350, in wait_for_cluster
>     self.wait_for_running_instances()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1305, in wait_for_running_instances
>     nodes = nodes or self.get_nodes_or_raise()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 754, in get_nodes_or_raise
>     nodes = self.nodes
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 744, in nodes
>     if n.is_master():
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 898, in is_master
>     return self.alias == 'master' or self.alias.endswith("-master")
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 150, in alias
>     "instance %s has no alias" % self.id)
> BaseException: instance i-2c3f5051 has no alias
>
> --
> Daniel G Polhamus, PhD
> Metrum Research Group, LLC
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>