[StarCluster] Multiple node cluster problems
Justin Riley
jtriley at MIT.EDU
Thu Dec 12 12:00:13 EST 2013
Hi Joydeep/Dan,
Is this still an issue for you? I'm not able to reproduce this in
us-east-1 (EC2-classic for me) or us-west-2 (VPC for me). This is a
*major* issue if this change is indeed permanent. What region are you
running into this with?
Thanks for reporting.
~Justin
On Wed, Nov 27, 2013 at 12:29:22PM +0530, Joydeep Sen Sarma wrote:
> Hi Daniel and StarClusterers,
> Qubole uses a fork of StarCluster and our service has been widely affected
> by this problem.
> What happened is that AWS broke their API that returns the user data file
> for nodes launched by starcluster. I believe we can now only get the user
> data file for the first instance in the launch list. As a result - calls
> to get alias (which read the user data file of the node) broke.
> We have filed a case with AWS - but they have been unusually sloppy in
> fixing it. They haven't even acknowledged the problem on their status
> page. Meanwhile - we have been busy coding up workaround hacks.
> If the StarCluster community can independently complain to AWS - that
> might help (perhaps). The workarounds aren't pleasant.
> - Joydeep
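
To make the failure mode described above concrete, here is a minimal Python sketch of an alias lookup that depends on per-node user data. The names `alias_from_user_data` and `MissingAliasError`, and the one-alias-per-line layout ordered by launch index, are assumptions for illustration only; this is not StarCluster's actual code or file format. It shows why the first instance in the launch list can still resolve its alias while a node whose user data comes back empty cannot.

```python
class MissingAliasError(Exception):
    """Raised when no alias can be derived for an instance (hypothetical)."""


def alias_from_user_data(instance_id, launch_index, aliases_text):
    """Return the alias at position launch_index in aliases_text.

    Assumed layout: one alias per line, ordered by launch index.
    """
    aliases = [line.strip() for line in aliases_text.splitlines()
               if line.strip()]
    if 0 <= launch_index < len(aliases):
        return aliases[launch_index]
    raise MissingAliasError("instance %s has no alias" % instance_id)


if __name__ == "__main__":
    user_data = "master\nnode001\nnode002\n"
    # First instance: user data is returned, so the lookup works.
    print(alias_from_user_data("i-323f504f", 0, user_data))
    # Another instance: user data comes back empty, so the lookup fails.
    try:
        alias_from_user_data("i-2c3f5051", 1, "")
    except MissingAliasError as exc:
        print("ERROR - %s" % exc)
```

Because the lookup is keyed on the node's own copy of the user data, an empty response for any one instance is enough to break every code path that needs that node's alias.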
>
> On Tue, Nov 26, 2013 at 12:44 AM, Daniel Polhamus <danp at metrumrg.com>
> wrote:
>
> Hi all,
> We're seeing issues with using clusters consisting of multiple nodes
> today. Launching clusters with >=3 nodes fails, with a report of being
> unable to assign aliases to nodes other than "master". The same problem
> is seen with addnode: adding one node is fine, but adding more than one
> gives the alias problem again. Terminating these clusters also fails due
> to the missing alias; you have to use the ec2toolkit to shut down
> the offending nodes that were not named.
> I'm on the latest development version, and I've noticed that there's a
> lot of gibberish in the node user data (as viewed through the web
> console) as of today.
> Debug at the end, and thanks for the help.
> Dan
> > starcluster -d start -c testing brokenCluster -s 3
> ...
> ...
> >>> Waiting for all nodes to be in a 'running' state...
> 2013-11-25 14:13:06,323 cluster.py:734 - DEBUG - existing nodes: {}
> 2013-11-25 14:13:06,323 cluster.py:742 - DEBUG - adding node i-323f504f to self._nodes list
> 2013-11-25 14:13:06,839 cluster.py:742 - DEBUG - adding node i-2c3f5051 to self._nodes list
> 2013-11-25 14:13:07,001 node.py:147 - DEBUG - invalid aliases file in user_data:
> 3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> !!! ERROR - instance i-2c3f5051 has no alias
> 2013-11-25 14:13:07,003 cli.py:301 - DEBUG - instance i-2c3f5051 has no alias
> Traceback (most recent call last):
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 274, in main
>     sc.execute(args)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/commands/start.py", line 220, in execute
>     validate_running=validate_running)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1534, in start
>     return self._start(create=create, create_only=create_only)
>   File "<string>", line 2, in _start
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>     res = func(*arg, **kargs)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1557, in _start
>     self.setup_cluster()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1565, in setup_cluster
>     self.wait_for_cluster()
>   File "<string>", line 2, in wait_for_cluster
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>     res = func(*arg, **kargs)
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1350, in wait_for_cluster
>     self.wait_for_running_instances()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1305, in wait_for_running_instances
>     nodes = nodes or self.get_nodes_or_raise()
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 754, in get_nodes_or_raise
>     nodes = self.nodes
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 744, in nodes
>     if n.is_master():
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 898, in is_master
>     return self.alias == 'master' or self.alias.endswith("-master")
>   File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 150, in alias
>     "instance %s has no alias" % self.id)
> BaseException: instance i-2c3f5051 has no alias
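
The traceback ends with the alias lookup raising because nothing usable was found in the node's user data. One conceivable stopgap, sketched here under loud assumptions, is to fall back to a name derived from the launch index when user data is missing. `fallback_alias` and `resolve_alias` are hypothetical helpers, not part of StarCluster and not necessarily what any fork did. The sketch assumes the master is launch index 0 and workers are named `node001`, `node002`, and so on, which may not hold for nodes added later with addnode.

```python
def fallback_alias(launch_index):
    """Derive an alias from the launch index alone (assumed naming scheme)."""
    if launch_index == 0:
        return "master"
    return "node%03d" % launch_index


def resolve_alias(launch_index, aliases_text):
    """Prefer the alias listed in user data; otherwise use the fallback.

    aliases_text is assumed to hold one alias per line, ordered by
    launch index, and may be empty or None when user data is unavailable.
    """
    aliases = [line.strip() for line in (aliases_text or "").splitlines()
               if line.strip()]
    if 0 <= launch_index < len(aliases):
        return aliases[launch_index]
    return fallback_alias(launch_index)


if __name__ == "__main__":
    # User data available: the listed alias wins.
    print(resolve_alias(1, "master\nworker-a\n"))
    # User data missing: fall back to the index-derived name.
    print(resolve_alias(1, ""))
```

The trade-off is that an index-derived name is only a guess; it keeps launch and terminate working but can disagree with the aliases the cluster was actually configured with.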
> --
> Daniel G Polhamus, PhD
> Metrum Research Group, LLC
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster