[StarCluster] Multiple node cluster problems

Justin Riley jtriley at MIT.EDU
Thu Dec 12 12:00:13 EST 2013


Hi Joydeep/Dan,

Is this still an issue for you? I'm not able to reproduce this in
us-east-1 (EC2-classic for me) or us-west-2 (VPC for me). This is a
*major* issue if this change is indeed permanent. What region are you
running into this with?

Thanks for reporting.

~Justin


On Wed, Nov 27, 2013 at 12:29:22PM +0530, Joydeep Sen Sarma wrote:
>    Hi Daniel and StarClusterers,
>    Qubole uses a fork of Starcluster and our service has been widely affected
>    by this problem.
>    What happened is that AWS broke their API that returns the user data file
>    for nodes launched by starcluster. I believe we can now only get the user
>    data file for the first instance in the launch list. As a result - calls
>    to get alias (which read the user data file of the node) broke.
>    We have filed a case with AWS - but they have been unusually sloppy in
>    fixing it. They haven't even acknowledged the problem on their status
>    page. Meanwhile - we have been busy coding up workaround hacks.
>    If the StarCluster community can independently complain to AWS - that
>    might help (perhaps). The workarounds aren't pleasant.
>    - Joydeep
> 
>    On Tue, Nov 26, 2013 at 12:44 AM, Daniel Polhamus <[1]danp at metrumrg.com>
>    wrote:
> 
>      Hi all,
>      We're seeing issues with using clusters consisting of multiple nodes
>      today.  Launching clusters with >=3 nodes fails, with a report of being
>      unable to assign aliases to nodes other than "master".  The same problem
>      is seen with addnode.  Adding one node is fine, but adding more than one
>      gives the alias problem again.  Terminating these clusters fails due to
>      the missing alias as well; you have to use the ec2toolkit to shut down
>      the offending nodes that were not named.
>      I'm on the latest development version, and I've noticed that there's a
>      lot of gibberish in the node user data (as viewed through the web
>      console) as of today.
>      Debug at the end, and thanks for the help.
>      Dan
>      > starcluster -d start -c testing brokenCluster -s 3
>      ... 
>      ...
>      >>> Waiting for all nodes to be in a 'running' state...
>      2013-11-25 14:13:06,323 cluster.py:734 - DEBUG - existing nodes: {}
>      2013-11-25 14:13:06,323 cluster.py:742 - DEBUG - adding node i-323f504f to self._nodes list
>      2013-11-25 14:13:06,839 cluster.py:742 - DEBUG - adding node i-2c3f5051 to self._nodes list
>      2013-11-25 14:13:07,001 node.py:147 - DEBUG - invalid aliases file in user_data:
>      3/3 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>      !!! ERROR - instance i-2c3f5051 has no alias
>      2013-11-25 14:13:07,003 cli.py:301 - DEBUG - instance i-2c3f5051 has no alias
>      Traceback (most recent call last):
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cli.py", line 274, in main
>          sc.execute(args)
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/commands/start.py", line 220, in execute
>          validate_running=validate_running)
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1534, in start
>          return self._start(create=create, create_only=create_only)
>        File "<string>", line 2, in _start
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>          res = func(*arg, **kargs)
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1557, in _start
>          self.setup_cluster()
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1565, in setup_cluster
>          self.wait_for_cluster()
>        File "<string>", line 2, in wait_for_cluster
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/utils.py", line 111, in wrap_f
>          res = func(*arg, **kargs)
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1350, in wait_for_cluster
>          self.wait_for_running_instances()
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 1305, in wait_for_running_instances
>          nodes = nodes or self.get_nodes_or_raise()
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 754, in get_nodes_or_raise
>          nodes = self.nodes
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py", line 744, in nodes
>          if n.is_master():
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 898, in is_master
>          return self.alias == 'master' or self.alias.endswith("-master")
>        File "/Library/Python/2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 150, in alias
>          "instance %s has no alias" % self.id)
>      BaseException: instance i-2c3f5051 has no alias
>      --
>      Daniel G Polhamus, PhD
>      Metrum Research Group, LLC
>      _______________________________________________
>      StarCluster mailing list
>      [3]StarCluster at mit.edu
>      [4]http://mailman.mit.edu/mailman/listinfo/starcluster
> 
> References
> 
>    Visible links
>    1. mailto:danp at metrumrg.com
>    2. http://self.id/
>    3. mailto:StarCluster at mit.edu
>    4. http://mailman.mit.edu/mailman/listinfo/starcluster
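
For anyone following the thread, the failure described above can be
sketched in a few lines of Python. This is an illustration only and not
StarCluster's actual code: the function name resolve_alias, the "alias"
tag key, and the fallback to a tag are all assumptions made for the
example. It shows why a node whose user data comes back empty ends up
with no alias unless the alias is also recorded somewhere else.

```python
def resolve_alias(instance_id, tags, aliases, launch_index):
    """Return an alias for a node (hypothetical helper, not StarCluster API).

    tags         -- dict of instance tags (assumed to possibly hold "alias")
    aliases      -- list of aliases parsed from user data, possibly empty
    launch_index -- the instance's position in the launch request
    """
    # Prefer an alias already recorded as a tag, if there is one.
    alias = tags.get("alias")
    if alias:
        return alias
    # Otherwise pick the alias at this instance's launch index.
    if aliases and 0 <= launch_index < len(aliases):
        return aliases[launch_index]
    # Empty or truncated user data and no tag: nothing left to use.
    raise LookupError("instance %s has no alias" % instance_id)


if __name__ == "__main__":
    names = ["master", "node001", "node002"]
    print(resolve_alias("i-323f504f", {}, names, 0))
    print(resolve_alias("i-2c3f5051", {"alias": "node001"}, [], 1))
```

With an empty aliases list and no tag, the call raises, which matches the
"has no alias" error in the debug output above.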

