[StarCluster] StarCluster Digest, Vol 22, Issue 1
Don MacMillen
macd at nimbic.com
Thu Jun 2 09:21:08 EDT 2011
Thanks Raj. Some additional information: we are using slightly
modified EBS-backed StarCluster images (the only modification was
adding our executables). The EBS-backed images probably have
different timing characteristics than the instance-store ones.
Another quick question: does 'starcluster terminate <clustername>'
call the plugin's 'on_remove_node' method? It looks like it does
not, but apologies if this is already documented. From our point of
view, it would be useful for the terminate command to call this
method for each node before the cluster is shut down.
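To make the request concrete, here is a rough sketch of the behaviour we
have in mind. None of this is StarCluster's actual code: Node, Cluster,
RecordingPlugin and the one-argument on_remove_node(node) signature are
simplified stand-ins invented for illustration, and the real plugin hook
may take more arguments.

```python
class Node:
    """Hypothetical stand-in for a cluster node."""

    def __init__(self, alias):
        self.alias = alias


class RecordingPlugin:
    """Hypothetical plugin that records which nodes it was asked to clean up."""

    def __init__(self):
        self.removed = []

    def on_remove_node(self, node):
        # Simplified signature; a real plugin hook may take more arguments.
        self.removed.append(node.alias)


class Cluster:
    """Hypothetical cluster that runs plugin cleanup before terminating."""

    def __init__(self, nodes, plugins):
        self.nodes = list(nodes)
        self.plugins = list(plugins)

    def terminate(self):
        # Give every plugin a chance to clean up each node first.
        for node in self.nodes:
            for plugin in self.plugins:
                plugin.on_remove_node(node)
        # Real instance shutdown would happen here; the sketch only forgets the nodes.
        self.nodes = []
```

With this shape, a plugin that releases licenses or deregisters hosts in
on_remove_node would get the same cleanup on terminate as it does on
removenode.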
Many thanks.
Don
On Wed, Jun 1, 2011 at 9:15 AM, <starcluster-request at mit.edu> wrote:
> Send StarCluster mailing list submissions to
> starcluster at mit.edu
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mailman.mit.edu/mailman/listinfo/starcluster
> or, via email, send a message with subject or body 'help' to
> starcluster-request at mit.edu
>
> You can reach the person managing the list at
> starcluster-owner at mit.edu
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of StarCluster digest..."
>
>
> Today's Topics:
>
> 1. Orphaned nodes (addnode failure) and ELB going over max
> cluster size when adding more than one node (Don MacMillen)
> 2. Re: Orphaned nodes (addnode failure) and ELB going over max
> cluster size when adding more than one node (Rajat Banerjee)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 29 May 2011 13:19:03 -0700
> From: Don MacMillen <macd at nimbic.com>
> Subject: [StarCluster] Orphaned nodes (addnode failure) and ELB going
> over max cluster size when adding more than one node
> To: starcluster at mit.edu
> Message-ID: <BANLkTimKLfR-RMGV9OB==W_H_daXvhzxWg at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> Two issues here, as reported earlier. On the first one, running with the
> new logging turned on, I see an intermittent failure of
> 'starcluster addnode <clustername>'. The error trace from the log file
> is below.
>
> Second, on ELB adding too many nodes when adding more than one node
> per iteration. The code at
> StarCluster/starcluster/plugins/sge/__init__.py, line 637, reads:
>
>         if need_to_add > 0:
>             need_to_add = min(self.add_nodes_per_iteration, need_to_add)
>
> The fix could be as simple as:
>
>         if need_to_add > 0:
>             head_room = self.max_nodes - self.stat.hosts
>             need_to_add = min(self.add_nodes_per_iteration, need_to_add,
>                               head_room)
>
> depending upon what you know about self.max_nodes and self.stat.hosts.
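A standalone version of the same clamp, written as a plain function so it
can be checked in isolation. This is only a sketch: nodes_to_add and its
parameter names are made up for illustration, current_hosts is assumed to
be counted the same way as max_nodes (for example, both including or both
excluding the master), and the max(0, ...) guard goes beyond the fix
proposed above to cover a cluster that is already over its limit.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes, current_hosts):
    """Return how many nodes to request this iteration without exceeding max_nodes."""
    if need_to_add <= 0:
        return 0
    # Room left before the cluster reaches its configured maximum;
    # never negative, even if the cluster is already over the limit.
    head_room = max(0, max_nodes - current_hosts)
    return min(add_nodes_per_iteration, need_to_add, head_room)
```

For instance, with max_nodes=10 and 9 hosts already running, a request for
5 nodes at 3 per iteration is cut down to a single node.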
>
> Regards,
>
> Don
>
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
> PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
>   File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
>     sc.execute(args)
>   File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
>     self.cm.add_node(tag, aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
>     cl.add_node(alias)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
>     self.add_nodes(1, aliases=aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
>     self.volumes)
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
>     self._setup_hostnames(nodes=[node])
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
>     self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
> AttributeError: 'NoneType' object has no attribute 'set_hostname'
>
> PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
> PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
> PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
> PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
> PID: 12630 cli.py:133 - ERROR - to starcluster at mit.edu
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
>
> ------------------------------
>
> Message: 2
> Date: Tue, 31 May 2011 14:46:54 -0400
> From: Rajat Banerjee <rbanerj at fas.harvard.edu>
> Subject: Re: [StarCluster] Orphaned nodes (addnode failure) and ELB
> going over max cluster size when adding more than one node
> To: Don MacMillen <macd at nimbic.com>
> Cc: starcluster at mit.edu
> Message-ID: <BANLkTi=a0ZFQJ1B78-bBi_Zpw7A0PXSZZg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Don,
> Thanks for your suggestions. I will test out the earlier suggestion as
> soon as I can and issue a pull request so that Justin can put it into
> the master branch.
>
> Regarding the latter suggestion, I think the add_node code needs to be
> more robust to detect and correct timing errors. Will work on it...
>
> Raj
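On the timing point: the traceback shows that the node handed to
_setup_hostnames was None, so one possible hardening is to poll until the
new node is actually visible before configuring it. The sketch below is
an assumption about how that could look, not StarCluster code:
wait_for_node, the lookup callable and the timeout values are all
hypothetical.

```python
import time


def wait_for_node(lookup, alias, timeout=60.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll lookup(alias) until it returns a node or the timeout expires.

    lookup is any callable returning the node object, or None while the
    node is not yet visible. sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        node = lookup(alias)
        if node is not None:
            return node
        if clock() >= deadline:
            raise TimeoutError(
                "node %r not visible after %s seconds" % (alias, timeout))
        sleep(interval)
```

A caller would wrap whatever function currently returns None for a
not-yet-registered instance and only proceed to hostname setup once a
real node object comes back.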
>