<div dir="ltr">Also, this cluster somehow got into a weird state, with two instances both named node001:<div><br></div><div><div>Cluster nodes:</div><div>     master running i-b0a5cec0 <a href="http://ec2-204-236-252-51.compute-1.amazonaws.com">ec2-204-236-252-51.compute-1.amazonaws.com</a></div>
<div>    node001 running i-5c3e542c <a href="http://ec2-54-235-230-217.compute-1.amazonaws.com">ec2-54-235-230-217.compute-1.amazonaws.com</a></div><div>    node001 running i-063e5476 <a href="http://ec2-23-20-247-62.compute-1.amazonaws.com">ec2-23-20-247-62.compute-1.amazonaws.com</a></div>
<div>    node002 running i-5a32582a <a href="http://ec2-23-23-20-49.compute-1.amazonaws.com">ec2-23-23-20-49.compute-1.amazonaws.com</a></div><div>    node003 running i-5c32582c <a href="http://ec2-54-242-192-21.compute-1.amazonaws.com">ec2-54-242-192-21.compute-1.amazonaws.com</a></div>
<div>    node004 running i-da741eaa <a href="http://ec2-23-22-42-142.compute-1.amazonaws.com">ec2-23-22-42-142.compute-1.amazonaws.com</a></div><div>    node005 running i-dc741eac <a href="http://ec2-50-16-179-158.compute-1.amazonaws.com">ec2-50-16-179-158.compute-1.amazonaws.com</a></div>
<div>    node006 running i-a06515d0 <a href="http://ec2-50-19-184-152.compute-1.amazonaws.com">ec2-50-19-184-152.compute-1.amazonaws.com</a></div><div>    node007 running i-a26515d2 <a href="http://ec2-54-234-70-30.compute-1.amazonaws.com">ec2-54-234-70-30.compute-1.amazonaws.com</a></div>
<div>    node008 running i-c4493ab4 <a href="http://ec2-54-242-116-109.compute-1.amazonaws.com">ec2-54-242-116-109.compute-1.amazonaws.com</a></div><div>    node009 running i-c6493ab6 <a href="http://ec2-107-22-61-85.compute-1.amazonaws.com">ec2-107-22-61-85.compute-1.amazonaws.com</a></div>
<div>    node010 running i-c8493ab8 <a href="http://ec2-23-20-134-170.compute-1.amazonaws.com">ec2-23-20-134-170.compute-1.amazonaws.com</a></div><div>Total nodes: 12</div></div><div><br></div><div>Also, some nodes (e.g. node002, node003, node004) were not listed in the @allhosts host group in the queue.</div>
<div style>Possibly this is because I was running the load balancer?  It didn&#39;t seem to be working quite right; it wasn&#39;t really removing nodes.</div><div style><br></div><div style>Dan</div><div style><br></div></div>
<div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Feb 6, 2013 at 12:14 AM, Daniel Povey <span dir="ltr">&lt;<a href="mailto:dpovey@gmail.com" target="_blank">dpovey@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">BTW, I manually removed it from the queue using qconf -mhgrp @allhosts before I called the rn command (because I wanted to make sure no jobs were running on the nodes I was removing, and I wasn&#39;t sure whether the rn command would wait).  I&#39;m not sure whether this could have caused the crash.<div>

<br></div><div>Dan</div><div><br></div></div>
</blockquote></div><br></div>