[StarCluster] Crash report

Daniel Povey dpovey at gmail.com
Wed Feb 6 00:17:06 EST 2013


Also, somehow this cluster got into a weird state, with two copies of
node001:

Cluster nodes:
     master running i-b0a5cec0 ec2-204-236-252-51.compute-1.amazonaws.com
    node001 running i-5c3e542c ec2-54-235-230-217.compute-1.amazonaws.com
    node001 running i-063e5476 ec2-23-20-247-62.compute-1.amazonaws.com
    node002 running i-5a32582a ec2-23-23-20-49.compute-1.amazonaws.com
    node003 running i-5c32582c ec2-54-242-192-21.compute-1.amazonaws.com
    node004 running i-da741eaa ec2-23-22-42-142.compute-1.amazonaws.com
    node005 running i-dc741eac ec2-50-16-179-158.compute-1.amazonaws.com
    node006 running i-a06515d0 ec2-50-19-184-152.compute-1.amazonaws.com
    node007 running i-a26515d2 ec2-54-234-70-30.compute-1.amazonaws.com
    node008 running i-c4493ab4 ec2-54-242-116-109.compute-1.amazonaws.com
    node009 running i-c6493ab6 ec2-107-22-61-85.compute-1.amazonaws.com
    node010 running i-c8493ab8 ec2-23-20-134-170.compute-1.amazonaws.com
Total nodes: 12
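
For what it's worth, the two node001 entries can at least be told apart by instance id; something like the following shows each instance alongside its id and state (these are standard StarCluster commands, nothing specific to this cluster):

    starcluster listclusters     # per-cluster view; the duplicate alias shows up twice
    starcluster listinstances    # raw per-instance view with instance ids and DNS names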

Also, some nodes (e.g. node002, node003, node004) were not listed in the
@allhosts host group for the queue.
Possibly this is because I was running the load balancer?  It didn't seem
to be working quite right; it wasn't actually removing nodes.
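
In case it helps with debugging, a quick way to compare what SGE has in the
host group against the execution hosts it knows about (these are plain SGE
commands; @allhosts and all.q are the StarCluster defaults):

    qconf -shgrp @allhosts    # hosts currently in the @allhosts group
    qconf -sel                # all execution hosts registered with the qmaster
    qstat -f                  # queue instances per host; hosts missing from the group have no all.q slots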

Dan



On Wed, Feb 6, 2013 at 12:14 AM, Daniel Povey <dpovey at gmail.com> wrote:

> BTW, I manually removed the nodes from the queue using qconf -mhgrp @allhosts
> before I called the rn command (because I wanted to make sure no jobs were
> running on the nodes I was removing, and I wasn't sure whether the rn
> command would wait).  Not sure whether this could have caused the crash;
> the rough command sequence is sketched below.
>
> Dan
>
>
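
For context, the manual removal was roughly the sequence below; the cluster
name and node alias are placeholders, and the exact removenode ("rn")
arguments may differ between StarCluster versions:

    qstat -u '*'                              # list every user's jobs and the host each is running on
    qconf -mhgrp @allhosts                    # edit the host group and drop the node from the hostlist
    starcluster removenode mycluster node002  # the rn command proper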