<div dir="ltr">To follow up: even after applying the hack from the link, the cluster still takes a LONG time to configure /etc/hosts. Any idea why this might be happening? <div>Weirder yet, all of the nodes have an identical /etc/hosts file. </div>
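<div>In case anyone wants to repeat that comparison, here is a minimal Python sketch of one way to do it: hash each node's /etc/hosts text and group nodes by digest. The helper names (node_names, group_by_content) are made up for illustration, the node aliases assume StarCluster's default master/node001 naming, and the sample file contents are placeholders. Collecting the real files over ssh is left out.</div>

```python
import hashlib
from collections import defaultdict


def node_names(workers):
    # Hypothetical helper: assumes the default aliases "master", node001, node002, ...
    return ["master"] + ["node%03d" % i for i in range(1, workers + 1)]


def group_by_content(hosts_files):
    """Map the SHA-256 digest of each file's text to the nodes that share it."""
    groups = defaultdict(list)
    for node, text in sorted(hosts_files.items()):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(node)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder data; in practice each value would be the output of
    # `cat /etc/hosts` collected over ssh from that node.
    sample = "127.0.0.1 localhost\n10.0.0.1 master\n"
    files = {name: sample for name in node_names(30)}
    groups = group_by_content(files)
    print("%d nodes, %d distinct /etc/hosts file(s)" % (len(files), len(groups)))
```

<div>If every node really has the same file, there is exactly one group. Any extra group lists the nodes whose file differs.</div>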
<div>To check, I ran a loop that ssh'd into all 31 nodes (30 workers + master) and cat'd each /etc/hosts file. </div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 25, 2014 at 8:11 PM, Cory Dolphin <span dir="ltr">&lt;<a href="mailto:wcdolphin@gmail.com" target="_blank">wcdolphin@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><br></div>Whenever I try to add a node to a spot instance cluster, StarCluster does not properly wait for the spot request to be fulfilled, and instead errors out:<div>
<br></div><div><div><font face="courier new, monospace">starcluster addnode mycluster</font></div>
<div><font face="courier new, monospace">StarCluster - (<a href="http://star.mit.edu/cluster" target="_blank">http://star.mit.edu/cluster</a>) (v. 0.95.3)</font></div><div><font face="courier new, monospace">Software Tools for Academics and Researchers (STAR)</font></div>
<div><font face="courier new, monospace">Please submit bug reports to <a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a></font></div><div><font face="courier new, monospace"><br></font></div><div>
<font face="courier new, monospace">>>> Launching node(s): node030</font></div>
<div><font face="courier new, monospace">SpotInstanceRequest:sir-85f44249</font></div><div><font face="courier new, monospace">>>> Waiting for spot requests to propagate...</font></div><div><font face="courier new, monospace">>>> Waiting for node(s) to come up... (updating every 30s)</font></div>
<div><font face="courier new, monospace">>>> Waiting for all nodes to be in a 'running' state...</font></div><div><font face="courier new, monospace">30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</font></div>
<div><font face="courier new, monospace">>>> Waiting for SSH to come up on all nodes...</font></div><div><font face="courier new, monospace">30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</font></div>
<div><font face="courier new, monospace">>>> Waiting for cluster to come up took 1.179 mins</font></div><div><font face="courier new, monospace">!!! ERROR - node 'node030' does not exist</font></div></div>
<div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Once the spot instance request is fulfilled, the instance does not have a name. Looks like someone else had this problem quite <a href="http://star.mit.edu/cluster/mlarchives/2058.html" target="_blank">recently</a>. I wonder what the difference between our setup and yours is?</div>
<div><div class="h5">
<div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 25, 2014 at 7:42 PM, Rayson Ho <span dir="ltr">&lt;<a href="mailto:raysonlogin@gmail.com" target="_blank">raysonlogin@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">If you really have a slow connection, you may consider bootstrapping<br>
StarCluster on AWS, i.e. configure an m1.small (or even a t1.micro) instance and<br>
install StarCluster on that node. In fact, there's a CloudFormation<br>
template for that:<br>
<a href="http://aws.typepad.com/aws/2012/06/ec2-spot-instance-updates-auto-scaling-and-cloudformation-integration-new-sample-app-1.html" target="_blank">http://aws.typepad.com/aws/2012/06/ec2-spot-instance-updates-auto-scaling-and-cloudformation-integration-new-sample-app-1.html</a><br>
On the other hand, it's way easier to do it by hand: just launch<br>
an instance from the standard Ubuntu AMI, and then install StarCluster<br>
on that instance.<br>
<br>
And as others mentioned, most large StarClusters are launched by<br>
first starting a small cluster and then growing it dynamically. You<br>
should be able to run the addnode command from your qmaster node,<br>
provided that you have StarCluster set up there (note that your AWS key<br>
will then be on the EC2 instance, so it is slightly more risky if security<br>
is the main concern).<br>
<br>
Rayson<br>
<br>
==================================================<br>
Open Grid Scheduler - The Official Open Source Grid Engine<br>
<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
<a href="http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html" target="_blank">http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html</a><br>
<div><br>
<br>
On Tue, Mar 25, 2014 at 8:04 AM, Butson, Christopher &lt;<a href="mailto:cbutson@mcw.edu" target="_blank">cbutson@mcw.edu</a>&gt; wrote:<br>
</div>> Interesting: I let it go and it eventually continued, but it took over an hour to get past "Configuring passwordless ssh for root". Still waiting for the cluster to finish startup...<br>
<div><div>><br>
> Christopher R. Butson, Ph.D.<br>
> Associate Professor<br>
> Biotechnology & Bioengineering Center<br>
> Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine<br>
> Medical College of Wisconsin<br>
> <a href="tel:%28414%29%20955-2678" value="+14149552678" target="_blank">(414) 955-2678</a><br>
> <a href="mailto:cbutson@mcw.edu" target="_blank">cbutson@mcw.edu</a><br>
><br>
><br>
> From: Butson, Christopher &lt;<a href="mailto:cbutson@mcw.edu" target="_blank">cbutson@mcw.edu</a>&gt;<br>
> Date: Tuesday, March 25, 2014 12:13 PM<br>
> To: "<a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a>" &lt;<a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a>&gt;<br>
> Subject: Starcluster stuck during setup<br>
><br>
> I'm on a slow internet connection overseas, trying to initiate a cluster using StarCluster. Once I type "starcluster start mycluster" everything seems to go ok but it gets stuck at the following point and never seems to get past it:<br>
>>>> Mounting all NFS export path(s) on 79 worker node(s)<br>
> 79/79 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%<br>
>>>> Setting up NFS took 2.777 mins<br>
>>>> Configuring passwordless ssh for root<br>
><br>
> Any idea why this might occur? Thanks,<br>
> Chris<br>
><br>
> _______________________________________________<br>
> StarCluster mailing list<br>
> <a href="mailto:StarCluster@mit.edu" target="_blank">StarCluster@mit.edu</a><br>
> <a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
</div></div></blockquote></div><br></div></div></div></div>
</blockquote></div><br></div>