[StarCluster] force starcluster run
Adam Kraut
adamnkraut at gmail.com
Mon Dec 6 14:00:31 EST 2010
Hi Justin,
Thanks for the tips.
On Dec 5, 2010, at 8:58 PM, Justin Riley wrote:
> Hi Adam,
>
>> StarCluster rocks! Great job Justin et al.
>
> Thanks a lot, glad you like it :D
>
>> I was using starcluster (v. 0.9999) to start an 80-node spot instance cluster recently and ran into an issue.
>>
>> starcluster start -b 0.10 -s 80 SpotCluster
>
> Wow, OK, I haven't tried with that many nodes before but it should work.
> Please be patient with the setup, I'd imagine this will take some time
> given the size of the cluster.
Yeah, at a cluster size of 80 it just takes a long time. Looking at the debug output, it's the sheer number of remote ssh commands that slows the setup down so much.
>
>> It took a few minutes for the spot requests to open and the instances to start running. StarCluster was still waiting on instances to come up, so I ran the start command with --no-create:
>>
>> starcluster start --no-create -s 80 SpotCluster
>> starcluster start --no-create SpotCluster
>>
>> I can verify with the AWS console and the output of 'starcluster listclusters' that all 80 instances are up and running. Is there a way to force starcluster to run the install? Is starcluster checking something other than ec2-describe-instances, like ssh, to see if a node is up?
>>
>> Not sure if this is due to the cluster size, spot instances, or just an anomaly like one node not starting sshd.
>
> StarCluster checks that there are CLUSTER_SIZE nodes in a 'running'
> state and that ssh is up on all of those 'running' nodes while it is
> 'Waiting for cluster to start'. This is why StarCluster is still
> waiting even though you see all instances 'running' in
> ec2-describe-instances; ssh is likely not up yet on *all* instances
> even though they're all in a 'running' state. There really can't be a
> 'force install' because StarCluster has to be able to connect to every
> node in the cluster via ssh before it can do anything with the instances.
>
> With that said, I'd also expect this process of checking ssh on all the
> nodes to take some time, so if you're not patient you may not be
> giving StarCluster enough time to make connections to all 80 nodes. How
> long did you wait for StarCluster before canceling the run?
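That makes sense. If the 'ssh up' check boils down to trying to open port 22 on each node in turn, I'd guess the logic looks roughly like this (my sketch, not the actual StarCluster code), which would also explain why one dead node stalls every pass:

import socket

def ssh_is_up(host, port=22, timeout=5.0):
    """Return True if something is accepting connections on the ssh port."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except socket.error:
        return False

def cluster_is_up(hosts):
    # Serial check: with 80 nodes and a 5s timeout, a few slow or dead
    # nodes add minutes to every pass through the loop.
    return all(ssh_is_up(h) for h in hosts)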
I waited at least 30 minutes before killing it the first time, and I eventually did get it running. A couple of things worth noting here. One of the 80 nodes just wasn't responding on port 22 and the instance required a reboot. That's an EC2 issue, but I've run into it on smaller clusters too, so it's good that StarCluster checks. The time spent validating instances was still a problem, though, so I eventually commented out lines 1162-1163 in cluster.py:
# while not self.is_cluster_up(enforce_size):
#     time.sleep(30)
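For anyone else hitting this, a less drastic hack than deleting the loop entirely would be to bound the wait instead, something like this (untested sketch, reusing the method names from the snippet above):

import time

timeout = 60 * 60  # give up after an hour instead of spinning forever
start = time.time()
while not self.is_cluster_up(enforce_size):
    if time.time() - start > timeout:
        raise Exception("cluster failed to come up within %d seconds" % timeout)
    time.sleep(30)

That way a single wedged node fails the run with an error instead of hanging it indefinitely.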
I only did this because I knew the instances had been in the 'running' state for a while. StarCluster setup for large clusters is going to be tricky, especially on slow connections; the setup time could be 30-60 minutes or more, so the --no-create option is essential in this case. Is there a --skip-validation option? I think some of this could be remedied by parallelizing the ssh connections. Does paramiko support parallel/threaded ssh commands? That could be a good start.
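paramiko connections are blocking, but as far as I know each SSHClient can live in its own thread, so a first cut might look like this (rough, untested sketch; the user and key path are placeholders, and the node hostnames would come from wherever StarCluster keeps them):

import threading
import paramiko

def check_node(host, user, key_file, results):
    """Try to open an ssh connection to one node and record the outcome."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=user, key_filename=key_file, timeout=10)
        results[host] = True
    except Exception:
        results[host] = False
    finally:
        client.close()

def check_cluster(hosts, user='root', key_file='/path/to/key.rsa'):
    # One thread per node: 80 checks then take roughly as long as the
    # slowest single node instead of the sum of all 80.
    results = {}
    threads = [threading.Thread(target=check_node,
                                args=(h, user, key_file, results))
               for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

That alone would cut the 'waiting for ssh' phase down dramatically on an 80-node cluster.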
StarCluster still did a great job considering the cluster size! Just some things to think about for anyone planning to scale beyond ~64 nodes.
Thanks again,
Adam