[StarCluster] force starcluster run

Adam Kraut adamnkraut at gmail.com
Mon Dec 6 14:00:31 EST 2010


Hi Justin,

Thanks for the tips.

On Dec 5, 2010, at 8:58 PM, Justin Riley wrote:

> Hi Adam,
> 
>> StarCluster rocks! Great job Justin et al.
> 
> Thanks a lot, glad you like it :D
> 
>> I was using starcluster (v. 0.9999) to start an 80-node spot instance cluster recently and ran into an issue.
>> 
>> starcluster start -b 0.10 -s 80 SpotCluster
> 
> Wow, OK, I haven't tried with that many nodes before but it should work. 
> Please be patient with the setup, I'd imagine this will take some time 
> given the size of the cluster.

Yeah.  At a cluster size of 80 it just takes a long time.  Looking at the debug output, it's the sheer number of remote ssh commands that slows the setup down so much.

> 
>> It took a few minutes for the spot requests to open and the instances to be running.  StarCluster was still waiting on instances to come up, so I ran the start command with --no-create:
>> 
>> starcluster start --no-create -s 80 SpotCluster
>> starcluster start --no-create SpotCluster
>> 
>> I can verify with the AWS console and the output of 'starcluster listclusters' that all 80 instances are up and running.  Is there a way to force starcluster to run the install?  Is starcluster checking something other than ec2-describe-instances, like ssh, to see if a node is up?
>> 
>> Not sure if this is due to the cluster size, spot instances, or just an anomaly like one node not starting sshd.
> 
> StarCluster checks that there are CLUSTER_SIZE nodes in a 'running' 
> state and whether ssh is up on all the 'running' nodes in the cluster 
> when it is 'Waiting for cluster to start'. This is the reason why 
> StarCluster is still waiting even though you see all instances 'running' 
> in ec2-describe-instances; ssh is likely not up yet for *all* instances 
> even though they're all in a 'running' state. There really can't be a 
> 'force install' because StarCluster has to be able to connect to all 
> nodes in the cluster via ssh before it can do anything with the instances.
> 
> With that said, I'd also expect this process of checking ssh on all the 
> nodes to take some time, so if you're not patient you may not end up 
> giving StarCluster enough time to make connections to all 80 nodes. How 
> long did you wait for StarCluster before canceling the run?
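
Makes sense, the check has to confirm ssh is actually up, not just the EC2 instance state.  I'm guessing the per-node check essentially boils down to probing port 22 until it accepts a connection.  A rough sketch of that kind of probe (just an illustration, not StarCluster's actual code):

import socket

def ssh_port_open(host, port=22, timeout=5.0):
    # True once sshd (or anything) accepts TCP connections on the port
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.timeout, socket.error):
        return False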

I waited at least 30 minutes before killing it the first time, and I eventually got it running.  A couple of things worth noting here.  One of the 80 nodes just wasn't responding on port 22 and the instance required a reboot.  That's an EC2 issue, but I've run into it on smaller clusters too, so it's good that starcluster checks.  The time spent validating instances was still a problem though, and I eventually commented out lines 1162-1163 in cluster.py:

# while not self.is_cluster_up(enforce_size):
#     time.sleep(30)

I only did this since I knew the instances had been in the 'running' state for a while.  StarCluster setup for large clusters is going to be tricky, especially on slow connections; the setup time could be 30-60 minutes or more.  The --no-create option is essential in this case.  Is there a --skip-validation option?  I think some of this could be remedied by parallelizing the ssh connections.  Does paramiko support parallel/threaded ssh commands?  That could be a good start.
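
To sketch the parallel idea: paramiko should be fine to drive from multiple threads as long as each thread uses its own SSHClient, so the readiness checks could fan out over a thread pool.  Something like this (hostnames, username, and key path are all made up):

import paramiko
from multiprocessing.pool import ThreadPool

def ssh_ready(host):
    # Attempt a full ssh handshake; one SSHClient per call so threads
    # never share connection state
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username='root',
                       key_filename='/path/to/mykey.rsa', timeout=10)
        return True
    except Exception:
        return False
    finally:
        client.close()

hosts = ['node001', 'node002', 'node003']  # hypothetical node names
pool = ThreadPool(20)  # 20 connection attempts in flight at once
still_down = [h for h, ok in zip(hosts, pool.map(ssh_ready, hosts)) if not ok]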

StarCluster still did a great job considering the cluster size! Just some things to think about for anyone planning to scale beyond ~64 nodes.

Thanks again,
Adam