[StarCluster] Starcluster stuck during setup

Cory Dolphin wcdolphin at gmail.com
Thu Mar 27 16:16:41 EDT 2014


Hi Nik,

I was not confused by the similarity of the /etc/hosts files, but rather by
the fact that they were already identical while the step was still running!

I will apply those patches and see if they improve performance. Anecdotally,
it seems that EC2 was just behaving poorly that day; I have had no trouble
starting a 50-node cluster today.

Thanks for your help,
Cory


On Tue, Mar 25, 2014 at 11:38 PM, Niklas Krumm <nkrumm at gmail.com> wrote:

> Hi Cory,
>
> I believe the reason all the /etc/hosts files are the same is that each
> node uses its /etc/hosts file to resolve hostnames across your cluster.
> For large clusters this poses a bit of a problem, as adding a single node
> requires updating the /etc/hosts file on every other node (although this
> is not strictly needed if the nodes don't need to communicate with each
> other). Furthermore, when adding several nodes, this process is repeated
> for each added node, so adding N nodes is roughly an O(N**2/2) operation.
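>
> To make the arithmetic explicit (a back-of-the-envelope count, assuming
> each addition rewrites the hosts file on every existing node plus the
> new one): the k-th node added costs about k file updates, so growing to
> N nodes one at a time costs
>
>     \sum_{k=1}^{N} k = \frac{N(N+1)}{2} \in O(N^2)
>
> hosts-file updates in total.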
>
> In any case, there are two modifications to the SC code that can help
> alleviate this. The first is @FinchPowers' PR #347 (
> https://github.com/jtriley/StarCluster/pull/347), which favors file copy
> over streaming and dramatically improves the speed of updating files.
> The second is one I proposed, which uses a DNS server running on the
> master to resolve hostnames. This basically removes the requirement to
> update the /etc/hosts file on all nodes when a new node is added;
> instead, the entry is only added to the master's /etc/hosts (and served
> up by dnsmasq). That PR is #321 (https://github.com/jtriley/StarCluster/pull/321).
> I use both of these and can add nodes to even very large (300+ node)
> clusters in constant time, independent of cluster size. I should mention
> there are some other O(N) or O(N**2/2) operations in updating the SGE
> environment; these are not strictly needed and can be turned off. I
> would be happy to discuss this or propose a PR with those modifications
> in the future.
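>
> Roughly, the dnsmasq idea looks like this (a minimal sketch only; the
> master IP is a made-up example, and PR #321 wires this up inside
> StarCluster rather than by hand):
>
>     # On the master: dnsmasq answers DNS queries from the local
>     # /etc/hosts by default, so a new node only needs an entry here.
>     sudo apt-get install -y dnsmasq
>
>     # On each worker: point the resolver at the master's private IP
>     # (10.0.0.1 is illustrative).
>     echo "nameserver 10.0.0.1" | sudo tee /etc/resolv.conf
>
> With this in place, adding a node touches only the master's /etc/hosts.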
>
> Good luck,
> Nik
>
>
> On Mar 25, 2014, at 6:03 PM, Cory Dolphin wrote:
>
> To follow up: after using the hack in the link, I still find that the
> cluster takes a LONG time to configure /etc/hosts. Any idea why this
> might be happening? Weirder yet, all of the nodes have an identical
> /etc/hosts file. I ran a loop ssh'ing into all 31 nodes (30 workers +
> the master) and cat'd the /etc/hosts file on each.
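>
> (A rough reconstruction of that loop, assuming StarCluster's default
> node0NN aliases and root key access from the master; md5sum is just a
> quick way to confirm the files are identical:)
>
>     for n in master node0{01..30}; do
>         ssh -o StrictHostKeyChecking=no root@$n 'md5sum /etc/hosts'
>     done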
>
>
> On Tue, Mar 25, 2014 at 8:11 PM, Cory Dolphin <wcdolphin at gmail.com> wrote:
>
>>
>> Whenever I try to add a node to a spot-instance cluster, StarCluster
>> does not properly wait for the spot request to be fulfilled, and
>> instead errors out:
>>
>> starcluster addnode mycluster
>> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.3)
>> Software Tools for Academics and Researchers (STAR)
>> Please submit bug reports to starcluster at mit.edu
>>
>> >>> Launching node(s): node030
>> SpotInstanceRequest:sir-85f44249
>> >>> Waiting for spot requests to propagate...
>> >>> Waiting for node(s) to come up... (updating every 30s)
>> >>> Waiting for all nodes to be in a 'running' state...
>> 30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Waiting for SSH to come up on all nodes...
>> 30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> >>> Waiting for cluster to come up took 1.179 mins
>> !!! ERROR - node 'node030' does not exist
>>
>>
>> Once the spot instance request is fulfilled, the instance does not have a
>> name. It looks like someone else had this problem quite recently:
>> http://star.mit.edu/cluster/mlarchives/2058.html
>> I wonder what the difference between our setup and yours is?
>>
>>
>> On Tue, Mar 25, 2014 at 7:42 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>>> If you really have a slow connection, you may consider bootstrapping
>>> StarCluster on AWS, i.e., configure an m1.small (or even t1.micro)
>>> instance and install StarCluster on that node. In fact, there's a
>>> CloudFormation template for that:
>>>
>>> http://aws.typepad.com/aws/2012/06/ec2-spot-instance-updates-auto-scaling-and-cloudformation-integration-new-sample-app-1.html
>>>
>>> On the other hand, it's far easier to do it by hand: just launch an
>>> instance from the standard Ubuntu AMI and then install StarCluster on
>>> that instance.
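>>>
>>> Roughly, the by-hand route looks like this (a sketch; package names
>>> are per that era's Ubuntu and may vary on newer releases):
>>>
>>>     sudo apt-get update && sudo apt-get install -y python-pip
>>>     sudo pip install StarCluster
>>>     # with no config present, 'starcluster help' offers to write a
>>>     # template config to ~/.starcluster/config
>>>     starcluster help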
>>>
>>> And as others have mentioned, most large StarClusters are launched by
>>> first starting a small cluster and then growing it dynamically. You
>>> should be able to run the addnode command from your qmaster node,
>>> provided you have StarCluster set up there (note that your AWS key
>>> will be on the EC2 instance, so this is slightly riskier if security
>>> is the main concern).
>>>
>>> Rayson
>>>
>>> ==================================================
>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>> http://gridscheduler.sourceforge.net/
>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>
>>>
>>> On Tue, Mar 25, 2014 at 8:04 AM, Butson, Christopher <cbutson at mcw.edu>
>>> wrote:
>>> > Interesting: I let it go and it eventually continued, but it took over
>>> > an hour to get through "Configuring passwordless ssh for root". Still
>>> > waiting for the cluster to finish starting up...
>>> >
>>> > Christopher R. Butson, Ph.D.
>>> > Associate Professor
>>> > Biotechnology & Bioengineering Center
>>> > Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine
>>> > Medical College of Wisconsin
>>> > (414) 955-2678
>>> > cbutson at mcw.edu
>>> >
>>> >
>>> > From: Christopher Butson <cbutson at mcw.edu>
>>> > Date: Tuesday, March 25, 2014 12:13 PM
>>> > To: "starcluster at mit.edu" <starcluster at mit.edu>
>>> > Subject: Starcluster stuck during setup
>>> >
>>> > I'm on a slow internet connection overseas, trying to initiate a
>>> > cluster using StarCluster. Once I type "starcluster start mycluster"
>>> > everything seems to go ok but it gets stuck at the following point
>>> > and never seems to get past it:
>>> >
>>> > >>> Mounting all NFS export path(s) on 79 worker node(s)
>>> > 79/79 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> > >>> Setting up NFS took 2.777 mins
>>> > >>> Configuring passwordless ssh for root
>>> >
>>> > Any idea why this might occur? Thanks,
>>> > Chris
>>> >
>>> > Christopher R. Butson, Ph.D.
>>> > Associate Professor
>>> > Biotechnology & Bioengineering Center
>>> > Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine
>>> > Medical College of Wisconsin
>>> > (414) 955-2678
>>> > cbutson at mcw.edu
>>> >
>>> >
>>>
>>
>>