[StarCluster] Issue creating a cluster of 30 nodes with starcluster

Paolo Di Tommaso Paolo.DiTommaso at crg.eu
Wed Nov 9 04:43:38 EST 2011


Hi Sumita,

EBS-backed instances are faster to boot by design. See here: http://goo.gl/LOqgb

I had a problem similar to yours and solved it by updating to the latest version and using EBS-backed instances.

About your question: I'm not a StarCluster developer, so I cannot say for sure, but I think all nodes need to be up because configuring the cluster requires knowing the IP addresses of all nodes, and every node needs to know the IPs of the others (to set up /etc/hosts, for example).
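
To illustrate the /etc/hosts point, here is a minimal Python sketch. It is not StarCluster's actual code: the node aliases, the IP addresses and the `render_hosts` helper are all made up for illustration. It only shows why a complete hosts file cannot be written until every node's address is known.

```python
# Hypothetical node table: alias -> private IP (values invented for illustration).
nodes = {
    "master": "10.0.0.10",
    "node001": "10.0.0.11",
    "node002": "10.0.0.12",
}

def render_hosts(nodes):
    """Return /etc/hosts-style text with one line per node in the cluster."""
    lines = ["127.0.0.1 localhost"]
    for alias, ip in sorted(nodes.items()):
        lines.append(f"{ip} {alias}")
    return "\n".join(lines) + "\n"

# The same file would be copied to every node, so each one can resolve the others.
hosts_file = render_hosts(nodes)
print(hosts_file)
```

If a node is still 'pending' it has no address yet, so its line is missing and the other nodes cannot resolve it by name.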

That said, in my opinion the current bottleneck with StarCluster is that it configures nodes serially, one after another (or at least so it seems from the benchmark results), and this creates a huge problem if you want to use it to deploy large clusters (50 or more nodes).
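
The difference between serial and parallel configuration can be sketched in Python as follows. This is an assumption-laden toy, not a description of how StarCluster works internally: `configure_node` is a hypothetical placeholder that just sleeps to simulate per-node SSH work, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def configure_node(alias):
    """Hypothetical per-node setup; the sleep stands in for remote SSH work."""
    time.sleep(0.02)
    return alias

def configure_serial(aliases):
    # One node after another: total time grows linearly with cluster size.
    return [configure_node(a) for a in aliases]

def configure_parallel(aliases, workers=10):
    # Up to `workers` nodes are configured at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(configure_node, aliases))

aliases = [f"node{i:03d}" for i in range(1, 21)]

start = time.perf_counter()
serial_result = configure_serial(aliases)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel_result = configure_parallel(aliases)
parallel_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Threads fit here because the simulated work is waiting on I/O rather than using the CPU; real node setup over SSH would have the same character.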


Cheers,
Paolo



On Nov 9, 2011, at 1:00 AM, Sumita Sinha wrote:

Thanks for the quick response.


I tried with instance-store instances. Is there any reason that EBS-backed instances take less time to boot?

I tried creating the cluster again with 30 nodes; this time it completed successfully in 14 min.

When a cluster create request is sent, I see these messages on the terminal:
>>> Waiting for all nodes to be in a 'running' state...
>>> Waiting for SSH to come up on all nodes...
>>> Setting up the cluster...
>>> Configuring hostnames...
>>> Creating cluster user: sgeadmin (uid: 1001, gid: 1001)

So when a node is up and running in EC2, does StarCluster wait for all the nodes to be up and then start configuring them all at once?
Is there any parameter in the config file, or any option to the starcluster start command, that makes configuring the cluster (installing SGE, configuring NFS) a parallel operation? That is, a node should not have to wait for the other nodes to be up before being configured, so that if we submit a job on a ready node, it starts executing with however many nodes are running and configured.

If the above is not possible, is there any specific reason why StarCluster configures the nodes only when all of them are running?
If anything bad happens at the EC2 level and some of the nodes take a long time to start, is there any fault-tolerance technique or timeout?

Regards
Sumita




On Tue, Nov 8, 2011 at 7:55 PM, Paolo Di Tommaso <Paolo.DiTommaso at crg.eu> wrote:
Are you using instance-store instances or EBS-backed instances?

The latter are much faster to boot.


Cheers,
Paolo



On Nov 8, 2011, at 3:12 PM, Justin Riley wrote:


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Sumita,

Were you using spot instances? If not, I believe there's a default limit of 20 flat-rate instances, which *could* be related to your issue. With spot instances you can create up to 100 instances by default. So, if you need more than 20 nodes and do not wish to submit a request to Amazon to increase your flat-rate instance limit, you should be using spot instances:

$ starcluster start -s 30 -b 0.50 mycluster

That said, StarCluster has no limit on the number of nodes you can create. However, as you've seen, EC2 instances can sometimes take longer than usual to become 'running'. Unfortunately this is purely an EC2 back-end issue that cannot be resolved directly by StarCluster. In my experience 22 minutes *is* quite a while to wait for any instance to come up; however, I have had instances take up to 15 min in the past, so this is not a total surprise to me.

In the future, if you run into this problem of waiting too long (e.g. 15min+) for an instance to change from 'pending' to 'running', I would recommend simply terminating the faulty instance from the AWS console and then restarting the cluster using:

$ starcluster restart mycluster

This should reboot all the currently running instances and begin configuring the cluster, which avoids having to terminate the entire cluster and lose instance hours.
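
The "give up after roughly 15 minutes" rule of thumb can be sketched as a polling loop in Python. This is only an illustration under stated assumptions: `find_stuck_nodes` and the `get_state` callable are hypothetical names, not part of StarCluster or any AWS library, and the state lookup below is a fake dictionary rather than a real EC2 query.

```python
import time

def find_stuck_nodes(node_ids, get_state, timeout=900, poll_interval=30,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until every node reports 'running' or the timeout expires.

    Returns the ids that are still not 'running' (an empty list if all came up).
    `get_state` is a caller-supplied function mapping a node id to its state.
    """
    deadline = clock() + timeout
    pending = list(node_ids)
    while True:
        pending = [n for n in pending if get_state(n) != "running"]
        if not pending or clock() >= deadline:
            return pending
        sleep(poll_interval)

# Fake state lookup for illustration: node025 never leaves 'pending'.
states = {"node001": "running", "node002": "running", "node025": "pending"}
stuck = find_stuck_nodes(states, states.get, timeout=0.2, poll_interval=0.05)
print(stuck)
```

Any ids returned would be the candidates for manual termination before restarting the cluster; the short timeout in the example is only there so that it finishes quickly.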

HTH,

~Justin

On 11/8/11 6:39 AM, Sumita Sinha wrote:
> Hello ,
>
> Currently working with StarCluster on EC2.
>
> Tried creating a cluster with 30 nodes of type m1.small using AMI ami-8cf913e5.
> Cluster creation never completed, as I found that one node, node025, was showing pending status.
> I waited for almost 22 minutes, then terminated the cluster.
> The cluster was terminated properly. Is there any limit to the creation of nodes?
>
>
>
>
> --
> Regards
> Sumita Sinha
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk65OL4ACgkQ4llAkMfDcrm9MACghU/Ey4v653fsD8XmpbQKONNp
vdkAniIfFExWjqGAOWRolMrtePHfl4AL
=Q8NI
-----END PGP SIGNATURE-----

_______________________________________________
StarCluster mailing list
StarCluster at mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster




--
Regards
Sumita Sinha


