Hi Joseph/Justin,

I did have similar problems starting larger clusters. There were actually two issues at play.

1) EC2 instances occasionally won't come up with ssh *ever*. In that case you have to reboot the instance and it should then work. This could be something like 1 in 100 instances, or just an anomaly, but it's worth noting. The workaround I used was to manually verify that all nodes are running and that port 22 is open, then run "starcluster start -x".
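Just to illustrate the kind of check I mean (a rough sketch, not anything StarCluster ships; the region, the "@sc-mycluster" group name, and the use of boto here are my own assumptions):

    # Sketch: confirm every instance in the cluster's security group is
    # "running" and accepting TCP connections on port 22 before re-running
    # "starcluster start -x". Group name and region are placeholders.
    import socket
    import boto.ec2

    def ssh_port_open(host, timeout=5):
        """Return True if a TCP connection to port 22 on host succeeds."""
        try:
            socket.create_connection((host, 22), timeout).close()
            return True
        except (socket.error, socket.timeout):
            return False

    conn = boto.ec2.connect_to_region("us-east-1")
    for res in conn.get_all_instances(filters={"group-name": "@sc-mycluster"}):
        for inst in res.instances:
            ok = inst.state == "running" and ssh_port_open(inst.public_dns_name)
            print("%s %s ssh %s" % (inst.id, inst.state,
                                    "ok" if ok else "NOT reachable"))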
2) StarCluster runs all ssh commands in a single process. There's overhead for each ssh call, so this can end up taking a really long time (30-60 minutes) in some cases. The solution, I think, is for StarCluster to push the ssh commands onto a queue and use multiple processes/threads to run them.
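Just to sketch what I mean (this is NOT how StarCluster is currently structured; the node list and the ssh command below are placeholders):

    # Minimal sketch of the queue + worker idea: N threads each pull a host
    # off the queue and run its ssh command, instead of one process walking
    # all 125 nodes sequentially.
    import subprocess
    import threading
    from Queue import Queue  # "queue" on Python 3

    NUM_WORKERS = 16
    nodes = ["node%03d" % i for i in range(125)]  # hypothetical hostnames
    work = Queue()

    def worker():
        while True:
            host = work.get()
            try:
                subprocess.call(["ssh", host, "hostname"])  # placeholder command
            finally:
                work.task_done()

    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    for host in nodes:
        work.put(host)
    work.join()  # block until every node has been handled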
I'm happy to invest some time profiling this (a rough starting point is sketched below) or adding parallel ssh support if there's interest. The latter may be best left to the pythonistas, though ;)

I think that might also explain the "partial" SGE installs. The effort to move away from the SGE installer is a good one as well; that installer has always been a bit of a "fingers crossed" step for me.
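For the profiling part, one way I'd probably start (the output path and cluster name are just examples, not an agreed plan) is to run starcluster under cProfile and then look at the cumulative times:

    # python -m cProfile -o start.prof /path/to/starcluster start mycluster
    import pstats
    stats = pstats.Stats("start.prof")
    stats.sort_stats("cumulative").print_stats(20)  # top 20 hotspots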
Best,
Adam

On Tue, Mar 15, 2011 at 6:29 PM, Justin Riley <jtriley@mit.edu> wrote:
Hi Joseph,

Have a look here for Adam Kraut's account of launching an 80-node
StarCluster:

http://mailman.mit.edu/pipermail/starcluster/2010-December/000552.html

It's *definitely* a matter of a huge delay involved in setting up a
large number of nodes, so be patient. This is because StarCluster was
originally intended for 10-20 nodes and uses a fairly naive approach of
waiting for all nodes to come up (including ssh) and then setting up
each node one by one. I have several ideas on how to substantially speed
this process up in the future, but for now I'm concentrating on getting
a release out within the next couple of weeks.

However, after the release I'd be interested in profiling StarCluster
during such a large run and seeing in detail where the slowness comes
from. I'm definitely interested in getting StarCluster to a point where
it can launch large clusters in a reasonable amount of time. As it
stands now, I need to drastically raise my instance limit in order to
test and debug this properly. Would you be interested in profiling one
of your large runs and sending me the output some time after the next
release?

~Justin

On 03/15/2011 06:03 PM, Kyeong Soo (Joseph) Kim wrote:
> Hi Austin,
>
> Yes, I requested to increase the limit to 300 (got a confirmation, of
> course) and am now successfully running a 50-node cluster (it took 32
> mins, BTW).
> I now wonder if it is simply a matter of the huge delay involved in
> setting up such a large number of nodes.
>
> Regards,
> Joseph
>
>
> On Tue, Mar 15, 2011 at 9:53 PM, Austin Godber <godber@uberhip.com> wrote:
>> Does it work at 20 and fail at 21? I think Amazon still has a
>> 20-instance limit, which you can request that they raise. Have you
>> done that?
>>
>> http://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run_in_Amazon_EC2
>>
>> Austin
>>
>> On 03/15/2011 05:29 PM, Kyeong Soo (Joseph) Kim wrote:
>>> Hi Justin and All,
>>>
>>> This is to report a failure in launching a large cluster with 125
>>> nodes (c1.xlarge).
>>>
>>> I tried to launch the said cluster twice, but starcluster hung (for
>>> hours) at the following steps:
>>>
>>> .....
>>>
>>>>>> Launching node121 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Launching node122 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Launching node123 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Launching node124 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Creating security group @sc-hnrlcluster...
>>> Reservation:r-7c264911
>>>>>> Waiting for cluster to come up... (updating every 30s)
>>>>>> Waiting for all nodes to be in a 'running' state...
>>> 125/125 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>>>>> Waiting for SSH to come up on all nodes...
>>> 125/125 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>>>>> The master node is ec2-75-101-230-197.compute-1.amazonaws.com
>>>>>> Setting up the cluster...
>>>>>> Attaching volume vol-467ecc2e to master node on /dev/sdz ...
>>>>>> Configuring hostnames...
>>>>>> Mounting EBS volume vol-467ecc2e on /home...
>>>>>> Creating cluster user: kks (uid: 1001, gid: 1001)
>>>>>> Configuring scratch space for user: kks
>>>>>> Configuring /etc/hosts on each node
>>>
>>> I have succeeded with this configuration for up to 15 nodes so far.
>>>
>>> Any idea?
>>>
>>> With Regards,
>>> Joseph
>>> --
>>> Kyeong Soo (Joseph) Kim, Ph.D.
>>> Senior Lecturer in Networking
>>> Room 112, Digital Technium
>>> Multidisciplinary Nanotechnology Centre, College of Engineering
>>> Swansea University, Singleton Park, Swansea SA2 8PP, Wales UK
>>> TEL: +44 (0)1792 602024
>>> EMAIL: k.s.kim_at_swansea.ac.uk
>>> HOME: http://iat-hnrl.swan.ac.uk/ (group)
>>>       http://iat-hnrl.swan.ac.uk/~kks/ (personal)
>>>
_______________________________________________
StarCluster mailing list
StarCluster@mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster