[StarCluster] starcluster starts but not all nodes added as exec nodes

Kyeong Soo (Joseph) Kim kyeongsoo.kim at gmail.com
Mon Apr 11 12:13:03 EDT 2011


Justin,

Many thanks for all your hard work, and for this great news.

It looks like StarCluster has finally become a serious tool for
research based on large-scale computing, thanks mainly to you and
also to several others.

Certainly I will try this with 50+ nodes -- as soon as I finish coding
a new version of my simulation program -- and let you know the result.

Regards,
Joseph


On Wed, Apr 6, 2011 at 4:09 PM, Justin Riley <justin.t.riley at gmail.com> wrote:
> Jeff/Joseph,
>
> Sorry for taking so long to follow up on this, but I believe I've
> fixed this issue for good and you should now be able to launch 50+
> node clusters without issue. My original feeling was that the SGE
> install script was at fault; however, after several hours of digging I
> discovered that ssh-keyscan was failing when there were a large number
> of nodes. Long story short, this meant that passwordless SSH wasn't
> being set up fully for all nodes, so the SGE installer script could
> not connect to those nodes to add them to the queue. I found a much
> better way to populate the known_hosts file with all the nodes'
> host keys using paramiko instead of ssh-keyscan, which is much faster
> in this case.
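[A minimal sketch of the paramiko-based approach described above. The function names, the key-exchange flow, and the known_hosts path are illustrative assumptions, not StarCluster's actual code.]

```python
# Illustrative sketch only: fetch a remote host key with paramiko and
# format a known_hosts entry, instead of shelling out to ssh-keyscan.
# All names and paths below are assumptions, not StarCluster's code.
try:
    import paramiko  # third-party; pip install paramiko
except ImportError:
    paramiko = None

def known_hosts_line(host, key_type, key_b64):
    """Format one OpenSSH known_hosts entry."""
    return "%s %s %s" % (host, key_type, key_b64)

def fetch_host_key(hostname, port=22):
    """Return (key_type, base64_key) via a single SSH key exchange."""
    if paramiko is None:
        raise RuntimeError("paramiko is required for the key exchange")
    transport = paramiko.Transport((hostname, port))
    try:
        transport.start_client()  # negotiates and caches the server key
        key = transport.get_remote_server_key()
        return key.get_name(), key.get_base64()
    finally:
        transport.close()

def populate_known_hosts(hostnames, path):
    """Append one known_hosts entry per reachable host."""
    with open(path, "a") as fh:
        for host in hostnames:
            ktype, kdata = fetch_host_key(host)
            fh.write(known_hosts_line(host, ktype, kdata) + "\n")
```

Each key exchange is a single TCP round trip per host, which avoids ssh-keyscan's reported failures at high node counts.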
>
> If you haven't already, please re-run 'python setup.py install' after
> pulling the latest code to test out the changes. I've also updated
> StarCluster to perform the setup on all nodes concurrently using a
> thread pool, so you should notice it's much faster for larger
> clusters. Please let me know if you run into any issues.
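[The concurrency pattern described above can be sketched with a standard-library thread pool. The node names and the body of setup_node are placeholders for illustration, not StarCluster's actual implementation.]

```python
# Illustrative sketch: run per-node setup concurrently with a thread
# pool rather than one node at a time. setup_node's body is a stand-in
# for the real work (SSH in, install and start sge_execd, etc.).
from concurrent.futures import ThreadPoolExecutor

def setup_node(node):
    # placeholder for the real per-node configuration work
    return "%s: configured" % node

nodes = ["node%03d" % i for i in range(1, 26)]  # node001 .. node025

with ThreadPoolExecutor(max_workers=10) as pool:
    # map preserves input order, so results line up with `nodes`
    results = list(pool.map(setup_node, nodes))
```

Because most per-node time is spent waiting on the network, a thread pool gives near-linear speedup here even under the GIL.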
>
> Thanks,
>
> ~Justin
>
> On Wed, Mar 16, 2011 at 1:37 PM, Kyeong Soo (Joseph) Kim
> <kyeongsoo.kim at gmail.com> wrote:
>> Justin,
>> Please, find attached the said file.
>>
>> Regards,
>> Joseph
>>
>>
>> On Wed, Mar 16, 2011 at 4:38 PM, Justin Riley <jtriley at mit.edu> wrote:
>>>
>>> Joseph,
>>>
>>> Great, thanks. Can you also send me the /opt/sge6/ec2_sge.conf file, please?
>>>
>>> ~Justin
>>>
>>> On 03/16/2011 12:29 PM, Kyeong Soo (Joseph) Kim wrote:
>>>> Hi Justin,
>>>>
>>>> Please, find attached the gzipped tar file of the logfiles under
>>>> install_logs directory.
>>>>
>>>> Note that the configuration is for 25-node (1 master and 24 slaves) cluster.
>>>>
>>>> Below is the time-sorted listing of log files under the same directory:
>>>>
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 13:23
>>>> execd_install_node024_2011-03-16_13:23:11.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node023_2011-03-16_11:13:37.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node022_2011-03-16_11:13:36.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node021_2011-03-16_11:13:36.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node020_2011-03-16_11:13:32.log
>>>> -rw-r--r-- 1 kks kks  18K 2011-03-16 11:13
>>>> execd_install_master_2011-03-16_11:13:10.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node017_2011-03-16_11:13:27.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node018_2011-03-16_11:13:27.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node019_2011-03-16_11:13:28.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node016_2011-03-16_11:13:26.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node014_2011-03-16_11:13:25.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node015_2011-03-16_11:13:26.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node012_2011-03-16_11:13:24.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node013_2011-03-16_11:13:25.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node010_2011-03-16_11:13:23.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node011_2011-03-16_11:13:24.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node008_2011-03-16_11:13:22.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node009_2011-03-16_11:13:22.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node006_2011-03-16_11:13:21.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node007_2011-03-16_11:13:21.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node004_2011-03-16_11:13:20.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node005_2011-03-16_11:13:20.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node003_2011-03-16_11:13:19.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node001_2011-03-16_11:13:18.log
>>>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>>> execd_install_node002_2011-03-16_11:13:19.log
>>>> -rw-r--r-- 1 kks kks 3.1K 2011-03-16 11:13
>>>> execd_install_master_2011-03-16_11:13:17.log
>>>> -rw-r--r-- 1 kks kks 8.4K 2011-03-16 11:13
>>>> qmaster_install_master_2011-03-16_11:13:05.log
>>>>
>>>> As you can see, the installation on the master was duplicated, and
>>>> the run ended up with master and node001~node023; the topmost log,
>>>> for node024, is from the manual addition via the "addnode" command
>>>> later (i.e., 1 hour 10 minutes afterward).
>>>>
>>>> Even with this slimmed-down configuration (compared to the original
>>>> 125-node one), the chance that all nodes were properly installed
>>>> (i.e., 25 out of 25) was only about 50% (last night and this
>>>> morning, I tried about 10 times to set up a total of five 25-node
>>>> clusters).
>>>>
>>>> Regards,
>>>> Joseph
>>>>
>>>>
>>>> On Wed, Mar 16, 2011 at 3:57 PM, Justin Riley <jtriley at mit.edu> wrote:
>>>> Hi Jeff/Joseph,
>>>>
>>>> I just requested to up my EC2 instance limit so that I can test things
>>>> out at this scale and see what the issue is. In the mean time would you
>>>> mind sending me any logs found in /opt/sge6/default/common/install_logs
>>>> and also the /opt/sge6/ec2_sge.conf for a failed run?
>>>>
>>>> Also, if this happens again, you could try reinstalling SGE
>>>> manually, assuming all the nodes are up:
>>>>
>>>> $ starcluster sshmaster mycluster
>>>> $ cd /opt/sge6
>>>> $ ./inst_sge -m -x -auto ./ec2_sge.conf
>>>>
>>>> ~Justin
>>>>
>>>> On 03/15/2011 06:30 PM, Kyeong Soo (Joseph) Kim wrote:
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> I experienced the same thing with my 50-node configuration (c1.xlarge):
>>>>>>> out of 50 nodes, only 29 were successfully identified by SGE.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Sat, Mar 5, 2011 at 10:15 PM, Jeff White <jeff at decide.com> wrote:
>>>>>>>> I can frequently reproduce an issue where 'starcluster start' completes
>>>>>>>> without error, but not all nodes are added to the SGE pool, which I verify
>>>>>>>> by running 'qconf -sel' on the master. The latest example I have is creating
>>>>>>>> a 25-node cluster, where only the first 12 nodes are successfully installed.
>>>>>>>> The remaining instances are running and I can ssh to them but they aren't
>>>>>>>> running sge_execd. There are only install log files for the first 12 nodes
>>>>>>>> in /opt/sge6/default/common/install_logs. I have not found any clues in the
>>>>>>>> starcluster debug log or the logs inside master:/opt/sge6/.
>>>>>>>>
>>>>>>>> I am running starcluster development snapshot 8ef48a3 downloaded on
>>>>>>>> 2011-02-15, with the following relevant settings:
>>>>>>>>
>>>>>>>> NODE_IMAGE_ID=ami-8cf913e5
>>>>>>>> NODE_INSTANCE_TYPE = m1.small
>>>>>>>>
>>>>>>>> I have seen this behavior with the latest 32-bit and 64-bit starcluster
>>>>>>>> AMIs. Our workaround is to start a small cluster and progressively add nodes
>>>>>>>> one at a time, which is time-consuming.
>>>>>>>>
>>>>>>>> Has anyone else noticed this and have a better workaround or an idea for a
>>>>>>>> fix?
>>>>>>>>
>>>>>>>> jeff
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> StarCluster mailing list
>>>>>>>> StarCluster at mit.edu
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>>>>
>>>>>>>>
>>>>
>>>
>>>
>>
>>
>>
>



