[StarCluster] starcluster starts but not all nodes added as exec nodes

Wed Apr 6 13:49:31 EDT 2011

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jeff,

After pulling the latest code, did you reinstall with "python setup.py
install"? If so, which version of boto do you have installed? You can
check with:

% python -c 'import boto; print boto.Version'
2.0b4

The version should be 2.0b4 as above.
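
If it helps, the same check can be scripted. This is only a rough sketch and
not part of StarCluster: the helper names are made up for illustration, and
the only boto attribute it relies on is boto.Version, the same one used in
the one-liner above. The required version is the 2.0b4 mentioned here.

```python
REQUIRED_BOTO = "2.0b4"  # version named in this message


def boto_version_ok(found, required=REQUIRED_BOTO):
    """Return True if the reported boto version string matches exactly."""
    return found is not None and found.strip() == required


def installed_boto_version():
    """Return boto.Version, or None if boto cannot be imported."""
    try:
        import boto
    except ImportError:
        return None
    return getattr(boto, "Version", None)


if __name__ == "__main__":
    found = installed_boto_version()
    if boto_version_ok(found):
        print("boto %s matches the expected version" % found)
    else:
        print("boto is %r, expected %s" % (found, REQUIRED_BOTO))
```

An exact string match is deliberately strict; pre-release tags such as "b4"
do not compare reliably with a naive numeric comparison.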

~Justin

On 04/06/2011 01:04 PM, Jeff White wrote:
> Hi Justin,
> 
> Thanks much for your effort on this. I got this error upon running
> 'starcluster -s 25 start jswtest'. I have not altered my config file
> from the one I sent you previously.
> 
> PID: 5530 config.py:515 - DEBUG - Loading config
> PID: 5530 config.py:108 - DEBUG - Loading file:
> /home/jsw/.starcluster/config
> PID: 5530 config.py:515 - DEBUG - Loading config
> PID: 5530 config.py:108 - DEBUG - Loading file:
> /home/jsw/.starcluster/config
> PID: 5530 awsutils.py:54 - DEBUG - creating self._conn w/
> connection_authenticator kwargs = {'path': '/', 'region': None, 'port':
> None, 'is_secure': True}
> PID: 5530 start.py:167 - INFO - Using default cluster template: smallcluster
> PID: 5530 cluster.py:1310 - INFO - Validating cluster template settings...
> PID: 5530 cli.py:184 - DEBUG - Traceback (most recent call last):
>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cli.py", line
> 160, in main
>     sc.execute(args)
>   File
> "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/commands/start.py",
> line 175, in execute
>     scluster._validate(validate_running=validate_running)
>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py",
> line 1322, in _validate
>     self._validate_instance_types()
>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py",
> line 1458, in _validate_instance_types
>     self.__check_platform(node_image_id, node_instance_type)
>   File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py",
> line 1419, in __check_platform
>     image_is_hvm = (image.virtualization_type == "hvm")
> AttributeError: 'Image' object has no attribute 'virtualization_type'
> 
> PID: 5530 cli.py:129 - ERROR - Oops! Looks like you've found a bug in
> StarCluster
> PID: 5530 cli.py:130 - ERROR - Debug file written to:
> /tmp/starcluster-debug-jsw.log
> PID: 5530 cli.py:131 - ERROR - Look for lines starting with PID: 5530
> PID: 5530 cli.py:132 - ERROR - Please submit this file, minus any
> private information,
> PID: 5530 cli.py:133 - ERROR - to starcluster at mit.edu
> 
> 
> 
> On Wed, Apr 6, 2011 at 8:09 AM, Justin Riley <justin.t.riley at gmail.com> wrote:
> 
>     Jeff/Joseph,
> 
>     Sorry for taking so long to follow up with this but I believe I've
>     fixed this issue for good and you should now be able to launch 50+
>     node clusters without issue. My original feeling was that the SGE
>     install script was at fault, however, after several hours of digging I
>     discovered that ssh-keyscan was failing when there were a large number
>     of nodes. Long story short, this meant that passwordless SSH wasn't
>     being set up fully for all nodes, so the SGE installer script could
>     not connect to those nodes to add them to the queue. I found a much
>     better way to populate the known_hosts file with all the nodes, using
>     paramiko instead of ssh-keyscan, which is much faster in this case.
> 
>     If you haven't already, please re-run 'python setup.py install' after
>     pulling the latest code to test out the latest changes. I've also
>     updated StarCluster to perform the setup on all nodes concurrently using
>     a thread pool, so you should notice it's much faster for larger
>     clusters. Please let me know if you have issues.
> 
>     Thanks,
> 
>     ~Justin
> 
>     On Wed, Mar 16, 2011 at 1:37 PM, Kyeong Soo (Joseph) Kim
>     <kyeongsoo.kim at gmail.com> wrote:
>     > Justin,
>     > Please, find attached the said file.
>     >
>     > Regards,
>     > Joseph
>     >
>     >
>     > On Wed, Mar 16, 2011 at 4:38 PM, Justin Riley <jtriley at mit.edu> wrote:
>     >> Joseph,
>     >>
>     >> Great thanks, can you also send me the /opt/sge6/ec2_sge.conf
>     >> file please?
>     >>
>     >> ~Justin
>     >>
>     >> On 03/16/2011 12:29 PM, Kyeong Soo (Joseph) Kim wrote:
>>     >>> Hi Justin,
>>     >>>
>>     >>> Please, find attached the gzipped tar file of the logfiles under
>>     >>> install_logs directory.
>>     >>>
>>     >>> Note that the configuration is for a 25-node cluster (1 master and 24
>>     >>> slaves).
>>     >>>
>>     >>> Below is the time-sorted listing of log files under the same directory:
>>     >>>
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 13:23
>>     >>> execd_install_node024_2011-03-16_13:23:11.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node023_2011-03-16_11:13:37.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node022_2011-03-16_11:13:36.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node021_2011-03-16_11:13:36.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node020_2011-03-16_11:13:32.log
>>     >>> -rw-r--r-- 1 kks kks  18K 2011-03-16 11:13
>>     >>> execd_install_master_2011-03-16_11:13:10.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node017_2011-03-16_11:13:27.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node018_2011-03-16_11:13:27.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node019_2011-03-16_11:13:28.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node016_2011-03-16_11:13:26.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node014_2011-03-16_11:13:25.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node015_2011-03-16_11:13:26.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node012_2011-03-16_11:13:24.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node013_2011-03-16_11:13:25.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node010_2011-03-16_11:13:23.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node011_2011-03-16_11:13:24.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node008_2011-03-16_11:13:22.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node009_2011-03-16_11:13:22.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node006_2011-03-16_11:13:21.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node007_2011-03-16_11:13:21.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node004_2011-03-16_11:13:20.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node005_2011-03-16_11:13:20.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node003_2011-03-16_11:13:19.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node001_2011-03-16_11:13:18.log
>>     >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13
>>     >>> execd_install_node002_2011-03-16_11:13:19.log
>>     >>> -rw-r--r-- 1 kks kks 3.1K 2011-03-16 11:13
>>     >>> execd_install_master_2011-03-16_11:13:17.log
>>     >>> -rw-r--r-- 1 kks kks 8.4K 2011-03-16 11:13
>>     >>> qmaster_install_master_2011-03-16_11:13:05.log
>>     >>>
>>     >>> As you can see, the installation of master was duplicated, and it
>>     >>> ended up with master and node001~node023; the top-most log, for node024,
>>     >>> is from the manual addition through the "addnode" command later (about
>>     >>> 2 hours 10 minutes after, going by the timestamps).
>>     >>>
>>     >>> Even with this slimmed-down configuration (compared to the
>>     >>> original 125-node one), the chance that all nodes were properly
>>     >>> installed (i.e., 25 out of 25) was about 50% (last night and this
>>     >>> morning, I tried about 10 times to set up a total of five 25-node
>>     >>> clusters).
>>     >>>
>>     >>> Regards,
>>     >>> Joseph
>>     >>>
>>     >>>
>>     >>> On Wed, Mar 16, 2011 at 3:57 PM, Justin Riley <jtriley at mit.edu> wrote:
>>     >>> Hi Jeff/Joseph,
>>     >>>
>>     >>> I just requested to up my EC2 instance limit so that I can test things
>>     >>> out at this scale and see what the issue is. In the meantime, would you
>>     >>> mind sending me any logs found in /opt/sge6/default/common/install_logs
>>     >>> and also the /opt/sge6/ec2_sge.conf for a failed run?
>>     >>>
>>     >>> Also if this happens again you could try reinstalling SGE manually
>>     >>> assuming all the nodes are up:
>>     >>>
>>     >>> $ starcluster sshmaster mycluster
>>     >>> $ cd /opt/sge6
>>     >>> $ ./inst_sge -m -x -auto ./ec2_sge.conf
>>     >>>
>>     >>> ~Justin
>>     >>>
>>     >>> On 03/15/2011 06:30 PM, Kyeong Soo (Joseph) Kim wrote:
>>     >>>>>> Hi Jeff,
>>     >>>>>>
>>     >>>>>> I experienced the same thing with my 50-node configuration (c1.xlarge).
>>     >>>>>> Out of 50 nodes, only 29 were successfully identified by SGE.
>>     >>>>>>
>>     >>>>>> Regards,
>>     >>>>>> Joseph
>>     >>>>>>
>>     >>>>>> On Sat, Mar 5, 2011 at 10:15 PM, Jeff White <jeff at decide.com> wrote:
>>     >>>>>>> I can frequently reproduce an issue where 'starcluster start' completes
>>     >>>>>>> without error, but not all nodes are added to the SGE pool, which I verify
>>     >>>>>>> by running 'qconf -sel' on the master. The latest example I have is creating
>>     >>>>>>> a 25-node cluster, where only the first 12 nodes are successfully installed.
>>     >>>>>>> The remaining instances are running and I can ssh to them but they aren't
>>     >>>>>>> running sge_execd. There are only install log files for the first 12 nodes
>>     >>>>>>> in /opt/sge6/default/common/install_logs. I have not found any clues in the
>>     >>>>>>> starcluster debug log or the logs inside master:/opt/sge6/.
>>     >>>>>>>
>>     >>>>>>> I am running starcluster development snapshot 8ef48a3 downloaded on
>>     >>>>>>> 2011-02-15, with the following relevant settings:
>>     >>>>>>>
>>     >>>>>>> NODE_IMAGE_ID=ami-8cf913e5
>>     >>>>>>> NODE_INSTANCE_TYPE = m1.small
>>     >>>>>>>
>>     >>>>>>> I have seen this behavior with the latest 32-bit and 64-bit starcluster
>>     >>>>>>> AMIs. Our workaround is to start a small cluster and progressively add nodes
>>     >>>>>>> one at a time, which is time-consuming.
>>     >>>>>>>
>>     >>>>>>> Has anyone else noticed this and have a better workaround or an idea for a
>>     >>>>>>> fix?
>>     >>>>>>>
>>     >>>>>>> jeff
>>     >>>>>>>
>>     >>>>>>>
>>     >>>>>>> _______________________________________________
>>     >>>>>>> StarCluster mailing list
>>     >>>>>>> StarCluster at mit.edu <mailto:StarCluster at mit.edu>
>>     >>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>     >>>>>>>
>>     >>>>>>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2cp6sACgkQ4llAkMfDcrkEPQCeKo3XQ8ilWlv89E76NTReqBaz
k68AoIur9985wTYnBKP4+cnKkKwMyL9i
=QBz+
-----END PGP SIGNATURE-----