[StarCluster] starcluster starts but not all nodes added as exec nodes
Jeff White
jeff at decide.com
Wed Apr 6 13:04:49 EDT 2011
Hi Justin,
Thanks much for your effort on this. I got this error upon running
'starcluster -s 25 start jswtest'. I have not altered my config file from
the one I sent you previously.
PID: 5530 config.py:515 - DEBUG - Loading config
PID: 5530 config.py:108 - DEBUG - Loading file: /home/jsw/.starcluster/config
PID: 5530 config.py:515 - DEBUG - Loading config
PID: 5530 config.py:108 - DEBUG - Loading file: /home/jsw/.starcluster/config
PID: 5530 awsutils.py:54 - DEBUG - creating self._conn w/ connection_authenticator kwargs = {'path': '/', 'region': None, 'port': None, 'is_secure': True}
PID: 5530 start.py:167 - INFO - Using default cluster template: smallcluster
PID: 5530 cluster.py:1310 - INFO - Validating cluster template settings...
PID: 5530 cli.py:184 - DEBUG - Traceback (most recent call last):
  File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cli.py", line 160, in main
    sc.execute(args)
  File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/commands/start.py", line 175, in execute
    scluster._validate(validate_running=validate_running)
  File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py", line 1322, in _validate
    self._validate_instance_types()
  File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py", line 1458, in _validate_instance_types
    self.__check_platform(node_image_id, node_instance_type)
  File "/home/jsw/jtriley-StarCluster-dfba6ef/starcluster/cluster.py", line 1419, in __check_platform
    image_is_hvm = (image.virtualization_type == "hvm")
AttributeError: 'Image' object has no attribute 'virtualization_type'
PID: 5530 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 5530 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-jsw.log
PID: 5530 cli.py:131 - ERROR - Look for lines starting with PID: 5530
PID: 5530 cli.py:132 - ERROR - Please submit this file, minus any private information,
PID: 5530 cli.py:133 - ERROR - to starcluster at mit.edu
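
In case it helps: the traceback suggests the Image object returned by boto here
simply doesn't expose virtualization_type. A defensive check along the following
lines (just a sketch, not the actual StarCluster code) would at least avoid the
AttributeError:

    def image_is_hvm(image):
        # Older boto Image objects may not carry virtualization_type at all;
        # treat a missing attribute as "not HVM" instead of crashing.
        return getattr(image, 'virtualization_type', None) == 'hvm'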
On Wed, Apr 6, 2011 at 8:09 AM, Justin Riley <justin.t.riley at gmail.com> wrote:
> Jeff/Joseph,
>
> Sorry for taking so long to follow up on this, but I believe I've
> fixed this issue for good and you should now be able to launch 50+
> node clusters without issue. My original feeling was that the SGE
> install script was at fault; however, after several hours of digging I
> discovered that ssh-keyscan was failing when there were a large number
> of nodes. Long story short, this meant that passwordless SSH wasn't
> being set up fully for all nodes, so the SGE installer script could
> not connect to those nodes to add them to the queue. I found a much
> better way to populate the known_hosts file for all the nodes using
> paramiko instead of ssh-keyscan, which is much faster in this case.
>
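
For context, a rough sketch of that paramiko-based approach (illustrative only;
the function below is hypothetical, not StarCluster's actual code) might look
like this:

    import paramiko

    def populate_known_hosts(hostnames, known_hosts_path, port=22):
        # Fetch each node's SSH host key directly with paramiko and write
        # them all to a single known_hosts file.
        host_keys = paramiko.HostKeys()
        for hostname in hostnames:
            # Open a transport just long enough to grab the server's host key.
            transport = paramiko.Transport((hostname, port))
            try:
                transport.start_client()
                key = transport.get_remote_server_key()
                host_keys.add(hostname, key.get_name(), key)
            finally:
                transport.close()
        host_keys.save(known_hosts_path)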
> If you haven't already, please re-run 'python setup.py install' after
> pulling the latest code to test out the latest changes. I've also
> updated StarCluster to perform the setup on all nodes concurrently
> using a thread pool, so it should be noticeably faster for larger
> clusters. Please let me know if you run into any issues.
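
Purely to illustrate the idea (these names are made up, not StarCluster's
actual thread-pool module), the concurrent per-node setup amounts to something
like:

    import threading
    import Queue  # 'queue' on Python 3

    def setup_nodes_concurrently(nodes, setup_node, num_workers=20):
        # Queue up every node, then let a fixed pool of worker threads drain
        # the queue, running setup_node() on each one.
        tasks = Queue.Queue()
        for node in nodes:
            tasks.put(node)

        def worker():
            while True:
                try:
                    node = tasks.get_nowait()
                except Queue.Empty:
                    return
                setup_node(node)

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()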
>
> Thanks,
>
> ~Justin
>
> On Wed, Mar 16, 2011 at 1:37 PM, Kyeong Soo (Joseph) Kim
> <kyeongsoo.kim at gmail.com> wrote:
> > Justin,
> > Please find attached the said file.
> >
> > Regards,
> > Joseph
> >
> >
> > On Wed, Mar 16, 2011 at 4:38 PM, Justin Riley <jtriley at mit.edu> wrote:
> >> Joseph,
> >>
> >> Great, thanks. Can you also send me the /opt/sge6/ec2_sge.conf file, please?
> >>
> >> ~Justin
> >>
> >> On 03/16/2011 12:29 PM, Kyeong Soo (Joseph) Kim wrote:
> >>> Hi Justin,
> >>>
> >>> Please find attached the gzipped tar file of the log files under the
> >>> install_logs directory.
> >>>
> >>> Note that the configuration is for a 25-node (1 master and 24 slaves)
> >>> cluster.
> >>>
> >>> Below is the time-sorted listing of log files under the same directory:
> >>>
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 13:23 execd_install_node024_2011-03-16_13:23:11.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node023_2011-03-16_11:13:37.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node022_2011-03-16_11:13:36.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node021_2011-03-16_11:13:36.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node020_2011-03-16_11:13:32.log
> >>> -rw-r--r-- 1 kks kks  18K 2011-03-16 11:13 execd_install_master_2011-03-16_11:13:10.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node017_2011-03-16_11:13:27.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node018_2011-03-16_11:13:27.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node019_2011-03-16_11:13:28.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node016_2011-03-16_11:13:26.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node014_2011-03-16_11:13:25.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node015_2011-03-16_11:13:26.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node012_2011-03-16_11:13:24.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node013_2011-03-16_11:13:25.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node010_2011-03-16_11:13:23.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node011_2011-03-16_11:13:24.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node008_2011-03-16_11:13:22.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node009_2011-03-16_11:13:22.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node006_2011-03-16_11:13:21.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node007_2011-03-16_11:13:21.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node004_2011-03-16_11:13:20.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node005_2011-03-16_11:13:20.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node003_2011-03-16_11:13:19.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node001_2011-03-16_11:13:18.log
> >>> -rw-r--r-- 1 kks kks 2.9K 2011-03-16 11:13 execd_install_node002_2011-03-16_11:13:19.log
> >>> -rw-r--r-- 1 kks kks 3.1K 2011-03-16 11:13 execd_install_master_2011-03-16_11:13:17.log
> >>> -rw-r--r-- 1 kks kks 8.4K 2011-03-16 11:13 qmaster_install_master_2011-03-16_11:13:05.log
> >>>
> >>> As you can see, the execd installation on the master was run twice, and
> >>> the cluster ended up with master plus node001~node023; the top-most log,
> >>> for node024, is from adding that node manually with the "addnode" command
> >>> later (i.e., 1 hour 10 minutes afterwards).
> >>>
> >>> Even with this slimmed-down configuration (compared to the original
> >>> 125-node one), the chance that all nodes were properly installed (i.e.,
> >>> 25 out of 25) was about 50%; last night and this morning I tried about
> >>> 10 times to set up a total of five 25-node clusters.
> >>>
> >>> Regards,
> >>> Joseph
> >>>
> >>>
> >>> On Wed, Mar 16, 2011 at 3:57 PM, Justin Riley <jtriley at mit.edu> wrote:
> >>> Hi Jeff/Joseph,
> >>>
> >>> I just requested an increase to my EC2 instance limit so that I can test
> >>> things out at this scale and see what the issue is. In the meantime, would
> >>> you mind sending me any logs found in /opt/sge6/default/common/install_logs
> >>> and also /opt/sge6/ec2_sge.conf for a failed run?
> >>>
> >>> Also, if this happens again you could try reinstalling SGE manually,
> >>> assuming all the nodes are up:
> >>>
> >>> $ starcluster sshmaster mycluster
> >>> $ cd /opt/sge6
> >>> $ ./inst_sge -m -x -auto ./ec2_sge.conf
> >>>
> >>> ~Justin
> >>>
> >>> On 03/15/2011 06:30 PM, Kyeong Soo (Joseph) Kim wrote:
> >>>>>> Hi Jeff,
> >>>>>>
> >>>>>> I experienced the same thing with my 50-node configuration (c1.xlarge).
> >>>>>> Out of 50 nodes, only 29 were successfully identified by SGE.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Joseph
> >>>>>>
> >>>>>> On Sat, Mar 5, 2011 at 10:15 PM, Jeff White <jeff at decide.com> wrote:
> >>>>>>> I can frequently reproduce an issue where 'starcluster start'
> >>>>>>> completes without error, but not all nodes are added to the SGE
> >>>>>>> pool, which I verify by running 'qconf -sel' on the master. The
> >>>>>>> latest example I have is creating a 25-node cluster, where only
> >>>>>>> the first 12 nodes are successfully installed. The remaining
> >>>>>>> instances are running and I can ssh to them but they aren't
> >>>>>>> running sge_execd. There are only install log files for the first
> >>>>>>> 12 nodes in /opt/sge6/default/common/install_logs. I have not
> >>>>>>> found any clues in the starcluster debug log or the logs inside
> >>>>>>> master:/opt/sge6/.
> >>>>>>>
> >>>>>>> I am running starcluster development snapshot 8ef48a3 downloaded on
> >>>>>>> 2011-02-15, with the following relevant settings:
> >>>>>>>
> >>>>>>> NODE_IMAGE_ID=ami-8cf913e5
> >>>>>>> NODE_INSTANCE_TYPE = m1.small
> >>>>>>>
> >>>>>>> I have seen this behavior with the latest 32-bit and 64-bit
> >>>>>>> starcluster AMIs. Our workaround is to start a small cluster and
> >>>>>>> progressively add nodes one at a time, which is time-consuming.
> >>>>>>>
> >>>>>>> Has anyone else noticed this, and does anyone have a better
> >>>>>>> workaround or an idea for a fix?
> >>>>>>>
> >>>>>>> jeff
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> StarCluster mailing list
> >>>>>>> StarCluster at mit.edu
> >>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> >>>>>>>
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> StarCluster mailing list
> >>>>>> StarCluster at mit.edu
> >>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
> >>>
> >>>>
> >>
> >
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster at mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >
> >
>