[Starcluster] error when starting cluster

Justin Riley jtriley at MIT.EDU
Tue Apr 20 19:20:16 EDT 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Damian,

> >>> Waiting for cluster to start...-

For some reason it's not detecting that the cluster is up. Could you try
with "starcluster -d start ..."? That will enable verbose debug output.
What does starcluster listclusters look like? Are all the nodes in a
running state? Also, are you sure your config has cluster_size=8?

~Justin
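
A minimal sketch of the consistency check suggested above, assuming the
instance states have already been collected by hand (for example from the
AWS console or from the listclusters output). The function name
check_cluster_size and the list of state strings are hypothetical and are
not part of StarCluster:

```python
from collections import Counter


def check_cluster_size(expected_size, instance_states):
    """Compare the configured cluster_size with the number of instances
    whose state is 'running'. Returns (ok, message)."""
    counts = Counter(instance_states)
    running = counts.get("running", 0)
    if running == expected_size:
        return True, "all %d nodes are running" % running
    return False, "expected %d running nodes, found %d (%s)" % (
        expected_size, running, dict(counts))


# Hypothetical example: the config says cluster_size=8, but only two
# instances are actually running and one is still pending.
ok, msg = check_cluster_size(8, ["running", "running", "pending"])
print(ok, msg)
```

If the check fails, either wait for the remaining nodes to reach the
running state or pass a matching --cluster-size on the command line.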

On 04/20/2010 07:05 PM, Damian Eads wrote:
> I just added 6 instances to make the total 8 high cpu instances 
> (c1.xlarge). I rebooted the existing instances, restarted the
> cluster, and the cluster started without any errors or warnings. I
> tried running my code over 64 cores and after a while NFS fails.
> 
> [domU-12-31-38-04-A1-11:04233] [[651,0],3] routed:binomial: Connection to lifeline [[651,0],0] lost
> [domU-12-31-38-04-A1-11:04233] [[651,0],3]->[[651,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
> [domU-12-31-38-04-A1-11:04233] [[651,0],3] routed:binomial: Connection to lifeline [[651,0],0] lost
> [domU-12-31-38-04-A1-11:04233] [[651,0],3]->[[651,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
> [domU-12-31-38-04-A0-01:04255] [[651,0],2] routed:binomial: Connection to lifeline [[651,0],0] lost
> [domU-12-31-38-01-61-81:02915] [[651,0],1]->[[651,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
> [domU-12-31-38-01-61-81:02915] [[651,0],1] routed:binomial: Connection to lifeline [[651,0],0] lost
> [domU-12-31-38-01-61-81:02915] [[651,0],1]->[[651,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
> [domU-12-31-38-01-61-81:02915] [[651,0],1] routed:binomial: Connection to lifeline [[651,0],0] lost
> [domU-12-31-38-01-61-81:02915] [[651,0],1]->[[651,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
> [domU-12-31-38-01-61-81:02915] [[651,0],1] routed:binomial: Connection to lifeline [[651,0],0] lost
> [domU-12-31-38-01-61-81:02915] [[651,0],1]->[[651,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
> 
> I then tried rebooting all the instances, detaching volumes, and 
> restarting the cluster but it hangs.
> 
> eads at street:~/work/repo/StarCluster$ starcluster start -x --cluster-size 8 mycluster dtest
> /tmp/qqq/lib/python2.6/site-packages/pycrypto-2.0.1-py2.6-linux-x86_64.egg/Crypto/Hash/SHA.py:6: DeprecationWarning: the sha module is deprecated; use the hashlib module instead
> /tmp/qqq/lib/python2.6/site-packages/pycrypto-2.0.1-py2.6-linux-x86_64.egg/Crypto/Hash/MD5.py:6: DeprecationWarning: the md5 module is deprecated; use hashlib instead
> /var/lib/python-support/python2.6/IPython/Magic.py:38: DeprecationWarning: the sets module is deprecated
>   from sets import Set
> StarCluster - (http://web.mit.edu/starcluster)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
>
> >>> Validating cluster settings...
> >>> Cluster settings are valid
> >>> Starting cluster...
> >>> Waiting for cluster to start...-
> 
> I know this output isn't very helpful. I'll see if I can reproduce
> the error.
> 
> Cheers,
> 
> Damian
> 
> On Tue, Apr 20, 2010 at 3:02 PM, Justin Riley <jtriley at mit.edu> wrote:
> Hi Damian,
> 
>>>> It worked, thanks very much for the prompt fix.
> 
> Excellent, glad to hear that.
> 
>>>> Tell me if you think this will work.
> 
> Yep that should work although I don't believe you'll need to reboot
> the instances or even detach the volumes but it shouldn't hurt. The
> big thing is to make sure you have cluster_size consistent with how
> many running nodes are in the cluster's security group. So, you might
> need to do the following assuming your cluster template (mycluster)
> has cluster_size=8 and that there are actually 2 running instances:
> 
> $ starcluster start -x --cluster-size 2 mycluster dtest
> 
> Hope that helps,
> 
> ~Justin
> 
> 
> 
> On 04/20/2010 05:44 PM, Damian Eads wrote:
>>>> Hi Justin,
>>>> 
>>>> It worked, thanks very much for the prompt fix. Before I
>>>> received your e-mail, I killed 6 of my 8 octcore instances to
>>>> save money. Tell me if you think this will work.
>>>> 
>>>> 1. Through the AWS web console, detach currently used volumes.
>>>> 2. Manually reboot the instances currently running.
>>>> 3. Manually launch additional spot instances in the same
>>>>    availability group as the ones currently running.
>>>> 4. Rerun starcluster start -x mycluster dtest
>>>> 
>>>> Being able to restart the cluster without first terminating
>>>> the instances and then relaunching them will save money. Do you
>>>> think this will work? I don't mind doing it manually.
>>>> 
>>>> Thanks a lot in advance!
>>>> 
>>>> Damian
>>>> 
>>>> 
>>>> On Tue, Apr 20, 2010 at 2:16 PM, Justin Riley <jtriley at mit.edu> wrote:
>>>> Hi Damian,
>>>> 
>>>> I believe I've fixed this in github. Could you pull and give it
>>>> another shot?
>>>> 
>>>> Also, I've added support for master/node001/etc aliases to the
>>>> sshnode action. So, you should now be able to:
>>>> 
>>>> $ starcluster sshnode mycluster master
>>>> $ starcluster sshnode mycluster node001
>>>> etc
>>>> 
>>>> Please let me know if the latest github code fixes your problem
>>>> below and if you have any other issues.
>>>> 
>>>> Thanks,
>>>> 
>>>> ~Justin
>>>> 
>>>> On 04/20/2010 04:38 PM, Damian Eads wrote:
>>>>>>> Hi Justin,
>>>>>>> 
>>>>>>> I just did a git pull and got the following error when I
>>>>>>> tried creating my cluster. Ideas?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Damian
>>>>>>> 
>>>>>>> eads at street:~/work/repo/StarCluster$ starcluster start -x mycluster dtest
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/pycrypto-2.0.1-py2.6-linux-x86_64.egg/Crypto/Hash/SHA.py:6: DeprecationWarning: the sha module is deprecated; use the hashlib module instead
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/pycrypto-2.0.1-py2.6-linux-x86_64.egg/Crypto/Hash/MD5.py:6: DeprecationWarning: the md5 module is deprecated; use hashlib instead
>>>>>>> /var/lib/python-support/python2.6/IPython/Magic.py:38: DeprecationWarning: the sets module is deprecated
>>>>>>>   from sets import Set
>>>>>>> StarCluster - (http://web.mit.edu/starcluster)
>>>>>>> Software Tools for Academics and Researchers (STAR)
>>>>>>> Please submit bug reports to starcluster at mit.edu
>>>>>>>
>>>>>>> >>> Validating cluster settings...
>>>>>>> >>> Cluster settings are valid
>>>>>>> >>> Starting cluster...
>>>>>>> >>> Waiting for cluster to start...
>>>>>>> >>> The master node is ec2-174-129-172-124.compute-1.amazonaws.com
>>>>>>> >>> Attaching volume vol-c5e85dac to master node...
>>>>>>> >>> Setting up the cluster...
>>>>>>> >>> Mounting EBS volume vol-c5e85dac on /data...
>>>>>>> ssh.py:66 - WARNING - specified key does not end in either rsa or dsa, trying both
>>>>>>> >>> Using private key /home/eads/deadskey.pem (rsa)
>>>>>>> ERROR: An unexpected error occurred while tokenizing input
>>>>>>> The following traceback may be corrupted or invalid
>>>>>>> The error message is: ('EOF in multi-line statement', (405, 0))
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------------
>>>>>>> TypeError                                 Traceback (most recent call last)
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/EGG-INFO/scripts/starcluster in <module>()
>>>>>>>       3 __requires__ = 'StarCluster==0.9999'
>>>>>>>       4 import pkg_resources
>>>>>>> ----> 5 pkg_resources.run_script('StarCluster==0.9999', 'starcluster')
>>>>>>>       6
>>>>>>>       7
>>>>>>>
>>>>>>> /usr/lib/python2.6/dist-packages/pkg_resources.pyc in run_script(self, requires, script_name)
>>>>>>>     446         ns.clear()
>>>>>>>     447         ns['__name__'] = name
>>>>>>> --> 448         self.require(requires)[0].run_script(script_name, ns)
>>>>>>>     449
>>>>>>>     450
>>>>>>>
>>>>>>> /usr/lib/python2.6/dist-packages/pkg_resources.pyc in run_script(self, script_name, namespace)
>>>>>>>    1171             )
>>>>>>>    1172             script_code = compile(script_text,script_filename,'exec')
>>>>>>> -> 1173             exec script_code in namespace, namespace
>>>>>>>    1174
>>>>>>>    1175     def _has(self, path):
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/EGG-INFO/scripts/starcluster in <module>()
>>>>>>>       4
>>>>>>>       5
>>>>>>> ----> 6
>>>>>>>       7
>>>>>>>       8
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cli.pyc in main()
>>>>>>>     850         sys.exit(0)
>>>>>>>     851     try:
>>>>>>> --> 852         sc.execute(args)
>>>>>>>     853     except exception.BaseException,e:
>>>>>>>     854         log.error(e.msg)
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cli.pyc in execute(self, args)
>>>>>>>     169             log.info('Cluster settings are valid')
>>>>>>>     170             if not self.opts.validate_only:
>>>>>>> --> 171                 scluster.start(create=not self.opts.no_create)
>>>>>>>     172                 if self.opts.login_master:
>>>>>>>     173                     cluster.ssh_to_master(tag, self.cfg)
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/utils.pyc in wrapper(*arg, **kargs)
>>>>>>>      23         """Raw timing function """
>>>>>>>      24         time1 = time.time()
>>>>>>> ---> 25         res = func(*arg, **kargs)
>>>>>>>      26         time2 = time.time()
>>>>>>>      27         log.info('%s took %0.3f mins' % (func.func_name, (time2-time1)/60.0))
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.pyc in start(self, create)
>>>>>>>     476             self.nodes, self.master_node,
>>>>>>>     477             self.cluster_user, self.cluster_shell,
>>>>>>> --> 478             self.volumes
>>>>>>>     479         )
>>>>>>>     480         self.create_receipt()
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/clustersetup.pyc in run(self, nodes, master, user, user_shell, volumes)
>>>>>>>     312         self._volumes = volumes
>>>>>>>     313         self._setup_ebs_volume()
>>>>>>> --> 314         self._setup_cluster_user()
>>>>>>>     315         self._setup_scratch()
>>>>>>>     316         self._setup_etc_hosts()
>>>>>>>
>>>>>>> /tmp/qqq/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/clustersetup.pyc in _setup_cluster_user(self)
>>>>>>>      67             max_uid = max(uid_db.keys())
>>>>>>>      68             max_gid = uid_db[max_uid][1]
>>>>>>> ---> 69             uid, gid = max_uid+1, max_gid+1
>>>>>>>      70
>>>>>>>      71         log.debug("Cluster user gid/uid: (%d, %d)" % (uid,gid))
>>>>>>>
>>>>>>> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
>>>>>>> eads at street:~/work/repo/StarCluster$
>>>>>>> _______________________________________________
>>>>>>> Starcluster mailing list
>>>>>>> Starcluster at mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvONrAACgkQ4llAkMfDcrlPKwCfVygmU6ca9RIa7q0pXRqpO7qV
RbMAn1uakSMS/tq8x/qhIeXORLptCVRC
=wAZf
-----END PGP SIGNATURE-----
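
The TypeError at the bottom of the quoted traceback comes from the line
"uid, gid = max_uid+1, max_gid+1": max_gid was None, and None cannot be
added to an int. The sketch below reproduces that situation and shows one
way to guard against it. It is only an illustration and not the fix that
was committed to github. The function name next_uid_gid, the default_gid
value, and the shape of uid_db (uid mapped to a tuple whose second element
is the gid, possibly None) are assumptions made for the example:

```python
def next_uid_gid(uid_db, default_gid=1000):
    """Pick the next uid/gid pair after the highest uid in uid_db.

    Assumption: uid_db maps uid -> tuple whose second element is the
    gid, which may be None (the case that raised the TypeError).
    """
    if not uid_db:
        # No existing users: start from a hypothetical default.
        return default_gid, default_gid
    max_uid = max(uid_db.keys())
    max_gid = uid_db[max_uid][1]
    if max_gid is None:
        # Fall back to the largest known gid, or to the uid itself
        # if no entry carries a gid at all.
        known = [entry[1] for entry in uid_db.values()
                 if entry[1] is not None]
        max_gid = max(known) if known else max_uid
    return max_uid + 1, max_gid + 1


# Hypothetical input where the highest uid has no gid recorded.
uid_db = {1000: ("alice", 1000), 1001: ("bob", None)}
print(next_uid_gid(uid_db))
```

Without the None check, the last line of the function would raise the same
"unsupported operand type(s) for +: 'NoneType' and 'int'" error shown in
the traceback.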


