[StarCluster] "connection closed" when running "starcluster sshmaster"

Signell, Richard rsignell at usgs.gov
Fri Jan 17 11:11:27 EST 2014


Rayson,

Okay, I tried your suggestion:

1.   I can ssh to node001 just fine.

rsignell@gam:~$ ssh -v -i /home/rsignell/.ssh/mykey2.rsa root@ec2-54-196-2-68.compute-1.amazonaws.com
OpenSSH_5.9p1 Debian-5ubuntu1, OpenSSL 1.0.1 14 Mar 2012
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to ec2-54-196-2-68.compute-1.amazonaws.com
[54.196.2.68] port 22.
debug1: Connection established.
debug1: identity file /home/rsignell/.ssh/mykey2.rsa type -1
debug1: identity file /home/rsignell/.ssh/mykey2.rsa-cert type -1
debug1: Remote protocol version 2.0, remote software version
OpenSSH_5.9p1 Debian-5ubuntu1
debug1: match: OpenSSH_5.9p1 Debian-5ubuntu1 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
...
[successful login]



2.  I cannot ssh to master:

rsignell@gam:~$ ssh -v -i /home/rsignell/.ssh/mykey2.rsa root@ec2-54-204-55-67.compute-1.amazonaws.com
OpenSSH_5.9p1 Debian-5ubuntu1, OpenSSL 1.0.1 14 Mar 2012
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to ec2-54-204-55-67.compute-1.amazonaws.com
[54.204.55.67] port 22.
debug1: Connection established.
debug1: identity file /home/rsignell/.ssh/mykey2.rsa type -1
debug1: identity file /home/rsignell/.ssh/mykey2.rsa-cert type -1
debug1: Remote protocol version 2.0, remote software version
OpenSSH_5.9p1 Debian-5ubuntu1
debug1: match: OpenSSH_5.9p1 Debian-5ubuntu1 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1
debug1: SSH2_MSG_KEXINIT sent
Connection closed by 54.204.55.67

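So it would seem that with master, my client sent SSH2_MSG_KEXINIT but never
received the server's SSH2_MSG_KEXINIT in reply; the server closed the
connection instead. With node001, both the "sent" and "received" lines appear.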
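
To make that comparison less error-prone, here is a minimal sketch in Python that reports the last handshake milestone found in captured "ssh -v" output. It is not part of StarCluster or OpenSSH: the function names (last_stage, closed_by_server) and the stage labels are made up for illustration, and the marker strings are simply the debug lines that appear in the two transcripts above.

```python
import re

# Handshake milestones in the order the OpenSSH client prints them with -v.
# Only marker strings that appear in the transcripts above are listed; the
# stage labels on the left are hypothetical names chosen for this sketch.
STAGES = [
    ("tcp_connected", "Connection established."),
    ("server_banner", "Remote protocol version"),
    ("kexinit_sent", "SSH2_MSG_KEXINIT sent"),
    ("kexinit_received", "SSH2_MSG_KEXINIT received"),
]


def last_stage(ssh_debug_output):
    """Return the label of the last handshake stage seen, or None."""
    reached = None
    for name, marker in STAGES:
        if marker in ssh_debug_output:
            reached = name
    return reached


def closed_by_server(ssh_debug_output):
    """Return the peer address from 'Connection closed by <addr>', or None."""
    match = re.search(r"Connection closed by (\S+)", ssh_debug_output)
    return match.group(1) if match else None


# Shortened copies of the two transcripts above.
node001_log = """debug1: Connection established.
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.9p1 Debian-5ubuntu1
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
"""

master_log = """debug1: Connection established.
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.9p1 Debian-5ubuntu1
debug1: SSH2_MSG_KEXINIT sent
Connection closed by 54.204.55.67
"""

print(last_stage(node001_log), closed_by_server(node001_log))
# prints: kexinit_received None
print(last_stage(master_log), closed_by_server(master_log))
# prints: kexinit_sent 54.204.55.67
```

This only summarizes what the client logged; it does not explain why the master dropped the connection.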

Does that give a clue?

[Also, is it okay to have this type of discussion on this list?]

-Rich

On Fri, Jan 17, 2014 at 10:56 AM, Rayson Ho <raysonlogin at gmail.com> wrote:
> Did you overwrite your SSH private key
> (/home/rsignell/.ssh/mykey2.rsa) with a new one?
>
> Also, can you run the SSH client directly from the command line with
> verbose (-v) on and see if that gives you anything?
>
> Example:
>
> % ssh -v -i /home/rsignell/.ssh/mykey2.rsa root@ec2-54-196-2-68.compute-1.amazonaws.com
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Fri, Jan 17, 2014 at 8:03 AM, Signell, Richard <rsignell at usgs.gov> wrote:
>> Rayson,
>>
>> I tried ssh'ing into node001 as you suggested, but the process just
>> seems to hang.  I waited 5 minutes, tried to ctrl-c, ctrl-z, nothing
>> worked.   Finally killed terminal.
>>
>> What should I try next?
>>
>> rsignell@gam:~$ starcluster -d sshnode rps_cluster node001
>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>> Software Tools for Academics and Researchers (STAR)
>> Please submit bug reports to starcluster at mit.edu
>>
>> 2014-01-17 07:59:34,900 config.py:567 - DEBUG - Loading config
>> 2014-01-17 07:59:34,900 config.py:138 - DEBUG - Loading file:
>> /home/rsignell/.starcluster/config
>> ...
>>
>> 2014-01-17 07:59:34,935 awsutils.py:74 - DEBUG - creating self._conn
>> w/ connection_authenticator kwargs = {'proxy_user': None,
>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>> None}
>> 2014-01-17 07:59:35,797 cluster.py:711 - DEBUG - existing nodes: {}
>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>> i-7a50f654 to self._nodes list
>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>> i-7950f657 to self._nodes list
>> 2014-01-17 07:59:35,798 cluster.py:727 - DEBUG - returning self._nodes
>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>> 2014-01-17 07:59:35,905 cluster.py:711 - DEBUG - existing nodes:
>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>> master (i-7950f657)>}
>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>> node i-7a50f654 in self._nodes
>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>> node i-7950f657 in self._nodes
>> 2014-01-17 07:59:35,906 cluster.py:727 - DEBUG - returning self._nodes
>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>> 2014-01-17 07:59:36,119 node.py:1039 - DEBUG - Using native OpenSSH client
>> 2014-01-17 07:59:36,119 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>> /home/rsignell/.ssh/mykey2.rsa
>> root@ec2-54-196-2-68.compute-1.amazonaws.com
>> [wait, wait.... nothing....]
>>
>>
>> On Thu, Jan 16, 2014 at 5:24 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>> The SSH daemon is responding (and the EC2 security group is not
>>> blocking traffic), which is good.
>>>
>>> However, if logging onto the master was working a few hours ago and no
>>> longer is, then try to log onto the Grid Engine execution node by using,
>>> for example, "starcluster sshnode rps_cluster node001". If SSHing into
>>> the execution node works, then it is likely to be an issue with the
>>> StarCluster master instance.
>>>
>>> Rayson
>>>
>>> ==================================================
>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>> http://gridscheduler.sourceforge.net/
>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>
>>>
>>> On Thu, Jan 16, 2014 at 4:55 PM, Signell, Richard <rsignell at usgs.gov> wrote:
>>>> I set up a machine this morning and
>>>> starcluster sshmaster rps_cluster
>>>> was working fine to ssh in.
>>>>
>>>> But now I'm getting "Connection closed by 54.204.55.67"
>>>>
>>>> It seems that the cluster is running:
>>>>
>>>> rsignell@gam:~$ starcluster listclusters
>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>> Software Tools for Academics and Researchers (STAR)
>>>> Please submit bug reports to starcluster at mit.edu
>>>>
>>>> ---------------------------------------------
>>>> rps_cluster (security group: @sc-rps_cluster)
>>>> ---------------------------------------------
>>>> Launch time: 2014-01-16 08:18:09
>>>> Uptime: 0 days, 08:34:07
>>>> Zone: us-east-1a
>>>> Keypair: mykey2
>>>> EBS volumes: N/A
>>>> Cluster nodes:
>>>>      master running i-7950f657 ec2-54-204-55-67.compute-1.amazonaws.com
>>>>     node001 running i-7a50f654 ec2-54-196-2-68.compute-1.amazonaws.com
>>>> Total nodes: 2
>>>>
>>>> And I don't see anything obvious in the verbose debug output:
>>>>
>>>> rsignell@gam:~$ starcluster -d sshmaster rps_cluster
>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>> Software Tools for Academics and Researchers (STAR)
>>>> Please submit bug reports to starcluster at mit.edu
>>>>
>>>> 2014-01-16 16:53:13,515 config.py:567 - DEBUG - Loading config
>>>> 2014-01-16 16:53:13,515 config.py:138 - DEBUG - Loading file:
>>>> /home/rsignell/.starcluster/config
>>>> 2014-01-16 16:53:13,517 config.py:322 - DEBUG - include setting not
>>>> specified. Defaulting to []
>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>> not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - refresh_interval
>>>> setting not specified. Defaulting to 30
>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - include setting not
>>>> specified. Defaulting to []
>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>> not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - refresh_interval
>>>> setting not specified. Defaulting to 30
>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_proxy_pass setting
>>>> not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_validate_certs
>>>> setting not specified. Defaulting to True
>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_ec2_path setting
>>>> not specified. Defaulting to /
>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_region_name
>>>> setting not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_region_host
>>>> setting not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_s3_path setting
>>>> not specified. Defaulting to /
>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_proxy_user setting
>>>> not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_is_secure setting
>>>> not specified. Defaulting to True
>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_s3_host setting
>>>> not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_port setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_private_key
>>>> setting not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_cert setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy_port setting
>>>> not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - device setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - partition setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - device setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - partition setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - disable_queue setting
>>>> not specified. Defaulting to False
>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - volumes setting not
>>>> specified. Defaulting to []
>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - availability_zone
>>>> setting not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - spot_bid setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_instance_type
>>>> setting not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - disable_cloudinit
>>>> setting not specified. Defaulting to False
>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - force_spot_master
>>>> setting not specified. Defaulting to False
>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - extends setting not
>>>> specified. Defaulting to None
>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_image_id
>>>> setting not specified. Defaulting to None
>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - userdata_scripts
>>>> setting not specified. Defaulting to []
>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - permissions setting
>>>> not specified. Defaulting to []
>>>> 2014-01-16 16:53:13,529 awsutils.py:74 - DEBUG - creating self._conn
>>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>>> None}
>>>> 2014-01-16 16:53:13,872 cluster.py:711 - DEBUG - existing nodes: {}
>>>> 2014-01-16 16:53:13,872 cluster.py:719 - DEBUG - adding node
>>>> i-7a50f654 to self._nodes list
>>>> 2014-01-16 16:53:13,873 cluster.py:719 - DEBUG - adding node
>>>> i-7950f657 to self._nodes list
>>>> 2014-01-16 16:53:13,873 cluster.py:727 - DEBUG - returning self._nodes
>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>> 2014-01-16 16:53:14,063 cluster.py:711 - DEBUG - existing nodes:
>>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>>> master (i-7950f657)>}
>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>> node i-7a50f654 in self._nodes
>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>> node i-7950f657 in self._nodes
>>>> 2014-01-16 16:53:14,064 cluster.py:727 - DEBUG - returning self._nodes
>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>> 2014-01-16 16:53:14,168 node.py:1039 - DEBUG - Using native OpenSSH client
>>>> 2014-01-16 16:53:14,169 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>>> /home/rsignell/.ssh/mykey2.rsa
>>>> root@ec2-54-204-55-67.compute-1.amazonaws.com
>>>> Connection closed by 54.204.55.67
>>>>
>>>>
>>>> I didn't see any "common problems" or "troubleshooting" sections in
>>>> the starcluster documentation, and I checked the FAQ and the mailing
>>>> list archives, but I probably overlooked something, as this certainly
>>>> seems like a newbie question (which I am).
>>>>
>>>> Thanks,
>>>> Rich
>>>> --
>>>> Dr. Richard P. Signell   (508) 457-2229
>>>> USGS, 384 Woods Hole Rd.
>>>> Woods Hole, MA 02543-1598
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
>>
>> --
>> Dr. Richard P. Signell   (508) 457-2229
>> USGS, 384 Woods Hole Rd.
>> Woods Hole, MA 02543-1598



-- 
Dr. Richard P. Signell   (508) 457-2229
USGS, 384 Woods Hole Rd.
Woods Hole, MA 02543-1598

