[StarCluster] "connection closed" when running "starcluster sshmaster"

Signell, Richard rsignell at usgs.gov
Fri Jan 17 17:01:15 EST 2014


Well, I couldn't reboot either, so I ended up terminating the cluster.
A bit disconcerting, since it lasted only a few hours.

But the good news is that I had made an AMI with all of my changes, and
a new cluster started cleanly from it, so I'm good to go.

Thanks for the help,
Rich

On Fri, Jan 17, 2014 at 4:11 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> Hmm, it looks like your master machine has some corrupted files, since
> /opt/sge6 is missing as well... You may want to reboot the instance, and
> the easiest way to do that without relying on StarCluster or other
> services is to go to the AWS web management console
> (https://aws.amazon.com/console/), sign in, select the "master"
> instance, right-click, and then reboot.
>
> It shouldn't take more than a few minutes to reboot the master
> instance. However, if too many files are corrupted, the reboot will
> fail. If you have important data on the instance, you will then need
> to stop the instance, detach its EBS root volume, and mount it as a
> data partition on another EC2 instance. (That's slightly more
> complicated...) On the other hand, if it is a test cluster, it is much
> faster and cheaper to destroy the cluster and create a new one.
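>
> (A rough sketch of that recovery route with the AWS CLI, assuming the
> root EBS volume ID has been looked up in the console and a separate
> helper instance is already running; the volume ID, helper instance ID,
> and device names below are placeholders:)
>
> % aws ec2 stop-instances --instance-ids i-7950f657
> % aws ec2 detach-volume --volume-id vol-xxxxxxxx
> % aws ec2 attach-volume --volume-id vol-xxxxxxxx \
>       --instance-id i-xxxxxxxx --device /dev/sdf
>
> # then, on the helper instance (the filesystem may appear as
> # /dev/xvdf or /dev/xvdf1 depending on the AMI):
> % sudo mkdir -p /mnt/recovery
> % sudo mount /dev/xvdf /mnt/recovery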
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Fri, Jan 17, 2014 at 3:54 PM, Signell, Richard <rsignell at usgs.gov> wrote:
>> root at node001:~# qrsh -l h=master
>> The program 'qrsh' is currently not installed.  You can install it by typing:
>> apt-get install gridengine-client
>>
>> I then tried installing gridengine-client, but I wasn't sure about all
>> of the configuration prompts, and got some worrisome output:
>>
>> setting default_transport: error
>> setting relay_transport: error
>> /etc/aliases does not exist, creating it.
>> WARNING: /etc/aliases exists, but does not have a root alias.
>>
>>
>> At the end, when I try the command again, it still fails:
>>
>> root at node001:~# qrsh -l h=master
>> error: cell directory "/opt/sge6/default" doesn't exist
>>
>>
>> On Fri, Jan 17, 2014 at 1:18 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>> On Fri, Jan 17, 2014 at 11:11 AM, Signell, Richard <rsignell at usgs.gov> wrote:
>>>> Rayson,
>>>>
>>>> Okay, I tried your suggestion:
>>>>
>>>> 1.   I can ssh to node001 just fine.
>>>>
>>>
>>> That's good; it means the private key on your side works, and on
>>> Amazon's side the slave node has the matching public key in place.
>>>
>>>
>>>> 2.  I cannot ssh to master:
>>>>
>>> ...
>>>>
>>>> Does that give a clue?
>>>
>>> So it seems like the master node doesn't like your key. You can try:
>>>
>>> - First SSH into node001, and then run "ssh master".
>>>
>>> - If that still fails, try using the Grid Engine mechanism to get
>>> back onto the master node: first SSH into node001, then run
>>> "qrsh -l h=master":
>>>
>>> root at node001:~# qrsh -l h=master
>>> groups: cannot find name for group ID 20003
>>> root at master:~#
>>>
>>> (Ignore the complaint about the extra group ID (GID); it is an
>>> additional GID added by Grid Engine.)
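>>>
>>> (If you want to check which execution hosts Grid Engine knows about
>>> before targeting one, the standard SGE commands will list them; this
>>> assumes the StarCluster Grid Engine environment is already set up in
>>> the shell on node001:)
>>>
>>> root at node001:~# qconf -sel    # list execution host names
>>> root at node001:~# qhost         # show per-host load and status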
>>>
>>>
>>>> [Also, is it okay to have this type of discussion on this list?]
>>>
>>> As long as it is related to StarCluster, Amazon EC2, or Grid Engine,
>>> I am happy to help! :-D
>>>
>>> Rayson
>>>
>>> ==================================================
>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>> http://gridscheduler.sourceforge.net/
>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>
>>>
>>>>
>>>> -Rich
>>>>
>>>> On Fri, Jan 17, 2014 at 10:56 AM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>>>> Did you overwrite your SSH private key
>>>>> (/home/rsignell/.ssh/mykey2.rsa) with a new one?
>>>>>
>>>>> Also, can you run the SSH client directly from the command line with
>>>>> verbose output (-v) enabled and see if that gives you anything?
>>>>>
>>>>> Example:
>>>>>
>>>>> % ssh -v -i /home/rsignell/.ssh/mykey2.rsa
>>>>> root at ec2-54-196-2-68.compute-1.amazonaws.com
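>>>>>
>>>>> (If the connection just hangs rather than failing, adding a
>>>>> client-side timeout makes that explicit. ConnectTimeout is a
>>>>> standard OpenSSH option, but note it only bounds the initial TCP
>>>>> connection; the -v output shows how far the handshake gets before
>>>>> it stalls:)
>>>>>
>>>>> % ssh -v -o ConnectTimeout=10 -i /home/rsignell/.ssh/mykey2.rsa
>>>>> root at ec2-54-196-2-68.compute-1.amazonaws.com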
>>>>>
>>>>> Rayson
>>>>>
>>>>> ==================================================
>>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>>> http://gridscheduler.sourceforge.net/
>>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>>
>>>>>
>>>>> On Fri, Jan 17, 2014 at 8:03 AM, Signell, Richard <rsignell at usgs.gov> wrote:
>>>>>> Rayson,
>>>>>>
>>>>>> I tried SSHing into node001 as you suggested, but the process just
>>>>>> seems to hang.  I waited 5 minutes and tried Ctrl-C and Ctrl-Z, but
>>>>>> nothing worked.  Finally I killed the terminal.
>>>>>>
>>>>>> What should I try next?
>>>>>>
>>>>>> rsignell at gam:~$ starcluster -d sshnode rps_cluster node001
>>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>>> Software Tools for Academics and Researchers (STAR)
>>>>>> Please submit bug reports to starcluster at mit.edu
>>>>>>
>>>>>> 2014-01-17 07:59:34,900 config.py:567 - DEBUG - Loading config
>>>>>> 2014-01-17 07:59:34,900 config.py:138 - DEBUG - Loading file:
>>>>>> /home/rsignell/.starcluster/config
>>>>>> ...
>>>>>>
>>>>>> 2014-01-17 07:59:34,935 awsutils.py:74 - DEBUG - creating self._conn
>>>>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>>>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>>>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>>>>> None}
>>>>>> 2014-01-17 07:59:35,797 cluster.py:711 - DEBUG - existing nodes: {}
>>>>>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>>>>>> i-7a50f654 to self._nodes list
>>>>>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>>>>>> i-7950f657 to self._nodes list
>>>>>> 2014-01-17 07:59:35,798 cluster.py:727 - DEBUG - returning self._nodes
>>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>>> 2014-01-17 07:59:35,905 cluster.py:711 - DEBUG - existing nodes:
>>>>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>>>>> master (i-7950f657)>}
>>>>>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>>>>>> node i-7a50f654 in self._nodes
>>>>>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>>>>>> node i-7950f657 in self._nodes
>>>>>> 2014-01-17 07:59:35,906 cluster.py:727 - DEBUG - returning self._nodes
>>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>>> 2014-01-17 07:59:36,119 node.py:1039 - DEBUG - Using native OpenSSH client
>>>>>> 2014-01-17 07:59:36,119 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>>>>> /home/rsignell/.ssh/mykey2.rsa
>>>>>> root at ec2-54-196-2-68.compute-1.amazonaws.com
>>>>>> [wait, wait.... nothing....]
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 16, 2014 at 5:24 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>>>>>> The SSH daemon is responding (and the EC2 security group is not
>>>>>>> blocking traffic), which is good.
>>>>>>>
>>>>>>> However, since logging onto the master was working a few hours ago
>>>>>>> but isn't anymore, try logging onto the Grid Engine execution node,
>>>>>>> for example with "starcluster sshnode rps_cluster node001". If
>>>>>>> SSHing into the execution node works, then the problem is likely
>>>>>>> with the StarCluster master instance.
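>>>>>>>
>>>>>>> (One quick way to double-check that port 22 on the master is still
>>>>>>> reachable from your workstation, assuming netcat is installed; it
>>>>>>> only confirms that something is listening on the port, not that
>>>>>>> your key is accepted:)
>>>>>>>
>>>>>>> % nc -vz ec2-54-204-55-67.compute-1.amazonaws.com 22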
>>>>>>>
>>>>>>> Rayson
>>>>>>>
>>>>>>> ==================================================
>>>>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>>>>> http://gridscheduler.sourceforge.net/
>>>>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 16, 2014 at 4:55 PM, Signell, Richard <rsignell at usgs.gov> wrote:
>>>>>>>> I set up a cluster this morning, and
>>>>>>>> starcluster sshmaster rps_cluster
>>>>>>>> was working fine for SSHing in.
>>>>>>>>
>>>>>>>> But now I'm getting "Connection closed by 54.204.55.67"
>>>>>>>>
>>>>>>>> It seems that the cluster is running:
>>>>>>>>
>>>>>>>> rsignell at gam:~$ starcluster listclusters
>>>>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>>>>> Software Tools for Academics and Researchers (STAR)
>>>>>>>> Please submit bug reports to starcluster at mit.edu
>>>>>>>>
>>>>>>>> ---------------------------------------------
>>>>>>>> rps_cluster (security group: @sc-rps_cluster)
>>>>>>>> ---------------------------------------------
>>>>>>>> Launch time: 2014-01-16 08:18:09
>>>>>>>> Uptime: 0 days, 08:34:07
>>>>>>>> Zone: us-east-1a
>>>>>>>> Keypair: mykey2
>>>>>>>> EBS volumes: N/A
>>>>>>>> Cluster nodes:
>>>>>>>>      master running i-7950f657 ec2-54-204-55-67.compute-1.amazonaws.com
>>>>>>>>     node001 running i-7a50f654 ec2-54-196-2-68.compute-1.amazonaws.com
>>>>>>>> Total nodes: 2
>>>>>>>>
>>>>>>>> And I don't see anything obvious in the verbose debug output:
>>>>>>>>
>>>>>>>> rsignell at gam:~$ starcluster -d sshmaster rps_cluster
>>>>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>>>>> Software Tools for Academics and Researchers (STAR)
>>>>>>>> Please submit bug reports to starcluster at mit.edu
>>>>>>>>
>>>>>>>> 2014-01-16 16:53:13,515 config.py:567 - DEBUG - Loading config
>>>>>>>> 2014-01-16 16:53:13,515 config.py:138 - DEBUG - Loading file:
>>>>>>>> /home/rsignell/.starcluster/config
>>>>>>>> 2014-01-16 16:53:13,517 config.py:322 - DEBUG - include setting not
>>>>>>>> specified. Defaulting to []
>>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>>>>>> not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - refresh_interval
>>>>>>>> setting not specified. Defaulting to 30
>>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - include setting not
>>>>>>>> specified. Defaulting to []
>>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>>>>>> not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - refresh_interval
>>>>>>>> setting not specified. Defaulting to 30
>>>>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_proxy_pass setting
>>>>>>>> not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_validate_certs
>>>>>>>> setting not specified. Defaulting to True
>>>>>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_ec2_path setting
>>>>>>>> not specified. Defaulting to /
>>>>>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_region_name
>>>>>>>> setting not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_region_host
>>>>>>>> setting not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_s3_path setting
>>>>>>>> not specified. Defaulting to /
>>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_proxy_user setting
>>>>>>>> not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_is_secure setting
>>>>>>>> not specified. Defaulting to True
>>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_s3_host setting
>>>>>>>> not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_port setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_private_key
>>>>>>>> setting not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_cert setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy_port setting
>>>>>>>> not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - device setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - partition setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - device setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - partition setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - disable_queue setting
>>>>>>>> not specified. Defaulting to False
>>>>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - volumes setting not
>>>>>>>> specified. Defaulting to []
>>>>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - availability_zone
>>>>>>>> setting not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - spot_bid setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_instance_type
>>>>>>>> setting not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - disable_cloudinit
>>>>>>>> setting not specified. Defaulting to False
>>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - force_spot_master
>>>>>>>> setting not specified. Defaulting to False
>>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - extends setting not
>>>>>>>> specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_image_id
>>>>>>>> setting not specified. Defaulting to None
>>>>>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - userdata_scripts
>>>>>>>> setting not specified. Defaulting to []
>>>>>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - permissions setting
>>>>>>>> not specified. Defaulting to []
>>>>>>>> 2014-01-16 16:53:13,529 awsutils.py:74 - DEBUG - creating self._conn
>>>>>>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>>>>>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>>>>>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>>>>>>> None}
>>>>>>>> 2014-01-16 16:53:13,872 cluster.py:711 - DEBUG - existing nodes: {}
>>>>>>>> 2014-01-16 16:53:13,872 cluster.py:719 - DEBUG - adding node
>>>>>>>> i-7a50f654 to self._nodes list
>>>>>>>> 2014-01-16 16:53:13,873 cluster.py:719 - DEBUG - adding node
>>>>>>>> i-7950f657 to self._nodes list
>>>>>>>> 2014-01-16 16:53:13,873 cluster.py:727 - DEBUG - returning self._nodes
>>>>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>>>>> 2014-01-16 16:53:14,063 cluster.py:711 - DEBUG - existing nodes:
>>>>>>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>>>>>>> master (i-7950f657)>}
>>>>>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>>>>>> node i-7a50f654 in self._nodes
>>>>>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>>>>>> node i-7950f657 in self._nodes
>>>>>>>> 2014-01-16 16:53:14,064 cluster.py:727 - DEBUG - returning self._nodes
>>>>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>>>>> 2014-01-16 16:53:14,168 node.py:1039 - DEBUG - Using native OpenSSH client
>>>>>>>> 2014-01-16 16:53:14,169 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>>>>>>> /home/rsignell/.ssh/mykey2.rsa
>>>>>>>> root at ec2-54-204-55-67.compute-1.amazonaws.com
>>>>>>>> Connection closed by 54.204.55.67
>>>>>>>>
>>>>>>>>
>>>>>>>> I didn't see any "common problems" or "troubleshooting" sections in
>>>>>>>> the StarCluster documentation, and I checked the FAQ and the mailing
>>>>>>>> list archives, but I probably overlooked something, as this certainly
>>>>>>>> seems like a newbie question (and I am one).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rich
>>>>>>>> --
>>>>>>>> Dr. Richard P. Signell   (508) 457-2229
>>>>>>>> USGS, 384 Woods Hole Rd.
>>>>>>>> Woods Hole, MA 02543-1598
>>>>>>>> _______________________________________________
>>>>>>>> StarCluster mailing list
>>>>>>>> StarCluster at mit.edu
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dr. Richard P. Signell   (508) 457-2229
>>>>>> USGS, 384 Woods Hole Rd.
>>>>>> Woods Hole, MA 02543-1598
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Richard P. Signell   (508) 457-2229
>>>> USGS, 384 Woods Hole Rd.
>>>> Woods Hole, MA 02543-1598
>>
>>
>>
>> --
>> Dr. Richard P. Signell   (508) 457-2229
>> USGS, 384 Woods Hole Rd.
>> Woods Hole, MA 02543-1598



-- 
Dr. Richard P. Signell   (508) 457-2229
USGS, 384 Woods Hole Rd.
Woods Hole, MA 02543-1598

