[StarCluster] set_keepalive

David Stuebe DStuebe at asascience.com
Fri Apr 25 11:19:16 EDT 2014



Hi Starcluster

I have been building some plugins that take a while to run because they build and install large libraries. As a result, I have seen ssh terminate my connection while the process is still running. When that happens the command seems to return exit code 1, although the process continues on the cluster.

In my custom plugins I added the following lines to set a keepalive on each node's ssh transport.

def run(self, nodes, master, user, user_shell, volumes):
…
        for node in nodes:
            node.ssh.transport.set_keepalive(30)
…
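
A minimal sketch of the same idea as a reusable helper, so each plugin does not have to repeat the loop. The name enable_keepalive is hypothetical and not part of StarCluster; it only assumes that every node exposes node.ssh.transport with paramiko's set_keepalive(interval), as in the snippet above.

```python
def enable_keepalive(nodes, interval=30):
    """Set an SSH keepalive on every node's transport.

    `interval` is in seconds; paramiko treats 0 as "keepalive disabled".
    Returns the number of nodes that were configured.
    """
    configured = 0
    for node in nodes:
        # Assumption: node.ssh.transport is a paramiko Transport (or
        # anything else that provides set_keepalive).
        node.ssh.transport.set_keepalive(interval)
        configured += 1
    return configured


# Hypothetical usage inside a plugin's
# run(self, nodes, master, user, user_shell, volumes):
#     enable_keepalive(nodes, interval=30)
```

The interval should be shorter than whatever idle timeout sits between the client and the cluster, otherwise the connection can still be dropped between two keepalive packets.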

Setting it in each plugin works, but you might consider adding it somewhere in starcluster itself, probably in the connect method of SSHClient:
https://github.com/jtriley/StarCluster/blob/develop/starcluster/sshutils.py#L100


Here are the methods from paramiko:
https://github.com/paramiko/paramiko/blob/master/paramiko/packet.py#L175
https://github.com/paramiko/paramiko/blob/master/paramiko/transport.py#L762


Another step that would help is to configure a longer idle timeout in the default /etc/ssh/sshd_config in the cluster AMI.

For instance I have used one of my plugins to set:
ClientAliveInterval 600
ClientAliveCountMax 3

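With these values sshd sends an alive message after 600 seconds without data from the client and drops the connection only after 3 of them go unanswered, so a silent connection should stay open for about half an hour.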
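
As a rough sketch of how a plugin could apply these two settings, the helper below rewrites the text of an sshd_config so that each directive appears exactly once. The names set_sshd_options and idle_timeout_seconds are hypothetical, and the code makes simplifying assumptions: the file has no Match blocks, and any line whose first word (after a leading "#") is one of the directives may be replaced. Existing lines are replaced in place rather than appended to, because sshd uses the first value it finds for a keyword.

```python
def idle_timeout_seconds(interval, count_max):
    """Seconds a silent client is tolerated before sshd disconnects it."""
    return interval * count_max


def set_sshd_options(config_text, options):
    """Return config_text with every directive in `options` set once."""
    # sshd keywords are case-insensitive, so compare in lower case.
    wanted = dict((k.lower(), (k, v)) for k, v in options.items())
    seen = set()
    out = []
    for line in config_text.splitlines():
        # Also match commented-out defaults such as "#ClientAliveInterval 0".
        parts = line.lstrip("# \t").split(None, 1)
        key = parts[0].lower() if parts else ""
        if key in wanted:
            if key not in seen:
                out.append("%s %s" % wanted[key])
                seen.add(key)
            # Later duplicates of the same directive are dropped.
            continue
        out.append(line)
    # Append directives that were not present in the file at all.
    for key, (name, value) in wanted.items():
        if key not in seen:
            out.append("%s %s" % (name, value))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "Port 22\n#ClientAliveInterval 0\n"
    print(set_sshd_options(sample, {"ClientAliveInterval": 600,
                                    "ClientAliveCountMax": 3}))
    print(idle_timeout_seconds(600, 3))  # 600 * 3 seconds
```

A plugin would still have to write the result back to /etc/ssh/sshd_config on each node and restart sshd for the change to take effect.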

David Stuebe
Scientist & Software Engineer

55 Village Square Drive
South Kingstown, RI 02879-8248

Tel: +1 (401) 789-6224
Email: David.Stuebe at rpsgroup.com
www: asascience.com | rpsgroup.com

A member of the RPS Group plc