<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<style type="text/css" style="display:none"><!--P{margin-top:0;margin-bottom:0;}--></style>

</head>

<body>

<div style="font-size:12pt;color:#000000;background-color:#FFFFFF;font-family:Calibri,Arial,Helvetica,sans-serif;">

<p>Hello Fellow Starclusterers!<br>

</p>

<p><br>

</p>

<p>We've been using StarCluster experimentally to train some models for a couple of months, and it's been great: easy to set up and easy to use.</p>

<p>But now we are considering making it a more permanent part of our tech stack, and&nbsp;<span style="font-size: 12pt;">I'm looking</span><span style="font-size: 12pt;"> more deeply into how it behaves when nodes fail, which is to be expected on Amazon. After some Googling, I found a few papers that describe large clusters (1,000 nodes, even 10,000 nodes), but very little discussion of possible&nbsp;failures&nbsp;and&nbsp;recovery. We are also using spot instances, which can be expected to fail even more often than the regular &quot;retail&quot; nodes on Amazon.</span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p><span style="font-size: 12pt;">I would very much appreciate it if someone on this list could point me to any resources, documentation, or discussion about what can be expected from StarCluster when failures occur. It would also be great to know what other features might be on the drawing board, as we might be able to help build them!</span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p>Specifically, I'm trying to find answers to the following questions. I would very much appreciate any experiences or resources that anyone can share on this!<br>

</p>

<p><br>

</p>

<p><span style="font-size: 12pt;">- If a node fails, is SGE/StarCluster able to detect this properly?</span><br>

</p>

<p><span style="font-size: 12pt;">- What happens to the jobs running on the failed node? Are they retried? Can they be configured to be retried? Does this work reliably?</span></p>

<p><span style="font-size: 12pt;">- What happens to SGE jobs if the master node dies?</span></p>

<p><span style="font-size: 12pt;">- Can the cluster be recovered if the master node is restarted? Is it a single point of failure?</span></p>

<p><span style="font-size: 12pt;">- If yes, does SGE itself support more redundancy than what is available as configured in StarCluster? Some diagrams in&nbsp;this presentation seem to imply so:&nbsp;</span><a href="http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf" style="font-size: 12pt;">http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf</a><br>

</p>

<p><br>

</p>

<p>- If one uses StarCluster without SGE, what is the behavior when the master node dies? Can the cluster be recovered from this?<br>

</p>

<p>- What if we limit the use of NFS and instead use a separate data storage system that provides its own high availability? Would this improve StarCluster's ability to recover from failures of nodes and the master?<br>

</p>

<p><br>

</p>

<p>Thanks very much for any information or anecdotes&nbsp;along these lines!<br>

</p>

<p>Best regards!<br>

</p>

<p>-Dmitry<br>

</p>

<p><br>

</p>

</div>

</body>

</html>