<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style>
<!--
@font-face
        {font-family:"Cambria Math"}
@font-face
        {font-family:Calibri}
@font-face
        {font-family:"Segoe UI"}
@font-face
        {font-family:Tahoma}
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif"}
a:link, span.MsoHyperlink
        {color:blue;
        text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
        {color:purple;
        text-decoration:underline}
p
        {margin-right:0in;
        margin-left:0in;
        font-size:12.0pt;
        font-family:"Times New Roman","serif"}
span.EmailStyle18
        {font-family:"Calibri","sans-serif";
        color:#1F497D}
.MsoChpDefault
        {font-family:"Calibri","sans-serif"}
@page WordSection1
        {margin:1.0in 1.0in 1.0in 1.0in}
-->
</style><style type="text/css" id="owaParaStyle"></style>
</head>
<body lang="EN-US" link="blue" vlink="purple" fpstyle="1" ocsi="0">
<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Thanks for that plugin suggestion! I tried avoiding the master node through &quot;qsub -l&quot;, and so far that seems to be helping!
<div><br>
As for using a t2.micro or t2.small, those don't seem to be an option in StarCluster. They are not listed as instance options in my config file, and when I previously tried an instance type outside of those listed, I got an error. If anyone knows a way around that, I'd
 be interested!</div>
<div><br>
</div>
<div>Thanks to both of you</div>
<div>Amanda</div>
<div><br>
<div style="font-family: Times New Roman; color: #000000; font-size: 16px">
<hr tabindex="-1">
<div id="divRpF828017" style="direction: ltr;"><font face="Tahoma" size="2" color="#000000"><b>From:</b> MacMullan, Hugh [hughmac@wharton.upenn.edu]<br>
<b>Sent:</b> Tuesday, September 23, 2014 2:50 PM<br>
<b>To:</b> Amanda Joy Kedaigle<br>
<b>Cc:</b> starcluster@mit.edu<br>
<b>Subject:</b> RE: [StarCluster] FW: commlib error<br>
</font><br>
</div>
<div></div>
<div>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">Amanda:</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">I agree with Rajat's t2 suggestion … even just a t2.micro will help over a t1.micro … and it's cheaper!</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">And are you running jobs on the master as well as the nodes? If so, you could disable that:</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">[cluster mycluster]</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">DISABLE_QUEUE=True</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">PLUGINS = sge</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">[plugin sge]</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">setup_class = starcluster.plugins.sge.SGEPlugin</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">master_is_exec_host = False</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">That might help with stability a good bit.</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">You can also use spot pricing for the master to get a beefier master for a much lower price … but of course with the risk of losing the whole cluster if you
 are outbid.</span></p>
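<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">A rough sketch of the spot-master idea, assuming the -b/--bid and --force-spot-master options of "starcluster start" (please confirm with "starcluster start --help" on your version). The template name, cluster tag, and bid below are hypothetical, and the helper only builds the command line rather than running it:</span></p>
<pre>
```python
import shlex


def build_start_command(cluster_tag, template, spot_bid, spot_master=True):
    """Return the argv list for a `starcluster start` call that requests spot instances."""
    # Format the bid as a plain decimal string (dollars per hour).
    argv = ["starcluster", "start", "-c", template, "-b", "%.3f" % spot_bid]
    if spot_master:
        # Assumption: without this flag the master stays on a flat-rate instance.
        argv.append("--force-spot-master")
    argv.append(cluster_tag)
    return argv


# Hypothetical values; print the command so it can be reviewed before use.
print(" ".join(shlex.quote(arg) for arg in build_start_command("mycluster", "mycluster", 0.05)))
```
</pre>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">Passing spot_master=False leaves the flag out, which keeps the master safe from being outbid while the nodes still run on spot.</span></p>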
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">Good luck with the project!</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">-Hugh</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;; color:#1F497D">&nbsp;</span></p>
<p class="MsoNormal"><b><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;">From:</span></b><span style="font-size:11.0pt; font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;"> starcluster-bounces@mit.edu [mailto:starcluster-bounces@mit.edu]
<b>On Behalf Of </b>Rajat Banerjee<br>
<b>Sent:</b> Tuesday, September 23, 2014 1:11 PM<br>
<b>To:</b> Amanda Joy Kedaigle<br>
<b>Cc:</b> starcluster@mit.edu<br>
<b>Subject:</b> Re: [StarCluster] FW: commlib error</span></p>
<p class="MsoNormal">&nbsp;</p>
<div>
<div>
<div>
<p class="MsoNormal">Hi Amanda,</p>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt">I googled your error and found a few pages suggesting that the SGE service on the master node went down:<br>
<br>
<a href="http://verahill.blogspot.com/2012/08/sun-gridengine-commlib-error-got-select.html" target="_blank">http://verahill.blogspot.com/2012/08/sun-gridengine-commlib-error-got-select.html</a><br>
<br>
<a href="https://supcom.hgc.jp/english/utili_info/manual/faq.html" target="_blank">https://supcom.hgc.jp/english/utili_info/manual/faq.html</a><br>
<br>
<a href="http://comments.gmane.org/gmane.comp.clustering.gridengine.users/17283" target="_blank">http://comments.gmane.org/gmane.comp.clustering.gridengine.users/17283</a></p>
</div>
<p class="MsoNormal">If your OpenBLAS command is killing the process on the master, that could cause your issues, according to those authors. Sorry I don't have anything more helpful, but the t2.small is still less than $0.03 per hour now, so it may not increase your
 costs too much.<br>
<br>
Raj</p>
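<p class="MsoNormal">One way to tell a dead qmaster apart from a general network problem is to test the qmaster port directly. This is only a sketch: the helper name is made up for illustration, and it assumes the port from the error message (63231) and the host alias "master" as seen from inside the cluster.</p>
<pre>
```python
import socket


def qmaster_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers "Connection refused", timeouts, and unresolvable host names.
        return False
    conn.close()
    return True


# Example call from a cluster node (values taken from the error in this thread):
#   qmaster_is_listening("master", 63231)
```
</pre>
<p class="MsoNormal">If this returns False while ssh to the master still works, that is consistent with the sge_qmaster process having died rather than the instance being unreachable.</p>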
</div>
<div>
<p class="MsoNormal">&nbsp;</p>
<div>
<p class="MsoNormal">On Tue, Sep 23, 2014 at 12:55 PM, Amanda Joy Kedaigle &lt;<a href="mailto:mandyjoy@mit.edu" target="_blank">mandyjoy@mit.edu</a>&gt; wrote:</p>
<blockquote style="border:none; border-left:solid #CCCCCC 1.0pt; padding:0in 0in 0in 6.0pt; margin-left:4.8pt; margin-right:0in">
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">Thanks, Raj. I can communicate with the master node, it just looks like SGE is failing. I restarted the cluster and everything seemed to be working, but then it just failed in
 the same way again.</span><span style="color:black"></span></p>
</div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal"><span style="color:black">&nbsp;</span></p>
</div>
<p class="MsoNormal"><span style="color:black">&gt; starcluster listclusters (should list status of all your active clusters and running nodes)</span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:black">&nbsp;</span></p>
</div>
<div>
<p><span style="color:black">-----------------------------------------------------</span></p>
<p><span style="color:black">fraenkelcluster (security group: @sc-fraenkelcluster)</span></p>
<p><span style="color:black">-----------------------------------------------------</span></p>
<p><span style="color:black">Launch time: 2014-09-23 11:59:43</span></p>
<p><span style="color:black">Uptime: 0 days, 00:45:58</span></p>
<p><span style="color:black">VPC: vpc-c71f0fa5</span></p>
<p><span style="color:black">Subnet: subnet-e6b8c8ce</span></p>
<p><span style="color:black">Zone: us-east-1c</span></p>
<p><span style="color:black">Keypair: fraenkel-keypair</span></p>
<p><span style="color:black">EBS volumes:</span></p>
<p><span style="color:black">&nbsp; &nbsp; vol-5e75ba11 on master:/dev/sdz (status: attached)</span></p>
<p><span style="color:black">Cluster nodes:</span></p>
<p><span style="color:black">&nbsp;&nbsp; &nbsp; master running i-acc76242 <a href="http://ec2-54-164-81-80.compute-1.amazonaws.com" target="_blank">
ec2-54-164-81-80.compute-1.amazonaws.com</a></span></p>
<p><span style="color:black">&nbsp; &nbsp; node001 running i-5177ddbf <a href="http://ec2-54-164-98-38.compute-1.amazonaws.com" target="_blank">
ec2-54-164-98-38.compute-1.amazonaws.com</a></span></p>
<p><span style="color:black">&nbsp; &nbsp; node002 running i-9976c077 <a href="http://ec2-54-164-88-184.compute-1.amazonaws.com" target="_blank">
ec2-54-164-88-184.compute-1.amazonaws.com</a></span></p>
<p><span style="color:black">&nbsp; &nbsp; node003 running i-9e76c070 <a href="http://ec2-54-164-38-146.compute-1.amazonaws.com" target="_blank">
ec2-54-164-38-146.compute-1.amazonaws.com</a></span></p>
<p><span style="color:black">&nbsp; &nbsp; node004 running i-1776c0f9 <a href="http://ec2-54-86-252-119.compute-1.amazonaws.com" target="_blank">
ec2-54-86-252-119.compute-1.amazonaws.com</a></span></p>
<p><span style="color:black">&nbsp; &nbsp; node005 running i-1676c0f8 <a href="http://ec2-54-165-66-3.compute-1.amazonaws.com" target="_blank">
ec2-54-165-66-3.compute-1.amazonaws.com</a></span></p>
<p><span style="color:black">Total nodes: 6</span></p>
<p class="MsoNormal"><span style="color:black">&nbsp;</span></p>
</div>
<p class="MsoNormal"><span style="color:black">&gt; starcluster sshmaster &lt;your cluster name&gt;</span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:black">&nbsp;</span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">works just fine, I am ssh'd into master under root user.</span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">&nbsp;</span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">Some more details: I am wondering if this is because my master node is a t1.micro. Either it is an older generation that has not been updated, or it doesn't have enough
 memory to run the queue? In my initial tests, running thousands of simple jobs, it worked fine, and the load balancer added and removed nodes as expected. However, when running slightly more intensive jobs that use the Python module networkx, the
 jobs give this error and then SGE dies:</span></p>
</div>
<div>
<div>
<p class="MsoNormal"><span style="font-size:13.5pt; font-family:&quot;Segoe UI&quot;,&quot;sans-serif&quot;; color:black">OpenBLAS : Your OS does not support AVX instructions. OpenBLAS is using Nehalem kernels as a fallback, which may give poorer performance.</span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:13.5pt; font-family:&quot;Segoe UI&quot;,&quot;sans-serif&quot;; color:black">Killed</span></p>
</div>
</div>
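<div>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">A bare "Killed" line is often the Linux out-of-memory killer at work, though nothing in this output confirms that here. As a rough check, the sketch below parses /proc/meminfo-style text to estimate how much memory is free on the machine; the function names are illustrative only.</span></p>
<pre>
```python
def meminfo_kb(text):
    """Parse /proc/meminfo-style text into a dict of field name -> size in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" lines.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_memory_mb(text):
    """Approximate free memory in MB as MemFree + Buffers + Cached."""
    info = meminfo_kb(text)
    total_kb = sum(info.get(key, 0) for key in ("MemFree", "Buffers", "Cached"))
    return total_kb / 1024.0


# On the node in question (Linux only):
#   free_memory_mb(open("/proc/meminfo").read())
```
</pre>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">Running this on the master just before submitting the heavier jobs, and again while they run, would show whether memory is close to exhausted when SGE dies.</span></p>
</div>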
<div>
<p class="MsoNormal"><span style="font-size:13.5pt; font-family:&quot;Segoe UI&quot;,&quot;sans-serif&quot;; color:black">&nbsp;</span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">I would really like to have a very cheap master node since I expect to keep it running 24/7, but only use the cluster in bursts.&nbsp;</span></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="color:black">&nbsp;</span></p>
<div>
<p class="MsoNormal"><span style="color:black">On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle &lt;<a href="mailto:mandyjoy@mit.edu" target="_blank">mandyjoy@mit.edu</a>&gt; wrote:</span></p>
<blockquote style="border:none; border-left:solid #CCCCCC 1.0pt; padding:0in 0in 0in 6.0pt; margin-left:4.8pt; margin-right:0in">
<div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black">Hi,
<br>
<br>
I am trying to run starcluster's loadbalancer to keep only one node running until jobs are submitted to the cluster. I know it's an experimental feature, but I'm wondering if anyone has run into this error before, or has any suggestions. The cluster has been
 whittled down to 1 node after a weekend of inactivity, and now it seems that when jobs are submitted to the queue, instead of adding nodes, SGE fails.<br>
<br>
&gt;&gt;&gt; Loading full job history<br>
*** WARNING - Failed to retrieve stats (1/5):<br>
Traceback (most recent call last):<br>
&nbsp; File &quot;/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py&quot;, line 552, in get_stats<br>
&nbsp;&nbsp;&nbsp; return self._get_stats()<br>
&nbsp; File &quot;/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py&quot;, line 522, in _get_stats<br>
&nbsp;&nbsp;&nbsp; qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))<br>
&nbsp; File &quot;/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py&quot;, line 578, in execute<br>
&nbsp;&nbsp;&nbsp; msg, command, exit_status, out_str)<br>
RemoteCommandFailed: remote command 'source /etc/profile &amp;&amp; qhost -xml' failed with status 1:<br>
error: commlib error: got select error (Connection refused)<br>
error: unable to send message to qmaster using port 63231 on host &quot;master&quot;: got send error<br>
<br>
Thanks for any help!</span><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:#888888"><br>
Amanda</span><span style="font-size:10.0pt; font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;; color:black"></span></p>
</div>
</div>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="color:black"><br>
_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu" target="_blank">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a></span></p>
</blockquote>
</div>
<p class="MsoNormal"><span style="color:black">&nbsp;</span></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal">&nbsp;</p>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>