<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-reply;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">Raj has commented in the past that the load balancer does not use the same logic as listclusters:
<a href="http://star.mit.edu/cluster/mlarchives/2585.html">http://star.mit.edu/cluster/mlarchives/2585.html</a><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">--<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">From that archived message:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">The elastic load balancer parses the output of 'qhost' on the cluster:
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><a href="https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59">https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59</a><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">I don't remember the exact reason for using that instead of the same logic as 'listclusters' above, but here's my guess a few years after the fact:
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">- Avoids another remote API call to AWS's tagging service to retrieve the tags for all instances within an account. This lookup has to run every minute, so
a speedy call to your own cluster instead of to a remote API is beneficial. <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">- qhost lists the machines that are correctly configured and able to process work. If a machine shows up in 'listclusters' but not in 'qhost', it is likely
not usable for processing jobs and would probably need manual cleanup. <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">HTH
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">Raj<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
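<p class="MsoNormal">To check Raj's second point by hand, here is a minimal sketch (not StarCluster's own code) that reads the "Cluster nodes:" section of listclusters output, in the layout shown further down this thread, and reports running nodes that the scheduler does not know about. The function names are made up for illustration, the sample is a trimmed excerpt of the listing below, and the qhost host list is hypothetical, since no qhost output was posted.</p>
<pre>
```python
def running_nodes(listclusters_text):
    """Return aliases of nodes marked 'running' in the 'Cluster nodes:' section."""
    names = []
    in_nodes = False
    for line in listclusters_text.splitlines():
        stripped = line.strip()
        if stripped == "Cluster nodes:":
            in_nodes = True
            continue
        if stripped.startswith("Total nodes:"):
            in_nodes = False
            continue
        if in_nodes:
            fields = stripped.split()
            # Assumed layout: alias, state, instance id, address
            if len(fields) == 4 and fields[1] == "running":
                names.append(fields[0])
    return names


def missing_from_scheduler(listed, scheduler_hosts):
    """Nodes that listclusters shows as running but the scheduler does not report."""
    known = set(scheduler_hosts)
    return [name for name in listed if name not in known]


# Trimmed excerpt of the listclusters output posted in this thread.
sample = """Cluster nodes:
 master running i-609aa79c 52.0.250.221
 node002 running i-f4d6470b 52.4.102.101
 node014 running i-52d6b2ad 52.7.159.255
Total nodes: 3
"""

# Hypothetical: pretend qhost reported only these two hosts.
qhost_hosts = ["master", "node002"]

print(missing_from_scheduler(running_nodes(sample), qhost_hosts))
```
</pre>
<p class="MsoNormal">With this made-up host list it prints ['node014']. Any node reported this way is up in EC2 but, per Raj's note, is probably not usable for jobs.</p>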
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> starcluster-bounces@mit.edu [mailto:starcluster-bounces@mit.edu]
<b>On Behalf Of </b>David Koppstein<br>
<b>Sent:</b> Monday, June 08, 2015 10:24 AM<br>
<b>To:</b> starcluster@mit.edu<br>
<b>Subject:</b> Re: [StarCluster] load balancer stopped working?<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">Edit: <o:p></o:p></p>
<div>
<p class="MsoNormal">It appears that the load balancer thinks the cluster is not running, even though listclusters says it is and I can log in successfully using sshmaster. I still can't figure out why this is the case. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Apologies for the spam. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.95.6)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Software Tools for Academics and Researchers (STAR)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Please submit bug reports to <a href="mailto:starcluster@mit.edu">
starcluster@mit.edu</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">!!! ERROR - cluster dragon-1.3.0 is not running<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config listclusters<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">-----------------------------------------------<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">dragon-1.3.0 (security group: @sc-dragon-1.3.0)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">-----------------------------------------------<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Launch time: 2015-04-26 03:40:22<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Uptime: 43 days, 11:42:08<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">VPC: vpc-849ec2e1<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Subnet: subnet-b6901fef<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Zone: us-east-1d<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Keypair: bean_key<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">EBS volumes:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> vol-34a33e73 on master:/dev/sdz (status: attached)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> vol-dc7beb9b on master:/dev/sdx (status: attached)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> vol-57148c10 on master:/dev/sdy (status: attached)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> vol-8ba835cc on master:/dev/sdv (status: attached)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> vol-9253ced5 on master:/dev/sdw (status: attached)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Cluster nodes:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> master running i-609aa79c 52.0.250.221<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node002 running i-f4d6470b 52.4.102.101<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node014 running i-52d6b2ad 52.7.159.255<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node016 running i-fb9ae804 54.88.226.88<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node017 running i-b275084d 52.5.86.254<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node020 running i-14532eeb 52.5.111.191<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node021 running i-874b3678 54.165.179.93<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node022 running i-5abfc2a5 54.85.47.151<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node023 running i-529ee3ad 52.1.197.60<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> node024 running i-0792eff8 54.172.58.21<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Total nodes: 10<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config sshmaster dragon-1.3.0<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.95.6)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Software Tools for Academics and Researchers (STAR)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Please submit bug reports to <a href="mailto:starcluster@mit.edu">
starcluster@mit.edu</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The authenticity of host '52.0.250.221 (52.0.250.221)' can't be established.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Are you sure you want to continue connecting (yes/no)? yes<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Warning: Permanently added '52.0.250.221' (ECDSA) to the list of known hosts.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> _ _ _<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">__/\_____| |_ __ _ _ __ ___| |_ _ ___| |_ ___ _ __<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">\ / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">/_ _\__ \ || (_| | | | (__| | |_| \__ \ || __/ |<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> \/ |___/\__\__,_|_| \___|_|\__,_|___/\__\___|_|<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">StarCluster Ubuntu 13.04 AMI<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Software Tools for Academics and Researchers (STAR)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Homepage: <a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Documentation: <a href="http://star.mit.edu/cluster/docs/latest">
http://star.mit.edu/cluster/docs/latest</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Code: <a href="https://github.com/jtriley/StarCluster">https://github.com/jtriley/StarCluster</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Mailing list: <a href="http://star.mit.edu/cluster/mailinglist.html">
http://star.mit.edu/cluster/mailinglist.html</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">This AMI Contains:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"> * Open Grid Scheduler (OGS - formerly SGE) queuing system<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * Condor workload management system<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * OpenMPI compiled with Open Grid Scheduler support<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * OpenBLAS - Highly optimized Basic Linear Algebra Routines<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * NumPy/SciPy linked against OpenBlas<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * Pandas - Data Analysis Library<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * IPython 1.1.0 with parallel and notebook support<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * Julia 0.3pre<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * and more! (use 'dpkg -l' to show all installed packages)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Open Grid Scheduler/Condor cheat sheet:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"> * qstat/condor_q - show status of batch jobs<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * qhost/condor_status- show status of hosts, queues, and jobs<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * qdel/condor_rm - delete batch jobs (e.g. qdel 7)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> * qconf - configure Open Grid Scheduler system<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Current System Stats:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"> System load: 0.0 Processes: 226<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> Usage of /: 80.8% of 78.61GB Users logged in: 2<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> Memory usage: 6% IP address for eth0: 10.0.0.213<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> Swap usage: 0%<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"> => There are 2 zombie processes.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"> <a href="https://landscape.canonical.com/">https://landscape.canonical.com/</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Last login: Sun Apr 26 04:50:46 2015 from c-24-60-255-35.hsd1.ma.comcast.net<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">root@master:~#<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Mon, Jun 8, 2015 at 11:10 AM David Koppstein &lt;<a href="mailto:david.koppstein@gmail.com">david.koppstein@gmail.com</a>&gt; wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<p class="MsoNormal">Hi, <o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I noticed that my load balancer stopped working -- specifically, it has stopped deleting unnecessary nodes. It's been running fine for about three weeks. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I have a small t2.micro instance load balancing a cluster of m3.xlarge instances. The cluster is running Ubuntu 14.04 using the shared 14.0. AMI <span style="font-size:11.5pt;font-family:"Arial","sans-serif";color:#444444">ami-38b99850. </span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The loadbalancer process is still running (started with nohup CMD &amp;, where CMD is the loadbalance command shown in the ps output below): <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">```<o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ ps -ef | grep load<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">ubuntu 11784 11730 0 15:04 pts/1 00:00:00 grep --color=auto load<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">ubuntu 19493 1 0 Apr26 ? 01:25:03 /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal">```<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Queue has been empty for several days. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">```<o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$ qstat -u "*"<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal">```<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">However, there are about 8 nodes that have been running over the weekend and are not being killed despite -n 1. If anyone has any guesses as to why the loadbalancer might stop working please let me know so I can prevent this from happening
in the future. <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">David<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>