<div dir="ltr">Thanks, Steve. Indeed, I also noticed that the starcluster rn command wasn't working, which calls qconf:<div><br></div><div><span style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:11.8999996185303px;line-height:inherit;color:rgb(51,51,51);background-color:transparent">git:(master) ✗ starcluster -c output/starcluster_config.ini rn -n 8 dragon-1.3.0</span><pre style="overflow:auto;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:11.8999996185303px;margin-top:0px;font-stretch:normal;line-height:1.45;padding:16px;border-radius:3px;word-wrap:normal;color:rgb(51,51,51);margin-bottom:0px!important;background-color:rgb(247,247,247)"><code style="font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:11.8999996185303px;padding:0px;margin:0px;border-radius:3px;word-break:normal;border:0px;display:inline;max-width:initial;overflow:initial;line-height:inherit;word-wrap:normal;background:transparent">StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a>
*** WARNING - Setting 'AWS_SECRET_ACCESS_KEY' from environment...
*** WARNING - Setting 'AWS_ACCESS_KEY_ID' from environment...
Remove 8 nodes from dragon-1.3.0 (y/n)? y
>>> Running plugin starcluster.plugins.users.CreateUsers
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Removing node024 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && qconf -dconf node024'
!!! ERROR - failed with status 1:
!!! ERROR - can't resolve hostname "node024"
!!! ERROR - can't delete configuration "node024" from list:
!!! ERROR - configuration does not exist</code></pre><div><br></div>So it looks like the cluster somehow got into a state where one machine was still running and still in the starcluster security group, but was no longer configured to run jobs. If anyone has run into this behavior before and knows how to prevent it from happening, I'd appreciate the feedback, as the cost of ten big nodes adds up quickly =)<br><br>Thanks,</div><div>David</div></div><br><div class="gmail_quote"><div dir="ltr">On Mon, Jun 8, 2015 at 12:04 PM Steve Darnell <<a href="mailto:darnells@dnastar.com">darnells@dnastar.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div lang="EN-US" link="blue" vlink="purple">
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Raj has commented in the past that the load balancer does not use the same logic as listclusters:
<a href="http://star.mit.edu/cluster/mlarchives/2585.html" target="_blank">http://star.mit.edu/cluster/mlarchives/2585.html</a><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">--<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">From that archived message:<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">The elastic load balancer parses the output of 'qhost' on the cluster:
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><a href="https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59" target="_blank">https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59</a><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">I don't remember the exact reason for using that instead of the same logic as 'listclusters' above, but here's my guess a few years after the fact:
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">- Avoids another remote API call to AWS' tagging service to retrieve the tags for all instances within an account. This needs to be called every minute, so
a speedy call to your cluster instead of to a remote API is beneficial <u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">- qhost outputs the number of machines correctly configured and able to process work. If a machine shows up in 'listcluster' but not in 'qhost' it's likely
not usable to process jobs, and would probably need manual cleanup. <u></u><u></u></span></p>
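<p class="MsoNormal">To make that qhost-based logic concrete, here is a minimal Python sketch of the idea: pull the exec hosts out of a plain-text qhost table and diff them against what EC2 reports. This is a hypothetical illustration only, not StarCluster's actual parser (the real one is at the GitHub link above), and SAMPLE_QHOST, sge_hosts, and the ec2_nodes set are all made up for the example.</p>

```python
# Hypothetical sketch of the balancer's qhost-vs-EC2 comparison.
# Classic plain-text `qhost` layout: two header lines, then one row
# per host, plus the pseudo-host "global" which is not a real node.
SAMPLE_QHOST = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  lx26-amd64      4  0.01    7.5G  400.0M    0.0     0.0
node002                 lx26-amd64      4  0.00    7.5G  350.0M    0.0     0.0
"""

def sge_hosts(qhost_output):
    """Return the hostnames SGE actually knows about, skipping the
    two header lines and the pseudo-host 'global'."""
    hosts = []
    for line in qhost_output.splitlines()[2:]:
        name = line.split()[0]
        if name != "global":
            hosts.append(name)
    return hosts

# Hypothetical EC2 view (what listclusters would report as running):
ec2_nodes = {"master", "node002", "node024"}
orphans = ec2_nodes - set(sge_hosts(SAMPLE_QHOST))
print(sorted(orphans))  # → ['node024']
```

<p class="MsoNormal">A node that EC2 reports as running but that qhost does not list is exactly the "running but not usable" case described above: the balancer never counts it, and it needs manual cleanup.</p>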
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">HTH
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Raj<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> <a href="mailto:starcluster-bounces@mit.edu" target="_blank">starcluster-bounces@mit.edu</a> [mailto:<a href="mailto:starcluster-bounces@mit.edu" target="_blank">starcluster-bounces@mit.edu</a>]
<b>On Behalf Of </b>David Koppstein<br>
<b>Sent:</b> Monday, June 08, 2015 10:24 AM<br>
<b>To:</b> <a href="mailto:starcluster@mit.edu" target="_blank">starcluster@mit.edu</a><br>
<b>Subject:</b> Re: [StarCluster] load balancer stopped working?<u></u><u></u></span></p></div></div><div lang="EN-US" link="blue" vlink="purple"><div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">Edit: <u></u><u></u></p>
<div>
<p class="MsoNormal">It appears that the load balancer thinks the cluster is not running, even though listclusters says it is and I can successfully log in using sshmaster. I still can't figure out why this is the case. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Apologies for the spam. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">StarCluster - (<a href="http://star.mit.edu/cluster" target="_blank">http://star.mit.edu/cluster</a>) (v. 0.95.6)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Software Tools for Academics and Researchers (STAR)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Please submit bug reports to <a href="mailto:starcluster@mit.edu" target="_blank">
starcluster@mit.edu</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">!!! ERROR - cluster dragon-1.3.0 is not running<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config listclusters<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">-----------------------------------------------<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">dragon-1.3.0 (security group: @sc-dragon-1.3.0)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">-----------------------------------------------<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Launch time: 2015-04-26 03:40:22<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Uptime: 43 days, 11:42:08<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">VPC: vpc-849ec2e1<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Subnet: subnet-b6901fef<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Zone: us-east-1d<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Keypair: bean_key<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">EBS volumes:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> vol-34a33e73 on master:/dev/sdz (status: attached)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> vol-dc7beb9b on master:/dev/sdx (status: attached)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> vol-57148c10 on master:/dev/sdy (status: attached)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> vol-8ba835cc on master:/dev/sdv (status: attached)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> vol-9253ced5 on master:/dev/sdw (status: attached)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Cluster nodes:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> master running i-609aa79c 52.0.250.221<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node002 running i-f4d6470b 52.4.102.101<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node014 running i-52d6b2ad 52.7.159.255<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node016 running i-fb9ae804 54.88.226.88<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node017 running i-b275084d 52.5.86.254<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node020 running i-14532eeb 52.5.111.191<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node021 running i-874b3678 54.165.179.93<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node022 running i-5abfc2a5 54.85.47.151<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node023 running i-529ee3ad 52.1.197.60<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> node024 running i-0792eff8 54.172.58.21<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Total nodes: 10<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config sshmaster dragon-1.3.0<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">StarCluster - (<a href="http://star.mit.edu/cluster" target="_blank">http://star.mit.edu/cluster</a>) (v. 0.95.6)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Software Tools for Academics and Researchers (STAR)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Please submit bug reports to <a href="mailto:starcluster@mit.edu" target="_blank">
starcluster@mit.edu</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">The authenticity of host '52.0.250.221 (52.0.250.221)' can't be established.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Are you sure you want to continue connecting (yes/no)? yes<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Warning: Permanently added '52.0.250.221' (ECDSA) to the list of known hosts.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> _ _ _<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">__/\_____| |_ __ _ _ __ ___| |_ _ ___| |_ ___ _ __<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">\ / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">/_ _\__ \ || (_| | | | (__| | |_| \__ \ || __/ |<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> \/ |___/\__\__,_|_| \___|_|\__,_|___/\__\___|_|<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">StarCluster Ubuntu 13.04 AMI<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Software Tools for Academics and Researchers (STAR)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Homepage: <a href="http://star.mit.edu/cluster" target="_blank">http://star.mit.edu/cluster</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Documentation: <a href="http://star.mit.edu/cluster/docs/latest" target="_blank">
http://star.mit.edu/cluster/docs/latest</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Code: <a href="https://github.com/jtriley/StarCluster" target="_blank">https://github.com/jtriley/StarCluster</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Mailing list: <a href="http://star.mit.edu/cluster/mailinglist.html" target="_blank">
http://star.mit.edu/cluster/mailinglist.html</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">This AMI Contains:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"> * Open Grid Scheduler (OGS - formerly SGE) queuing system<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * Condor workload management system<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * OpenMPI compiled with Open Grid Scheduler support<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * OpenBLAS - Highly optimized Basic Linear Algebra Routines<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * NumPy/SciPy linked against OpenBlas<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * Pandas - Data Analysis Library<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * IPython 1.1.0 with parallel and notebook support<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * Julia 0.3pre<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * and more! (use 'dpkg -l' to show all installed packages)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Open Grid Scheduler/Condor cheat sheet:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"> * qstat/condor_q - show status of batch jobs<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * qhost/condor_status- show status of hosts, queues, and jobs<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * qdel/condor_rm - delete batch jobs (e.g. qdel 7)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> * qconf - configure Open Grid Scheduler system<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Current System Stats:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"> System load: 0.0 Processes: 226<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> Usage of /: 80.8% of 78.61GB Users logged in: 2<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> Memory usage: 6% IP address for eth0: 10.0.0.213<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> Swap usage: 0%<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"> => There are 2 zombie processes.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"> <a href="https://landscape.canonical.com/" target="_blank">https://landscape.canonical.com/</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Last login: Sun Apr 26 04:50:46 2015 from <a href="http://c-24-60-255-35.hsd1.ma.comcast.net" target="_blank">
c-24-60-255-35.hsd1.ma.comcast.net</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">root@master:~#<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">On Mon, Jun 8, 2015 at 11:10 AM David Koppstein <<a href="mailto:david.koppstein@gmail.com" target="_blank">david.koppstein@gmail.com</a>> wrote:<u></u><u></u></p>
</div>
<blockquote style="border:none;border-left:solid #cccccc 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<p class="MsoNormal">Hi, <u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I noticed that my load balancer stopped working -- specifically, it has stopped deleting unnecessary nodes. It's been running fine for about three weeks. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I have a small t2.micro instance load-balancing a cluster of m3.xlarge nodes. The cluster is running Ubuntu 14.04 using the shared 14.04 AMI <span style="font-size:11.5pt;font-family:"Arial","sans-serif";color:#444444">ami-38b99850. </span><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">The loadbalancer process is still running (started with nohup CMD &, where CMD is the loadbalancer command below): <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">```<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">ubuntu@ip-10-0-0-20:~$ ps -ef | grep load<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">ubuntu 11784 11730 0 15:04 pts/1 00:00:00 grep --color=auto load<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">ubuntu 19493 1 0 Apr26 ? 01:25:03 /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal">```<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Queue has been empty for several days. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">```<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal"><a href="mailto:dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$" target="_blank">dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$</a> qstat -u "*"<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><a href="mailto:dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$" target="_blank">dkoppstein@master:/dkoppstein/150521SG_v1.9_round2$</a><u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal">```<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">However, there are about 8 nodes that have been running over the weekend and are not being killed, despite -n 1. If anyone has any guesses as to why the loadbalancer might stop working, please let me know so I can prevent this from happening in the future. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Thanks,<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">David<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
</div>
</blockquote>
</div>
</div></div></blockquote></div>