[StarCluster] load balancer stopped working?

Steve Darnell darnells at dnastar.com
Mon Jun 8 12:05:08 EDT 2015


Raj has commented in the past that the load balancer does not use the same logic as listclusters: http://star.mit.edu/cluster/mlarchives/2585.html
--

From that archived message:

The elastic load balancer parses the output of 'qhost' on the cluster:
https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59
I don't remember the exact reason for using that instead of the same logic as 'listclusters' above, but here's my guess a few years after the fact:
- Avoids another remote API call to AWS's tagging service to retrieve the tags for all instances in the account. This check runs every minute, so a quick call to your own cluster is preferable to a call to a remote API.
- qhost reports the machines that are correctly configured and able to process work. If a machine shows up in 'listclusters' but not in 'qhost', it is probably not usable for running jobs and would likely need manual cleanup.
HTH
Raj
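
For reference, the check Raj describes boils down to asking SGE itself which hosts it can see. The snippet below is only a rough sketch of that idea (the balancer's real parsing lives at the GitHub link above), assuming OGS/SGE's `qhost -xml` output format; comparing its result against the node list from listclusters shows quickly whether the two views have diverged:

```
# Rough sketch (not StarCluster's actual code): list the execution hosts
# that SGE currently knows about, assuming OGS/SGE's `qhost -xml` format.
import subprocess
import xml.etree.ElementTree as ET

def sge_hosts():
    xml_out = subprocess.check_output(['qhost', '-xml'])
    root = ET.fromstring(xml_out)
    # 'global' is a pseudo-host summarizing the whole grid, so skip it
    return [h.get('name') for h in root.findall('host')
            if h.get('name') != 'global']

if __name__ == '__main__':
    hosts = sge_hosts()
    print('%d execution hosts visible to SGE: %s' % (len(hosts), ', '.join(hosts)))
```

A node that appears in listclusters but is missing from this list is exactly the kind of node the balancer cannot reason about.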

From: starcluster-bounces at mit.edu On Behalf Of David Koppstein
Sent: Monday, June 08, 2015 10:24 AM
To: starcluster at mit.edu
Subject: Re: [StarCluster] load balancer stopped working?

Edit:
It appears that the load balancer thinks the cluster is not running, even though listclusters says it is running and I can successfully log in with sshmaster. I still can't figure out why this is the case.

Apologies for the spam.

ubuntu at ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

!!! ERROR - cluster dragon-1.3.0 is not running
ubuntu at ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config listclusters

-----------------------------------------------
dragon-1.3.0 (security group: @sc-dragon-1.3.0)
-----------------------------------------------
Launch time: 2015-04-26 03:40:22
Uptime: 43 days, 11:42:08
VPC: vpc-849ec2e1
Subnet: subnet-b6901fef
Zone: us-east-1d
Keypair: bean_key
EBS volumes:
    vol-34a33e73 on master:/dev/sdz (status: attached)
    vol-dc7beb9b on master:/dev/sdx (status: attached)
    vol-57148c10 on master:/dev/sdy (status: attached)
    vol-8ba835cc on master:/dev/sdv (status: attached)
    vol-9253ced5 on master:/dev/sdw (status: attached)
Cluster nodes:
     master running i-609aa79c 52.0.250.221
    node002 running i-f4d6470b 52.4.102.101
    node014 running i-52d6b2ad 52.7.159.255
    node016 running i-fb9ae804 54.88.226.88
    node017 running i-b275084d 52.5.86.254
    node020 running i-14532eeb 52.5.111.191
    node021 running i-874b3678 54.165.179.93
    node022 running i-5abfc2a5 54.85.47.151
    node023 running i-529ee3ad 52.1.197.60
    node024 running i-0792eff8 54.172.58.21
Total nodes: 10

ubuntu at ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config sshmaster dragon-1.3.0
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

The authenticity of host '52.0.250.221 (52.0.250.221)' can't be established.
ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '52.0.250.221' (ECDSA) to the list of known hosts.
          _                 _           _
__/\_____| |_ __ _ _ __ ___| |_   _ ___| |_ ___ _ __
\    / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|
/_  _\__ \ || (_| | | | (__| | |_| \__ \ ||  __/ |
  \/ |___/\__\__,_|_|  \___|_|\__,_|___/\__\___|_|

StarCluster Ubuntu 13.04 AMI
Software Tools for Academics and Researchers (STAR)
Homepage: http://star.mit.edu/cluster
Documentation: http://star.mit.edu/cluster/docs/latest
Code: https://github.com/jtriley/StarCluster
Mailing list: http://star.mit.edu/cluster/mailinglist.html

This AMI Contains:

  * Open Grid Scheduler (OGS - formerly SGE) queuing system
  * Condor workload management system
  * OpenMPI compiled with Open Grid Scheduler support
  * OpenBLAS - Highly optimized Basic Linear Algebra Routines
  * NumPy/SciPy linked against OpenBlas
  * Pandas - Data Analysis Library
  * IPython 1.1.0 with parallel and notebook support
  * Julia 0.3pre
  * and more! (use 'dpkg -l' to show all installed packages)

Open Grid Scheduler/Condor cheat sheet:

  * qstat/condor_q - show status of batch jobs
  * qhost/condor_status- show status of hosts, queues, and jobs
  * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)
  * qdel/condor_rm - delete batch jobs (e.g. qdel 7)
  * qconf - configure Open Grid Scheduler system

Current System Stats:

  System load:  0.0                Processes:           226
  Usage of /:   80.8% of 78.61GB   Users logged in:     2
  Memory usage: 6%                 IP address for eth0: 10.0.0.213
  Swap usage:   0%

  => There are 2 zombie processes.

    https://landscape.canonical.com/
Last login: Sun Apr 26 04:50:46 2015 from c-24-60-255-35.hsd1.ma.comcast.net
root at master:~#
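
For what it's worth, since StarCluster identifies a cluster by its @sc-<name> security group, one more sanity check is to ask EC2 directly which instances sit in that group and what state they report. The snippet below is a hand-rolled sketch using boto (the library StarCluster is built on), not StarCluster's own lookup; it assumes region and credentials are already configured for boto:

```
# Sketch: list the instances EC2 reports for the cluster's security group.
# This only mirrors the idea behind StarCluster's cluster lookup; it is
# not the actual implementation.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')
reservations = conn.get_all_reservations(
    filters={'instance.group-name': '@sc-dragon-1.3.0'})
for res in reservations:
    for inst in res.instances:
        print('%s  %s  %s' % (inst.id, inst.state, inst.tags))
```

If this shows the ten instances as 'running' while loadbalance still refuses to start, the problem is more likely in how the balancer resolves the cluster than in the cluster itself.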



On Mon, Jun 8, 2015 at 11:10 AM David Koppstein <david.koppstein at gmail.com> wrote:
Hi,

I noticed that my load balancer stopped working -- specifically, it has stopped deleting unnecessary nodes. It had been running fine for about three weeks.

I have a small t2.micro instance load balancing a cluster of m3.xlarge instances. The cluster is running Ubuntu 14.04, using the shared 14.04 AMI ami-38b99850.

The load balancer process is still running (it was started with nohup CMD &, where CMD is the loadbalance command shown below):

```
ubuntu at ip-10-0-0-20:~$ ps -ef | grep load
ubuntu   11784 11730  0 15:04 pts/1    00:00:00 grep --color=auto load
ubuntu   19493     1  0 Apr26 ?        01:25:03 /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
```

The queue has been empty for several days.

```
dkoppstein at master:/dkoppstein/150521SG_v1.9_round2$ qstat -u "*"
dkoppstein at master:/dkoppstein/150521SG_v1.9_round2$
```
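
(For completeness, the same "is the queue empty" question can be asked programmatically; the sketch below is my own, assuming OGS/SGE's `qstat -xml` output format, and just illustrates the kind of information the balancer has to act on.)

```
# Sketch (mine, not StarCluster's code): count running and pending jobs,
# assuming OGS/SGE's `qstat -xml` output format.
import subprocess
import xml.etree.ElementTree as ET

def sge_job_counts():
    xml_out = subprocess.check_output(['qstat', '-u', '*', '-xml'])
    root = ET.fromstring(xml_out)
    running = len(root.findall('./queue_info/job_list'))
    pending = len(root.findall('./job_info/job_list'))
    return running, pending

print('running=%d pending=%d' % sge_job_counts())
```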

However, about 8 nodes have been running over the weekend and are not being terminated, despite -n 1. If anyone has any guesses as to why the load balancer might stop working, please let me know so I can prevent this from happening in the future.

Thanks,
David


