[StarCluster] load balancer stopped working?

David Koppstein david.koppstein at gmail.com
Mon Jun 8 12:36:57 EDT 2015


Thanks, Steve. Indeed, I also noticed that the starcluster rn command, which
calls qconf, wasn't working:

git:(master) ✗ starcluster -c output/starcluster_config.ini rn -n 8
dragon-1.3.0

StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

*** WARNING - Setting 'AWS_SECRET_ACCESS_KEY' from environment...
*** WARNING - Setting 'AWS_ACCESS_KEY_ID' from environment...
Remove 8 nodes from dragon-1.3.0 (y/n)? y
>>> Running plugin starcluster.plugins.users.CreateUsers
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Removing node024 from SGE
!!! ERROR - Error occured while running plugin
'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && qconf -dconf node024'
!!! ERROR - failed with status 1:
!!! ERROR - can't resolve hostname "node024"
!!! ERROR - can't delete configuration "node024" from list:
!!! ERROR - configuration does not exist


So it looks like the cluster had somehow ended up in a state where one
machine was still running, and still in the StarCluster security group, but
was no longer configured in SGE to run jobs (a possible manual workaround is
sketched below). If anyone has run into this behavior before and knows how to
prevent it, I'd appreciate the feedback, as the cost of ten big nodes adds up
quickly =)
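
For what it's worth, a hedged workaround sketch (an assumption, not a
StarCluster command): since qconf no longer knows about node024, the orphaned
instance can be terminated directly, for example with the AWS CLI, using the
instance id that listclusters reports for node024 in the output quoted
further down (i-0792eff8).

```
# Hedged workaround, assuming the AWS CLI is configured for the same account:
# terminate the orphaned node024 instance directly, since qconf can no longer
# remove it from the SGE configuration. i-0792eff8 is node024's instance id
# from the listclusters output quoted below.
aws ec2 terminate-instances --instance-ids i-0792eff8
```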

Thanks,
David

On Mon, Jun 8, 2015 at 12:04 PM Steve Darnell <darnells at dnastar.com> wrote:

>  Raj has commented in the past that the load balancer does not use the
> same logic as listclusters:
> http://star.mit.edu/cluster/mlarchives/2585.html
>
> --
>
>
>
> From that archived message:
>
>
>
> The elastic load balancer parses the output of 'qhost' on the cluster:
>
>
> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L59
>
>  I don't remember the exact reason for using that instead of the same
> logic as 'listclusters' above, but here's my guess a few years after the
> fact:
>
> - Avoids another remote API call to AWS' tagging service to retrieve the
> tags for all instances within an account. This needs to be called every
> minute, so a speedy call to your cluster instead of to a remote API is
> beneficial
>
> - qhost outputs the number of machines correctly configured and able to
> process work. If a machine shows up in 'listclusters' but not in 'qhost',
> it's likely not usable to process jobs, and would probably need manual
> cleanup.
>
> HTH
>
> Raj
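
As a concrete illustration of the manual check Raj describes (plain SGE
commands, nothing StarCluster-specific, run on the master): compare what the
scheduler itself reports against the node list from listclusters; a node that
listclusters shows but qhost does not is exactly the kind of orphan the
balancer will never count.

```
# On the master: the hosts SGE can actually schedule on.
qhost          # one line per exec host the scheduler knows about
qconf -sel     # the configured execution host list
# Any node present in 'starcluster listclusters' but missing here is likely
# unusable for jobs and needs manual cleanup, as described above.
```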
>
>
>
> From: starcluster-bounces at mit.edu [mailto:starcluster-bounces at mit.edu] On
> Behalf Of David Koppstein
> Sent: Monday, June 08, 2015 10:24 AM
> To: starcluster at mit.edu
> Subject: Re: [StarCluster] load balancer stopped working?
>
>
>
> Edit:
>
> It appears that the load balancer thinks the cluster is not running, even
> though listclusters says it is and I can successfully log in using
> sshmaster. I still can't figure out why this is the case.
>
>
>
> Apologies for the spam.
>
>
>
> ubuntu at ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python
> /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config
> loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
>
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>
> Software Tools for Academics and Researchers (STAR)
>
> Please submit bug reports to starcluster at mit.edu
>
>
>
> !!! ERROR - cluster dragon-1.3.0 is not running
>
> ubuntu at ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python
> /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config
> listclusters
>
>
>
> -----------------------------------------------
>
> dragon-1.3.0 (security group: @sc-dragon-1.3.0)
>
> -----------------------------------------------
>
> Launch time: 2015-04-26 03:40:22
>
> Uptime: 43 days, 11:42:08
>
> VPC: vpc-849ec2e1
>
> Subnet: subnet-b6901fef
>
> Zone: us-east-1d
>
> Keypair: bean_key
>
> EBS volumes:
>
>     vol-34a33e73 on master:/dev/sdz (status: attached)
>
>     vol-dc7beb9b on master:/dev/sdx (status: attached)
>
>     vol-57148c10 on master:/dev/sdy (status: attached)
>
>     vol-8ba835cc on master:/dev/sdv (status: attached)
>
>     vol-9253ced5 on master:/dev/sdw (status: attached)
>
> Cluster nodes:
>
>      master running i-609aa79c 52.0.250.221
>
>     node002 running i-f4d6470b 52.4.102.101
>
>     node014 running i-52d6b2ad 52.7.159.255
>
>     node016 running i-fb9ae804 54.88.226.88
>
>     node017 running i-b275084d 52.5.86.254
>
>     node020 running i-14532eeb 52.5.111.191
>
>     node021 running i-874b3678 54.165.179.93
>
>     node022 running i-5abfc2a5 54.85.47.151
>
>     node023 running i-529ee3ad 52.1.197.60
>
>     node024 running i-0792eff8 54.172.58.21
>
> Total nodes: 10
>
>
>
> ubuntu at ip-10-0-0-20:~$ /opt/venv/python2_venv/bin/python
> /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config
> sshmaster dragon-1.3.0
>
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>
> Software Tools for Academics and Researchers (STAR)
>
> Please submit bug reports to starcluster at mit.edu
>
>
>
> The authenticity of host '52.0.250.221 (52.0.250.221)' can't be
> established.
>
> ECDSA key fingerprint is e7:21:af:bf:2b:bf:c4:49:43:b8:dd:0b:aa:d3:81:a0.
>
> Are you sure you want to continue connecting (yes/no)? yes
>
> Warning: Permanently added '52.0.250.221' (ECDSA) to the list of known
> hosts.
>
>           _                 _           _
>
> __/\_____| |_ __ _ _ __ ___| |_   _ ___| |_ ___ _ __
>
> \    / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|
>
> /_  _\__ \ || (_| | | | (__| | |_| \__ \ ||  __/ |
>
>   \/ |___/\__\__,_|_|  \___|_|\__,_|___/\__\___|_|
>
>
>
> StarCluster Ubuntu 13.04 AMI
>
> Software Tools for Academics and Researchers (STAR)
>
> Homepage: http://star.mit.edu/cluster
>
> Documentation: http://star.mit.edu/cluster/docs/latest
>
> Code: https://github.com/jtriley/StarCluster
>
> Mailing list: http://star.mit.edu/cluster/mailinglist.html
>
>
>
> This AMI Contains:
>
>
>
>   * Open Grid Scheduler (OGS - formerly SGE) queuing system
>
>   * Condor workload management system
>
>   * OpenMPI compiled with Open Grid Scheduler support
>
>   * OpenBLAS - Highly optimized Basic Linear Algebra Routines
>
>   * NumPy/SciPy linked against OpenBlas
>
>   * Pandas - Data Analysis Library
>
>   * IPython 1.1.0 with parallel and notebook support
>
>   * Julia 0.3pre
>
>   * and more! (use 'dpkg -l' to show all installed packages)
>
>
>
> Open Grid Scheduler/Condor cheat sheet:
>
>
>
>   * qstat/condor_q - show status of batch jobs
>
>   * qhost/condor_status- show status of hosts, queues, and jobs
>
>   * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)
>
>   * qdel/condor_rm - delete batch jobs (e.g. qdel 7)
>
>   * qconf - configure Open Grid Scheduler system
>
>
>
> Current System Stats:
>
>
>
>   System load:  0.0                Processes:           226
>
>   Usage of /:   80.8% of 78.61GB   Users logged in:     2
>
>   Memory usage: 6%                 IP address for eth0: 10.0.0.213
>
>   Swap usage:   0%
>
>
>
>   => There are 2 zombie processes.
>
>
>
>     https://landscape.canonical.com/
>
> Last login: Sun Apr 26 04:50:46 2015 from
> c-24-60-255-35.hsd1.ma.comcast.net
>
> root at master:~#
>
>
>
>
>
>
>
> On Mon, Jun 8, 2015 at 11:10 AM David Koppstein <david.koppstein at gmail.com>
> wrote:
>
>  Hi,
>
>
>
> I noticed that my load balancer stopped working -- specifically, it has
> stopped deleting unnecessary nodes. It's been running fine for about three
> weeks.
>
>
>
> I have a small t2.micro instance load balancing a cluster of m3.xlarge
> instances. The cluster is running Ubuntu 14.04 using the shared 14.04 AMI
> ami-38b99850.
>
>
>
> The loadbalancer process is still running (started with nohup CMD &, where
> CMD is the loadbalancer command below):
>
>
>
> ```
>
> ubuntu at ip-10-0-0-20:~$ ps -ef | grep load
>
> ubuntu   11784 11730  0 15:04 pts/1    00:00:00 grep --color=auto load
>
> ubuntu   19493     1  0 Apr26 ?        01:25:03
> /opt/venv/python2_venv/bin/python /opt/venv/python2_venv/bin/starcluster -c
> /home/ubuntu/.starcluster/config loadbalance -n 1 -m 20 -w 300 dragon-1.3.0
>
> ```
>
>
>
> Queue has been empty for several days.
>
>
>
> ```
>
> dkoppstein at master:/dkoppstein/150521SG_v1.9_round2$ qstat -u "*"
>
> dkoppstein at master:/dkoppstein/150521SG_v1.9_round2$
>
> ```
>
>
>
> However, there are about 8 nodes that have been running over the weekend
> and are not being killed despite -n 1. If anyone has any guesses as to why
> the load balancer might stop working, please let me know so I can prevent
> this from happening in the future.
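
One low-cost safeguard, not discussed in the thread but perhaps worth noting:
redirect the balancer's output to a log file when starting it under nohup, so
whatever makes it stop acting leaves a trace that can be inspected afterwards.
For example, the same invocation as above with redirection added:

```
# Same loadbalance command as in the ps output above, with stdout/stderr
# written to a log file so a stall or crash can be diagnosed later.
nohup /opt/venv/python2_venv/bin/starcluster -c /home/ubuntu/.starcluster/config \
    loadbalance -n 1 -m 20 -w 300 dragon-1.3.0 > ~/loadbalance.log 2>&1 &
```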
>
>
>
> Thanks,
>
> David
>
>
>
>
>
>
>
>