[StarCluster] loadbalance

Rajat Banerjee rajatb at post.harvard.edu
Wed Sep 18 15:01:25 EDT 2013


That looks normal. Can you please send qstat, qacct, and qhost output from
when you're seeing the problem? Thanks.
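
For example, capturing all three on the master node could be as simple as the
following (the file names are arbitrary, and this assumes the stock SGE client
tools that ship on the StarCluster AMIs):

ec2-user@master:~$ qstat -u '*' > qstat.txt   # running and pending jobs for every user
ec2-user@master:~$ qhost > qhost.txt          # per-host load and memory figures
ec2-user@master:~$ qacct -j > qacct.txt       # accounting records for all finished jobs

Sending those as plain-text attachments keeps the column layout intact.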


On Wed, Sep 18, 2013 at 2:47 PM, Ryan Golhar <ngsbioinformatics at gmail.com> wrote:

> I've since terminated the cluster and am experimenting with a different
> setup, but here's the output from qstat and qhost:
>
> ec2-user@master:~$ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>       4 0.55500 j1-00493-0 ec2-user     r     09/18/2013 17:38:44 all.q@node001                      8
>       6 0.55500 j1-00508-0 ec2-user     r     09/18/2013 17:45:44 all.q@node002                      8
>       7 0.55500 j1-00525-0 ec2-user     r     09/18/2013 17:46:29 all.q@node003                      8
>       8 0.55500 j1-00541-0 ec2-user     r     09/18/2013 17:54:59 all.q@node004                      8
>       9 0.55500 j1-00565-0 ec2-user     r     09/18/2013 17:55:44 all.q@node005                      8
>      10 0.55500 j1-00596-0 ec2-user     r     09/18/2013 17:58:59 all.q@node006                      8
>      11 0.55500 j1-00604-0 ec2-user     r     09/18/2013 18:05:14 all.q@node007                      8
>      12 0.55500 j1-00625-0 ec2-user     r     09/18/2013 18:05:14 all.q@node008                      8
>      13 0.55500 j1-00650-0 ec2-user     r     09/18/2013 18:05:14 all.q@node009                      8
>      18 0.55500 j1-00734-0 ec2-user     r     09/18/2013 18:07:29 all.q@node010                      8
>      19 0.55500 j1-00738-0 ec2-user     r     09/18/2013 18:16:59 all.q@node011                      8
>      20 0.55500 j1-00739-0 ec2-user     r     09/18/2013 18:16:59 all.q@node012                      8
>      21 0.55500 j1-00770   ec2-user     r     09/18/2013 18:16:59 all.q@node013                      8
>      22 0.55500 j1-00806-0 ec2-user     r     09/18/2013 18:16:59 all.q@node014                      8
>      23 0.55500 j1-00825-0 ec2-user     r     09/18/2013 18:16:59 all.q@node015                      8
>      24 0.55500 j1-00826-0 ec2-user     r     09/18/2013 18:16:59 all.q@node016                      8
>      25 0.55500 j1-00846-0 ec2-user     r     09/18/2013 18:16:59 all.q@node017                      8
>      26 0.55500 j1-00847-0 ec2-user     r     09/18/2013 18:16:59 all.q@node018                      8
>      27 0.55500 j1-00913   ec2-user     r     09/18/2013 18:16:59 all.q@node019                      8
>      28 0.55500 j1-00914-0 ec2-user     r     09/18/2013 18:16:59 all.q@node020                      8
>      29 0.55500 j1-00914   ec2-user     r     09/18/2013 18:26:29 all.q@node021                      8
>      30 0.55500 j1-00922   ec2-user     r     09/18/2013 18:26:29 all.q@node022                      8
>      31 0.55500 j1-00977   ec2-user     r     09/18/2013 18:26:29 all.q@node023                      8
>      32 0.55500 j1-00984-0 ec2-user     r     09/18/2013 18:26:29 all.q@node024                      8
>      33 0.55500 j1-00984   ec2-user     r     09/18/2013 18:26:29 all.q@node025                      8
>      34 0.55500 j1-00998-0 ec2-user     r     09/18/2013 18:26:29 all.q@node026                      8
>      35 0.55500 j1-01010-0 ec2-user     r     09/18/2013 18:26:29 all.q@node027                      8
>      36 0.55500 j1-01019-0 ec2-user     r     09/18/2013 18:26:29 all.q@node028                      8
>      37 0.55500 j1-01025-0 ec2-user     r     09/18/2013 18:26:29 all.q@node029                      8
>      38 0.55500 j1-01026-0 ec2-user     r     09/18/2013 18:26:29 all.q@node030                      8
>
> ec2-user@master:~$ qhost
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> node001                 linux-x64       8  7.74    6.8G    3.8G     0.0     0.0
> node002                 linux-x64       8  7.93    6.8G    3.7G     0.0     0.0
> node003                 linux-x64       8  7.68    6.8G    3.7G     0.0     0.0
> node004                 linux-x64       8  7.86    6.8G    3.8G     0.0     0.0
> node005                 linux-x64       8  7.87    6.8G    3.7G     0.0     0.0
> node006                 linux-x64       8  7.66    6.8G    3.7G     0.0     0.0
> node007                 linux-x64       8  0.01    6.8G  564.8M     0.0     0.0
> node008                 linux-x64       8  0.01    6.8G  493.6M     0.0     0.0
> node009                 linux-x64       8  0.02    6.8G  564.4M     0.0     0.0
> node010                 linux-x64       8  7.85    6.8G    3.7G     0.0     0.0
> node011                 linux-x64       8  7.53    6.8G    3.7G     0.0     0.0
> node012                 linux-x64       8  7.57    6.8G    3.6G     0.0     0.0
> node013                 linux-x64       8  7.71    6.8G    3.7G     0.0     0.0
> node014                 linux-x64       8  7.49    6.8G    3.7G     0.0     0.0
> node015                 linux-x64       8  7.51    6.8G    3.7G     0.0     0.0
> node016                 linux-x64       8  7.50    6.8G    3.6G     0.0     0.0
> node017                 linux-x64       8  7.89    6.8G    3.7G     0.0     0.0
> node018                 linux-x64       8  7.50    6.8G    3.7G     0.0     0.0
> node019                 linux-x64       8  7.52    6.8G    3.7G     0.0     0.0
> node020                 linux-x64       8  7.68    6.8G    3.6G     0.0     0.0
> node021                 linux-x64       8  7.16    6.8G    3.6G     0.0     0.0
> node022                 linux-x64       8  6.99    6.8G    3.6G     0.0     0.0
> node023                 linux-x64       8  6.80    6.8G    3.6G     0.0     0.0
> node024                 linux-x64       8  7.20    6.8G    3.6G     0.0     0.0
> node025                 linux-x64       8  6.86    6.8G    3.6G     0.0     0.0
> node026                 linux-x64       8  7.24    6.8G    3.6G     0.0     0.0
> node027                 linux-x64       8  6.88    6.8G    3.7G     0.0     0.0
> node028                 linux-x64       8  6.28    6.8G    3.6G     0.0     0.0
> node029                 linux-x64       8  7.42    6.8G    3.6G     0.0     0.0
> node030                 linux-x64       8  0.10    6.8G  390.4M     0.0     0.0
> node031                 linux-x64       8  0.06    6.8G  135.0M     0.0     0.0
> node032                 linux-x64       8  0.04    6.8G  135.3M     0.0     0.0
> node033                 linux-x64       8  0.07    6.8G  135.6M     0.0     0.0
> node034                 linux-x64       8  0.10    6.8G  134.9M     0.0     0.0
>
>
> I never saw anything unusual.
>
>
> On Wed, Sep 18, 2013 at 10:40 AM, Rajat Banerjee <rajatb at post.harvard.edu> wrote:
>
>> Ryan,
>> Could you put the output of qhost and qstat into a text file and send it
>> back to the list? That's what feeds the load balancer those stats.
>>
>> Thanks,
>> Rajat
>>
>>
>> On Tue, Sep 17, 2013 at 11:47 PM, Ryan Golhar <ngsbioinformatics at gmail.com> wrote:
>>
>>> I'm running a cluster with over 800 jobs queued, and I'm running
>>> loadbalance.  Every other query by loadbalance shows an Avg job duration
>>> and wait time of 0 secs.  Why is this?  It hasn't caused a problem yet,
>>> but it seems odd.
>>>
>>> >>> Loading full job history
>>> Execution hosts: 19
>>> Queued jobs: 791
>>> Oldest queued job: 2013-09-17 22:19:23
>>> Avg job duration: 3559 secs
>>> Avg job wait time: 12389 secs
>>> Last cluster modification time: 2013-09-18 00:11:31
>>> >>> Not adding nodes: already at or above maximum (1)
>>> >>> Sleeping...(looping again in 60 secs)
>>>
>>> Execution hosts: 19
>>> Queued jobs: 791
>>> Oldest queued job: 2013-09-17 22:19:23
>>> Avg job duration: 0 secs
>>> Avg job wait time: 0 secs
>>> Last cluster modification time: 2013-09-18 00:11:31
>>> >>> Not adding nodes: already at or above maximum (1)
>>> >>> Sleeping...(looping again in 60 secs)
>>>
>>>
>>>
>>>
>>>
>>
>
>
>