[StarCluster] loadbalance

Ryan Golhar ngsbioinformatics at gmail.com
Thu Sep 19 01:51:42 EDT 2013


It's happening again.

Output from qstat (truncated):

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   1211 0.55500 j1-00596-0 ec2-user     r     09/19/2013 05:27:35 all.q@node003                      8
   1212 0.55500 j1-00604-0 ec2-user     r     09/19/2013 05:27:35 all.q@node005                      8
   1214 0.55500 j1-00650-0 ec2-user     r     09/19/2013 05:27:35 all.q@node002                      8
   1215 0.55500 j1-00984-0 ec2-user     r     09/19/2013 05:27:35 all.q@node009                      8
   1216 0.55500 j1-01025-0 ec2-user     r     09/19/2013 05:27:35 all.q@node017                      8
   1217 0.55500 j1-01026-0 ec2-user     r     09/19/2013 05:27:35 all.q@node010                      8
   1218 0.55500 j1-01026   ec2-user     r     09/19/2013 05:27:35 all.q@node006                      8
   1219 0.55500 j1-01053   ec2-user     r     09/19/2013 05:27:35 all.q@node007                      8
   1220 0.55500 j1-01106-0 ec2-user     r     09/19/2013 05:27:35 all.q@node012                      8
   1221 0.55500 j1-01119-0 ec2-user     r     09/19/2013 05:27:35 all.q@node016                      8
   1222 0.55500 j1-01175   ec2-user     r     09/19/2013 05:27:35 all.q@node020                      8
   1223 0.55500 j1-01178   ec2-user     r     09/19/2013 05:27:35 all.q@node018                      8
   1224 0.55500 j1-01184-0 ec2-user     r     09/19/2013 05:27:35 all.q@node019                      8
   1225 0.55500 j1-01184-0 ec2-user     r     09/19/2013 05:27:35 all.q@node015                      8
   1226 0.55500 j1-01184   ec2-user     r     09/19/2013 05:27:35 all.q@node014                      8
   1227 0.55500 j1-01190-0 ec2-user     r     09/19/2013 05:27:35 all.q@node011                      8
   1228 0.55500 j1-01190-0 ec2-user     r     09/19/2013 05:27:35 all.q@master                       8
   1229 0.55500 j1-01190   ec2-user     r     09/19/2013 05:27:35 all.q@node013                      8
   1230 0.55500 j1-01244-0 ec2-user     r     09/19/2013 05:27:35 all.q@node008                      8
   1231 0.55500 j1-01244-0 ec2-user     r     09/19/2013 05:27:35 all.q@node001                      8
   1232 0.55500 j1-01244   ec2-user     r     09/19/2013 05:45:05 all.q@node004                      8
   1233 0.55500 j1-01260-0 ec2-user     qw    09/19/2013 05:27:28                                    8
   1234 0.55500 j1-01260-0 ec2-user     qw    09/19/2013 05:27:28                                    8
   1235 0.55500 j1-01260   ec2-user     qw    09/19/2013 05:27:28                                    8
   1236 0.55500 j1-01265-0 ec2-user     qw    09/19/2013 05:27:28                                    8
   1237 0.55500 j1-01265-0 ec2-user     qw    09/19/2013 05:27:28                                    8
   1238 0.55500 j1-01265   ec2-user     qw    09/19/2013 05:27:28                                    8
   1239 0.55500 j1-01272-0 ec2-user     qw    09/19/2013 05:27:28                                    8

qacct:
Total System Usage
    WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
================================================================================================================
        20647     12987.584      6567.578     25492.947          35430.706           1771.872              0.000
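
The loadbalance output says it is "Loading full job history", so presumably the avg job duration / wait time figures come from per-job accounting records rather than this summary. For reference, here is a rough sketch of that kind of calculation against `qacct -j` output (just my own illustration, not the actual StarCluster code):

# Rough sketch (not the StarCluster implementation): derive avg job
# duration and wait time from per-job SGE accounting records.
import subprocess
from datetime import datetime

def parse_qacct_jobs(text):
    """Split `qacct -j` output into per-job dicts of field -> value."""
    jobs, current = [], {}
    for line in text.splitlines():
        if line.startswith("="):              # record separator
            if current:
                jobs.append(current)
            current = {}
        elif line.strip():
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    if current:
        jobs.append(current)
    return jobs

def average_stats(jobs):
    fmt = "%a %b %d %H:%M:%S %Y"              # e.g. "Thu Sep 19 05:27:35 2013"
    durations, waits = [], []
    for job in jobs:
        try:
            qsub  = datetime.strptime(job["qsub_time"], fmt)
            start = datetime.strptime(job["start_time"], fmt)
            end   = datetime.strptime(job["end_time"], fmt)
        except (KeyError, ValueError):
            continue                          # skip jobs that never started
        durations.append((end - start).total_seconds())
        waits.append((start - qsub).total_seconds())
    if not durations:
        return 0, 0
    return sum(durations) / len(durations), sum(waits) / len(waits)

out = subprocess.check_output(["qacct", "-j"]).decode()
avg_duration, avg_wait = average_stats(parse_qacct_jobs(out))
print("Avg job duration: %d secs" % avg_duration)
print("Avg job wait time: %d secs" % avg_wait)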

qhost:
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  linux-x64       8 10.76    6.8G    3.8G     0.0     0.0
node001                 linux-x64       8  0.03    6.8G  537.4M     0.0     0.0
node002                 linux-x64       8  0.07    6.8G  539.7M     0.0     0.0
node003                 linux-x64       8  4.27    6.8G    3.7G     0.0     0.0
node004                 linux-x64       8  3.52    6.8G  283.0M     0.0     0.0
node005                 linux-x64       8  4.36    6.8G    3.7G     0.0     0.0
node006                 linux-x64       8  0.04    6.8G  642.3M     0.0     0.0
node007                 linux-x64       8  0.12    6.8G  468.3M     0.0     0.0
node008                 linux-x64       8  4.70    6.8G    3.8G     0.0     0.0
node009                 linux-x64       8  0.04    6.8G  607.2M     0.0     0.0
node010                 linux-x64       8  4.31    6.8G    3.7G     0.0     0.0
node011                 linux-x64       8  3.96    6.8G    3.6G     0.0     0.0
node012                 linux-x64       8  1.61    6.8G  280.4M     0.0     0.0
node013                 linux-x64       8  1.31    6.8G  582.7M     0.0     0.0
node014                 linux-x64       8  1.27    6.8G  375.2M     0.0     0.0
node015                 linux-x64       8  1.19    6.8G  996.4M     0.0     0.0
node016                 linux-x64       8  1.43    6.8G  349.8M     0.0     0.0
node017                 linux-x64       8  1.40    6.8G  567.0M     0.0     0.0
node018                 linux-x64       8  1.36    6.8G  262.4M     0.0     0.0
node019                 linux-x64       8  1.43    6.8G  278.5M     0.0     0.0
node020                 linux-x64       8  1.36    6.8G  402.6M     0.0     0.0
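
The symptom is the same as before: qstat reports a job in state 'r' on every node, yet several of those hosts (node001, node002, node006, node007, node009 above) show a load of nearly zero in qhost. A small hypothetical sketch of how one could flag those hosts straight from qhost output (not part of StarCluster, just an illustration):

# Hypothetical helper (not StarCluster code): flag hosts that qhost reports
# as nearly idle, so they can be checked against the jobs qstat says are
# running there.
import subprocess

IDLE_LOAD = 0.5          # assumed threshold for "this host is doing nothing"

def idle_hosts(qhost_text):
    """Yield (hostname, load) for hosts whose reported load is below IDLE_LOAD."""
    for line in qhost_text.splitlines()[2:]:   # skip the two header lines
        fields = line.split()
        # Expect: HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
        if len(fields) < 4 or fields[3] == "-":
            continue                           # also skips the 'global' pseudo-host
        if float(fields[3]) < IDLE_LOAD:
            yield fields[0], float(fields[3])

out = subprocess.check_output(["qhost"]).decode()
for host, load in idle_hosts(out):
    print("%s load is only %.2f -- is its job still alive?" % (host, load))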



On Wed, Sep 18, 2013 at 3:01 PM, Rajat Banerjee <rajatb at post.harvard.edu> wrote:

> That looks normal. Can you please send qstat, qacct, and qhost output from
> when you're seeing the problem? Thanks.
>
>
> On Wed, Sep 18, 2013 at 2:47 PM, Ryan Golhar <ngsbioinformatics at gmail.com> wrote:
>
>> I've since terminated the cluster and am experimenting with a different
>> setup, but here's the output from qstat and qhost:
>>
>> ec2-user@master:~$ qstat
>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>       4 0.55500 j1-00493-0 ec2-user     r     09/18/2013 17:38:44 all.q@node001                      8
>>       6 0.55500 j1-00508-0 ec2-user     r     09/18/2013 17:45:44 all.q@node002                      8
>>       7 0.55500 j1-00525-0 ec2-user     r     09/18/2013 17:46:29 all.q@node003                      8
>>       8 0.55500 j1-00541-0 ec2-user     r     09/18/2013 17:54:59 all.q@node004                      8
>>       9 0.55500 j1-00565-0 ec2-user     r     09/18/2013 17:55:44 all.q@node005                      8
>>      10 0.55500 j1-00596-0 ec2-user     r     09/18/2013 17:58:59 all.q@node006                      8
>>      11 0.55500 j1-00604-0 ec2-user     r     09/18/2013 18:05:14 all.q@node007                      8
>>      12 0.55500 j1-00625-0 ec2-user     r     09/18/2013 18:05:14 all.q@node008                      8
>>      13 0.55500 j1-00650-0 ec2-user     r     09/18/2013 18:05:14 all.q@node009                      8
>>      18 0.55500 j1-00734-0 ec2-user     r     09/18/2013 18:07:29 all.q@node010                      8
>>      19 0.55500 j1-00738-0 ec2-user     r     09/18/2013 18:16:59 all.q@node011                      8
>>      20 0.55500 j1-00739-0 ec2-user     r     09/18/2013 18:16:59 all.q@node012                      8
>>      21 0.55500 j1-00770   ec2-user     r     09/18/2013 18:16:59 all.q@node013                      8
>>      22 0.55500 j1-00806-0 ec2-user     r     09/18/2013 18:16:59 all.q@node014                      8
>>      23 0.55500 j1-00825-0 ec2-user     r     09/18/2013 18:16:59 all.q@node015                      8
>>      24 0.55500 j1-00826-0 ec2-user     r     09/18/2013 18:16:59 all.q@node016                      8
>>      25 0.55500 j1-00846-0 ec2-user     r     09/18/2013 18:16:59 all.q@node017                      8
>>      26 0.55500 j1-00847-0 ec2-user     r     09/18/2013 18:16:59 all.q@node018                      8
>>      27 0.55500 j1-00913   ec2-user     r     09/18/2013 18:16:59 all.q@node019                      8
>>      28 0.55500 j1-00914-0 ec2-user     r     09/18/2013 18:16:59 all.q@node020                      8
>>      29 0.55500 j1-00914   ec2-user     r     09/18/2013 18:26:29 all.q@node021                      8
>>      30 0.55500 j1-00922   ec2-user     r     09/18/2013 18:26:29 all.q@node022                      8
>>      31 0.55500 j1-00977   ec2-user     r     09/18/2013 18:26:29 all.q@node023                      8
>>      32 0.55500 j1-00984-0 ec2-user     r     09/18/2013 18:26:29 all.q@node024                      8
>>      33 0.55500 j1-00984   ec2-user     r     09/18/2013 18:26:29 all.q@node025                      8
>>      34 0.55500 j1-00998-0 ec2-user     r     09/18/2013 18:26:29 all.q@node026                      8
>>      35 0.55500 j1-01010-0 ec2-user     r     09/18/2013 18:26:29 all.q@node027                      8
>>      36 0.55500 j1-01019-0 ec2-user     r     09/18/2013 18:26:29 all.q@node028                      8
>>      37 0.55500 j1-01025-0 ec2-user     r     09/18/2013 18:26:29 all.q@node029                      8
>>      38 0.55500 j1-01026-0 ec2-user     r     09/18/2013 18:26:29 all.q@node030                      8
>>
>> ec2-user@master:~$ qhost
>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global                  -               -     -       -       -       -       -
>> node001                 linux-x64       8  7.74    6.8G    3.8G     0.0     0.0
>> node002                 linux-x64       8  7.93    6.8G    3.7G     0.0     0.0
>> node003                 linux-x64       8  7.68    6.8G    3.7G     0.0     0.0
>> node004                 linux-x64       8  7.86    6.8G    3.8G     0.0     0.0
>> node005                 linux-x64       8  7.87    6.8G    3.7G     0.0     0.0
>> node006                 linux-x64       8  7.66    6.8G    3.7G     0.0     0.0
>> node007                 linux-x64       8  0.01    6.8G  564.8M     0.0     0.0
>> node008                 linux-x64       8  0.01    6.8G  493.6M     0.0     0.0
>> node009                 linux-x64       8  0.02    6.8G  564.4M     0.0     0.0
>> node010                 linux-x64       8  7.85    6.8G    3.7G     0.0     0.0
>> node011                 linux-x64       8  7.53    6.8G    3.7G     0.0     0.0
>> node012                 linux-x64       8  7.57    6.8G    3.6G     0.0     0.0
>> node013                 linux-x64       8  7.71    6.8G    3.7G     0.0     0.0
>> node014                 linux-x64       8  7.49    6.8G    3.7G     0.0     0.0
>> node015                 linux-x64       8  7.51    6.8G    3.7G     0.0     0.0
>> node016                 linux-x64       8  7.50    6.8G    3.6G     0.0     0.0
>> node017                 linux-x64       8  7.89    6.8G    3.7G     0.0     0.0
>> node018                 linux-x64       8  7.50    6.8G    3.7G     0.0     0.0
>> node019                 linux-x64       8  7.52    6.8G    3.7G     0.0     0.0
>> node020                 linux-x64       8  7.68    6.8G    3.6G     0.0     0.0
>> node021                 linux-x64       8  7.16    6.8G    3.6G     0.0     0.0
>> node022                 linux-x64       8  6.99    6.8G    3.6G     0.0     0.0
>> node023                 linux-x64       8  6.80    6.8G    3.6G     0.0     0.0
>> node024                 linux-x64       8  7.20    6.8G    3.6G     0.0     0.0
>> node025                 linux-x64       8  6.86    6.8G    3.6G     0.0     0.0
>> node026                 linux-x64       8  7.24    6.8G    3.6G     0.0     0.0
>> node027                 linux-x64       8  6.88    6.8G    3.7G     0.0     0.0
>> node028                 linux-x64       8  6.28    6.8G    3.6G     0.0     0.0
>> node029                 linux-x64       8  7.42    6.8G    3.6G     0.0     0.0
>> node030                 linux-x64       8  0.10    6.8G  390.4M     0.0     0.0
>> node031                 linux-x64       8  0.06    6.8G  135.0M     0.0     0.0
>> node032                 linux-x64       8  0.04    6.8G  135.3M     0.0     0.0
>> node033                 linux-x64       8  0.07    6.8G  135.6M     0.0     0.0
>> node034                 linux-x64       8  0.10    6.8G  134.9M     0.0     0.0
>>
>>
>> I never saw anything unusual.
>>
>>
>> On Wed, Sep 18, 2013 at 10:40 AM, Rajat Banerjee <rajatb at post.harvard.edu> wrote:
>>
>>> Ryan,
>>> Could you put the output of qhost and qstat into a text file and send it
>>> back to the list? That's what feeds the load balancer those stats.
>>>
>>> Thanks,
>>> Rajat
>>>
>>>
>>> On Tue, Sep 17, 2013 at 11:47 PM, Ryan Golhar <ngsbioinformatics at gmail.com> wrote:
>>>
>>>> I'm running a cluster with over 800 jobs queued, and I'm running
>>>> loadbalance. Every other query by loadbalance shows an avg job duration and
>>>> wait time of 0 secs. Why is this? It hasn't caused a problem yet, but it
>>>> seems odd.
>>>>
>>>> >>> Loading full job history
>>>> Execution hosts: 19
>>>> Queued jobs: 791
>>>> Oldest queued job: 2013-09-17 22:19:23
>>>> Avg job duration: 3559 secs
>>>> Avg job wait time: 12389 secs
>>>> Last cluster modification time: 2013-09-18 00:11:31
>>>> >>> Not adding nodes: already at or above maximum (1)
>>>> >>> Sleeping...(looping again in 60 secs)
>>>>
>>>> Execution hosts: 19
>>>> Queued jobs: 791
>>>> Oldest queued job: 2013-09-17 22:19:23
>>>> Avg job duration: 0 secs
>>>> Avg job wait time: 0 secs
>>>> Last cluster modification time: 2013-09-18 00:11:31
>>>> >>> Not adding nodes: already at or above maximum (1)
>>>> >>> Sleeping...(looping again in 60 secs)
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>