<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Sorry Raj, I don't mean to seem "cranky
      that the ELB isn't meeting your predictive load balancing needs".
      It's clear that the released code doesn't do predictive load
      balancing; I'm just trying to get it working so that it can answer
      the question "Are there so many jobs that the queue would have
      items after 5 minutes?" (your master's thesis, figure 9).<br>
      <br>
      Thanks for your help so far, and of course for writing and
      releasing the code in the first place.  I do hope I can get this
      finished and that users will find it useful even when job
      durations have a high variance (I'm relying on having lots of jobs
      running to bring the variance under control).<br>
      <br>
      <br>
      Tony<br>
      <br>
      On 04/04/16 04:37, Rajat Banerjee wrote:<br>
    </div>
    <blockquote
cite="mid:CAAEsPud-m1XzswtSuqgaF2DKrzfdgMjJZA+MqTvBAE+ebR0ZCQ@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <div dir="ltr">Hi Tony,
        <div>Interesting findings. Yes, that does seem like a bug with
          polling_interval. All of the logic to make a load balancing
          decision is in _eval_add_node, and it doesn't consider
          average job duration. That was a design decision at the time:
          either look at the queue and current slot count to see if
          throughput meets the parameter <span>longest_allowed_queue_time</span>,
          or try to predict job durations. We surveyed
          some of the most active users at that time, and most said their
          job sizes were varying and unpredictable and not amenable to
          'prediction'. So we went with the former.</div>
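        <div>For comparison, here is a minimal sketch of what a
          duration-based check could look like. It is not StarCluster's
          code: the function names are hypothetical, it assumes every
          job takes the average duration, and the 300-second threshold
          is only the 5 minutes mentioned in this thread.</div>
<pre>
```python
import math


def estimated_drain_seconds(queued_jobs, slots, avg_job_seconds):
    """Rough time to empty the queue if every slot runs jobs back to back.

    Hypothetical helper: assumes all jobs take avg_job_seconds.
    """
    if slots == 0:
        # Nothing can run, so a non-empty queue never drains.
        return math.inf if queued_jobs else 0.0
    # Jobs are processed in waves of at most `slots` jobs each.
    waves = math.ceil(queued_jobs / slots)
    return waves * avg_job_seconds


def should_add_node(queued_jobs, slots, avg_job_seconds,
                    longest_allowed_queue_time=300):
    """True if the queue is predicted to outlast the allowed queue time."""
    drain = estimated_drain_seconds(queued_jobs, slots, avg_job_seconds)
    return drain > longest_allowed_queue_time


print(should_add_node(10, 4, 60))  # False: 3 waves of 60 s is 180 s
print(should_add_node(40, 4, 60))  # True: 10 waves of 60 s is 600 s
```
</pre>
        <div>With high-variance job durations the estimate is only as
          good as the average fed into it, which is why the correctness
          of the average matters so much in the rest of this thread.</div>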
        <div><br>
        </div>
        <div>Thanks for sharing your findings, but I'm not sure why you
          seem cranky that the ELB isn't meeting your predictive load
          balancing needs and has some bugs in unused code? It's open
          source for a reason, so people can adapt and improve it.</div>
        <div>best,</div>
        <div>Raj</div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Sun, Apr 3, 2016 at 8:41 AM, Tony
          Robinson <span dir="ltr">&lt;<a moz-do-not-send="true"
              href="mailto:tonyr@speechmatics.com" target="_blank">tonyr@speechmatics.com</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote">
            <div>
              <div>Okay, I've found another bug with the load balancer
                which explains why avg_job_duration() was getting
                shorter and shorter.<br>
                <br>
                get_qatime() initially loads the whole (3 hours)
                history, but after that sets  temp_lookback_window =
                self.polling_interval<br>
                <br>
                The problem is that self.polling_interval has to be
                much shorter than a job duration (the balancer has to
                keep up), and the -b option to qacct sets "The earliest
                start time for jobs to be summarized", so it only
                selects jobs that both started and finished within the
                last polling interval (they must finish to appear in
                qacct); these can only be the very short jobs. The cache
                is therefore populated quite reasonably at first but
                then only gets updated with very short jobs; the long
                ones never get into the cache.<br>
                <br>
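                <div>A toy illustration of this effect, using a made-up
                  workload rather than real qacct output. The function
                  names and the job mix are assumptions for the example;
                  the filter only mimics "started no earlier than the
                  lookback window, and already finished".</div>
<pre>
```python
def visible_jobs(jobs, now, lookback):
    """Jobs a query at `now` would see: started inside the window and finished."""
    earliest = now - lookback
    return [(s, e) for (s, e) in jobs if s >= earliest and now >= e]


def mean_duration(jobs):
    """Average run time in seconds of (start, end) pairs."""
    return sum(e - s for s, e in jobs) / len(jobs)


# Hypothetical workload: every 60 s one 10 s job and one 600 s job start.
jobs = []
for start in range(0, 3600, 60):
    jobs.append((start, start + 10))
    jobs.append((start, start + 600))

now = 3600
full = visible_jobs(jobs, now, lookback=3600)   # initial load of the history
recent = visible_jobs(jobs, now, lookback=60)   # later polls, window = polling interval

print(round(mean_duration(full), 1))
print(mean_duration(recent))  # 10.0
```
</pre>
                <div>In this example the full window averages roughly
                  281 s, while the 60 s window only ever sees the 10 s
                  job, so an average updated from it drifts towards the
                  short jobs.</div>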
                As I say below, I don't think any of this code is used
                anyway so it doesn't matter too much that it's all
                broken.<br>
                <br>
                I'll progress with my (weekend and part time) clean up
                and implementation of a true predictive load balancer. 
                I have both (a) mean and variance for all job types and
                (b) working code assuming that avg_job_duration() is
                correct, so it's probably only another day's work to get
                solid (or a month or two of elapsed time, I'm done for
                this weekend).<span class="HOEnZb"><br>
                  <br>
                  <br>
                  Tony</span>
                <div>
                  <div class="h5"><br>
                    <br>
                    On 01/04/16 17:01, Tony Robinson wrote:<br>
                  </div>
                </div>
              </div>
              <div>
                <div class="h5">
                  <blockquote type="cite">
                    <div>On 01/04/16 16:22, Rajat Banerjee wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>
                          <div>
                            <div>
                              <div>
                                <div>
                                  <div>Regarding:<br>
                                    How about we just call qacct every 5
                                    mins, or if the qacct buffer is
                                    empty. <br>
                                  </div>
                                  <div>Calling qacct and getting the job
                                    stats is the first part of the load
                                    balancer's loop to see what the
                                    cluster is up to. I prioritized
                                    knowing the current state, and
                                    keeping the LB running its loop as
                                    fast as possible (2-10 seconds), so
                                    it could run in a 1-minute loop and
                                    stay roughly on schedule. It's easy
                                    to run the whole LB loop with 5
                                    minutes between loops with the
                                    command line arg <span>polling_interval,
                                      if that suits your workload
                                      better. I do not mean to sound
                                      dismissive, but the command line
                                      options (with reasonable
                                      defaults) are there so you can test
                                      and tweak to your workload.<br>
                                    </span></div>
                                </div>
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <br>
                    Ah, I wasn't very clear.   What I mean is that we
                    only update the qacct stats every 5 minutes.   I run
                    the main loop every 30s.   <br>
                    <br>
                    But calling qacct doesn't take any time - we could
                    do it every polling interval:<br>
                    <br>
                    root@master:~# date<br>
                    Fri Apr  1 16:54:31 BST 2016<br>
                    root@master:~# echo qacct -j -b `date
                    +%y%m%d`$((`date +%H` - 3))`date +%m`<br>
                    qacct -j -b 1604011304<br>
                    root@master:~# time  qacct -j -b `date
                    +%y%m%d`$((`date +%H` - 3))`date +%m` | wc<br>
                      99506  224476 3307423<br>
                    <br>
                    real    0m0.588s<br>
                    user    0m0.560s<br>
                    sys    0m0.076s<br>
                    root@master:~# <br>
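                    <div>One caveat about the command in the transcript:
                      <tt>date +%m</tt> prints the month, not the minute
                      (that is <tt>%M</tt>), which is why the timestamp
                      ends in 04 although the clock read 16:54, and the
                      hour subtraction does not wrap across midnight. A
                      small sketch of computing the same ten-digit
                      timestamp in Python instead; the function name is
                      hypothetical and the 3-hour default just mirrors
                      the transcript.</div>
<pre>
```python
from datetime import datetime, timedelta


def qacct_begin_time(now, lookback_hours=3):
    """Format `now - lookback_hours` as YYMMDDhhmm, as used with qacct -b above."""
    begin = now - timedelta(hours=lookback_hours)
    # %M is minutes; %m would be the month.
    return begin.strftime("%y%m%d%H%M")


print(qacct_begin_time(datetime(2016, 4, 1, 16, 54)))  # 1604011354
print(qacct_begin_time(datetime(2016, 4, 1, 1, 30)))   # 1603312230
```
</pre>
                    <div>The second call shows the date rolling back to
                      the previous day when the window crosses midnight.</div>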
                    <br>
                    <br>
                    If calling qacct is slow then the update could be
                    run at the end of the loop so it would have all of
                    the loop wait time to complete in.<br>
                    <br>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>
                          <div>
                            <div>
                              <div>
                                <div>Regarding:<br>
                                </div>
                                Three sorts of jobs, all of which should
                                occur in the same numbers,<br>
                              </div>
                              Have you tried testing your call to qacct
                              to see if it's returning what you want?
                              You could modify it in your source if it's
                              not representative of your jobs:<br>
                              <a moz-do-not-send="true"
href="https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L528"
                                target="_blank">https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L528</a><br>
                              qacct_cmd <span>=</span> <span><span>'</span><tt>qacct
                                  -j -b </tt><span>'</span></span> <span>+</span>
                              qatime<br>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <br>
                    Yes, thanks, I'm comparing to running qacct outside
                    of the load balancer.<br>
                    <br>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>
                          <div>
                            <div>Obviously one size doesn't fit all
                              here, but if you find a set of args for
                              qacct that work better for you, let me
                              know.<br>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <br>
                    At the moment I don't think that the output of qacct
                    is used at all, is it?   I thought it was only used
                    to give job stats; I don't think it's really used to
                    bring nodes up/down.<br>
                    <br>
                    <br>
                    Tony<br>
                    <br>
                    <div>-- <br>
                      Speechmatics is a trading name of Cantab Research
                      Limited<br>
                      We are hiring: <a moz-do-not-send="true"
                        href="https://www.speechmatics.com/careers"
                        target="_blank">www.speechmatics.com/careers</a><br>
                      Dr A J Robinson, Founder, Cantab Research Ltd<br>
                      Phone direct: 01223 794096, office: 01223 794497<br>
                      Company reg no GB 05697423, VAT reg no 925606030<br>
                      51 Canterbury Street, Cambridge, CB4 3QG, UK<br>
                    </div>
                  </blockquote>
                  <br>
                  <br>
                </div>
              </div>
            </div>
            <br>
            _______________________________________________<br>
            StarCluster mailing list<br>
            <a moz-do-not-send="true" href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
            <a moz-do-not-send="true"
              href="http://mailman.mit.edu/mailman/listinfo/starcluster"
              rel="noreferrer" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
    <br>
    <div class="moz-signature">-- <br>
      Speechmatics is a trading name of Cantab Research Limited<br>
      We are hiring: <a href="https://www.speechmatics.com/careers">www.speechmatics.com/careers</a><br>
      Dr A J Robinson, Founder, Cantab Research Ltd<br>
      Phone direct: 01223 794096, office: 01223 794497<br>
      Company reg no GB 05697423, VAT reg no 925606030<br>
      51 Canterbury Street, Cambridge, CB4 3QG, UK</div>
  </body>
</html>