<div dir="ltr"><div>I see what you are saying but did not test the case with a range of taskid's , so I did not see the problem you mentioned. The cache may have been a premature optimization to avoid doing large pulls from jobstat once every 30-60 seconds. When Justin and I were designing it, it seemed wise to cache some amount of SGE's output instead of doing a full pull every time, and it got very slow when there were >100 jobs.<br><br></div>Raj<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 26, 2016 at 3:29 PM, Tony Robinson <span dir="ltr"><<a href="mailto:tonyr@speechmatics.com" target="_blank">tonyr@speechmatics.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Following up (and hopefully not talking
to myself), I've found at least one problem with jobstats[]. The
code says:<br>
<br>
<pre>
if l.find('jobnumber') != -1:
    job_id = int(l[13:len(l)])
...
hash = {'jobname': jobname, 'queued': qd, 'start': start, 'end': end}
self.jobstats[job_id % self.jobstat_cachesize] = hash
</pre>
<br>
So it doesn't take into account array jobs, which have a range of
taskids - it just counts one instance for the whole array.<br>
<br>
That explains why the estimated job duration is wrong. The most
obvious solution is just to get rid of the cache. If compute time is
a problem, cache only ru_wallclock, as at the moment most of the time
is spent converting time formats.<br>
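<br>
For illustration, here's a rough sketch of a cache-free parse keyed on
(jobnumber, taskid) so array tasks each get counted - this is not the
actual starcluster code, and I'm assuming qacct-style "name value"
lines with taskid reading "undefined" for non-array jobs:<br>
<pre>
# Sketch: key each qacct record on (jobnumber, taskid) so every array
# task gets its own entry instead of overwriting the previous one.
def parse_qacct(output):
    stats = {}
    rec = {}
    for l in output.split('\n'):
        l = l.strip()
        if l.startswith('====='):       # record separator in qacct output
            if 'jobnumber' in rec:
                stats[(rec['jobnumber'], rec.get('taskid', 0))] = rec
            rec = {}
        elif l.startswith('jobnumber'):
            rec['jobnumber'] = int(l.split()[1])
        elif l.startswith('taskid'):
            v = l.split()[1]
            rec['taskid'] = 0 if v == 'undefined' else int(v)
        elif l.startswith('ru_wallclock'):
            rec['ru_wallclock'] = float(l.split()[1])   # already in seconds
    if 'jobnumber' in rec:
        stats[(rec['jobnumber'], rec.get('taskid', 0))] = rec
    return stats
</pre>
With every task keyed separately, avg_job_duration() gets one sample
per array task rather than one per array job.<br>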
<br>
I'm also working on a gridEngine scheduler that works well with
starcluster: it keeps the most recently booted nodes (and the master)
the most heavily loaded, so that near the end of the hour there are
free nodes you can take down. I've just got this going. Next I'd
like to distribute the load evenly across all nodes that are up
(these are vCPUs, and lightly loaded nodes run much faster), unless
they are near the end of the hour, in which case make sure the ones
nearest the end are empty. I'm happy to go into details, but I fear
there aren't many users of starcluster who really care about getting
short-running jobs scheduled efficiently (or the above bug would have
been fixed), so I'm talking to myself.<br>
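<br>
Roughly, the node ordering I have in mind is something like this -
just a sketch, with launch_time and the drain window standing in for
whatever the node objects and config actually provide:<br>
<pre>
# Sketch: fill the most recently booted nodes first, and stop placing
# work on nodes that are close to their hourly billing boundary.
# node.launch_time is a placeholder for the real attribute.
import datetime

def uptime_minutes(node, now=None):
    now = now or datetime.datetime.utcnow()
    return (now - node.launch_time).total_seconds() / 60.0

def placement_order(nodes, drain_window=10):
    """Order nodes for job placement: nodes within drain_window minutes
    of their next hourly boundary go last (so they drain and can be
    taken down); otherwise the most recently booted nodes go first."""
    def key(node):
        up = uptime_minutes(node)
        near_end = (up % 60.0) >= (60.0 - drain_window)
        return (near_end, up)    # False sorts before True; newest first
    return sorted(nodes, key=key)
</pre>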
<br>
<br>
Tony<div><div class="h5"><br>
<br>
On 25/03/16 19:56, Tony Robinson wrote:<br>
</div></div></div><div><div class="h5">
<blockquote type="cite">
<div>Hi Rajat,<br>
<br>
The main issue that I have with the load balancer is that sometimes
bringing up or taking down a node fails, and this causes the
loadbalancer to fall over. This is almost certainly an issue with
boto - I just haven't looked into it enough.<br>
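<br>
A band-aid I've been considering (only a sketch - the add/remove node
calls and the choice of boto exception to catch are placeholders) is
to wrap the grow/shrink calls so a failed EC2 request gets logged and
retried on the next polling cycle rather than killing the balancer:<br>
<pre>
# Sketch: don't let one failed EC2 call crash the whole balancer loop.
import logging
from boto.exception import BotoServerError

log = logging.getLogger("loadbalancer")

def safe_call(action, *args, **kwargs):
    """Run an add-node/remove-node action; on an EC2-side error, log it
    and let the next polling cycle retry instead of falling over."""
    try:
        return action(*args, **kwargs)
    except BotoServerError as e:
        log.warning("EC2 call failed, will retry next cycle: %s", e)
        return None
</pre>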
<br>
I'm working on the loadbalancer right now. I'm running a few
different sorts of jobs: some take half a minute, some take five
minutes. It takes me about five minutes to bring a node up, so load
balancing is quite a hard task, and certainly what's there at the
moment isn't optimal.<br>
<br>
In your master's thesis you had a go at anticipating the future load
based on the queue, although I see no trace of this in the current
code. What seems like the most obvious approach to me is to look at
what's running and what's in the queue and check whether it will all
complete within some specified period. If it will, then fine; if
not, assume you are going to bring n nodes up (starting at n=1),
check again whether it will all complete, and if not increment n.<br>
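<br>
In code, roughly this - a sketch only, where the per-job durations
would come from avg_job_duration() and I'm assuming perfect packing
of jobs onto slots:<br>
<pre>
# Sketch of the "increment n until everything finishes in time" idea.
def nodes_needed(queued_secs, running_secs, current_nodes,
                 slots_per_node, lookahead_secs, max_extra=20):
    """Smallest number of extra nodes so that all queued plus running
    work should complete within lookahead_secs (optimistic packing)."""
    work = sum(queued_secs) + sum(running_secs)   # seconds of work left
    for extra in range(max_extra + 1):
        slots = (current_nodes + extra) * slots_per_node
        if slots and work / float(slots) <= lookahead_secs:
            return extra
    return max_extra
</pre>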
<br>
I've got a version of this running, but it isn't complete because
avg_job_duration() consistently under-reports. I'm doing some
debugging, and it seems that jobstats[] has a bug: I have three types
of job (a start, a middle and an end), and as they are all run in
sequence, jobstats[] should have equal numbers of each. It doesn't.<br>
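<br>
(The check itself is nothing more than counting jobstats entries per
jobname - sketch below, assuming jobstats is a list of per-job dicts
with unused slots left empty:)<br>
<pre>
# Sketch: with start/middle/end jobs run in sequence, the per-jobname
# counts in jobstats should come out equal.
from collections import Counter

def jobname_counts(jobstats):
    return Counter(j['jobname'] for j in jobstats if j)
</pre>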
<br>
This is a weekend activity for me (with unreliable time). If you or
anyone else wants to help with:<br>
<br>
a) getting avg_job_duration() working, which probably means fixing
jobstats[]<br>
b) getting a clean simple predictive load balancer working<br>
<br>
then please contact me.<br>
<br>
<br>
Tony<br>
<br>
On 25/03/16 17:17, Rajat Banerjee wrote:<br>
</div>
<blockquote type="cite">I'll fix any issues with the load balancer if they
come up.</blockquote>
<br>
<br>
<div>-- <br>
Speechmatics is a trading name of Cantab Research Limited<br>
We are hiring: <a href="https://www.speechmatics.com/careers" target="_blank">www.speechmatics.com/careers</a><br>
Dr A J Robinson, Founder, Cantab Research Ltd<br>
Phone direct: 01223 794096, office: 01223 794497<br>
Company reg no GB 05697423, VAT reg no 925606030<br>
51 Canterbury Street, Cambridge, CB4 3QG, UK</div>
</blockquote>
<br>
<br>
<div>-- <br>
Speechmatics is a trading name of Cantab Research Limited<br>
We are hiring: <a href="https://www.speechmatics.com/careers" target="_blank">www.speechmatics.com/careers</a><br>
Dr A J Robinson, Founder, Cantab Research Ltd<br>
Phone direct: 01223 794096, office: 01223 794497<br>
Company reg no GB 05697423, VAT reg no 925606030<br>
51 Canterbury Street, Cambridge, CB4 3QG, UK</div>
</div></div></div>
<br>_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" rel="noreferrer" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
<br></blockquote></div><br></div>