<div dir="ltr"><div>I see what you are saying but did not test the case with a range of taskid's , so I did not see the problem you mentioned. The cache may have been a premature optimization to avoid doing large pulls from jobstat once every 30-60 seconds. When Justin and I were designing it, it seemed wise to cache some amount of SGE's output instead of doing a full pull every time, and it got very slow when there were >100 jobs.<br><br></div>Raj<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 26, 2016 at 3:29 PM, Tony Robinson <span dir="ltr"><<a href="mailto:tonyr@speechmatics.com" target="_blank">tonyr@speechmatics.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>Following up (and hopefully not talking
to myself), I've found at least one problem with jobstats[]. The
code says:<br>
<br>
<pre>
if l.find('jobnumber') != -1:
    job_id = int(l[13:len(l)])
...
hash = {'jobname': jobname, 'queued': qd, 'start': start, 'end': end}
self.jobstats[job_id % self.jobstat_cachesize] = hash
</pre>
<br>
So it doesn't take into account array jobs, which have a range of
taskids - it just counts one instance for the whole array.<br>
<br>
That explains why the estimated job duration is wrong. The most
obvious solution is just to get rid of the cache. If compute time is
a problem, cache only ru_wallclock, as at the moment most of the time
is spent converting time formats.<br>
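<br>
For illustration, here's a rough sketch of a cache-free parse keyed on
(jobnumber, taskid) so array tasks each get counted - this is not the
actual starcluster code, and I'm assuming qacct-style "name value"
lines with taskid reading "undefined" for non-array jobs:<br>
<pre>
# Sketch: key each qacct record on (jobnumber, taskid) so every array
# task gets its own entry instead of overwriting the previous one.
def parse_qacct(output):
    stats = {}
    rec = {}
    for l in output.split('\n'):
        l = l.strip()
        if l.startswith('====='):       # record separator in qacct output
            if 'jobnumber' in rec:
                stats[(rec['jobnumber'], rec.get('taskid', 0))] = rec
            rec = {}
        elif l.startswith('jobnumber'):
            rec['jobnumber'] = int(l.split()[1])
        elif l.startswith('taskid'):
            v = l.split()[1]
            rec['taskid'] = 0 if v == 'undefined' else int(v)
        elif l.startswith('ru_wallclock'):
            rec['ru_wallclock'] = float(l.split()[1])   # already in seconds
    if 'jobnumber' in rec:
        stats[(rec['jobnumber'], rec.get('taskid', 0))] = rec
    return stats
</pre>
With every task keyed separately, avg_job_duration() gets one sample
per array task rather than one per array job.<br>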
<br>
I'm also working on a gridEngine scheduler that works well with
starcluster: it keeps the most recently booted nodes (and the master)
the most heavily loaded, so that near the end of the hour there are
free nodes you can take down. I've just got this going. Next I'd
like to distribute the load evenly across all nodes that are up
(these are vCPUs, and lightly loaded nodes run much faster), unless
they are near the end of the hour, in which case make sure the ones
nearest the end are empty. I'm happy to go into details, but I fear
there aren't many users of starcluster who really care about getting
short-running jobs scheduled efficiently (or the above bug would have
been fixed), so I'm talking to myself.<br>
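<br>
Roughly, the node ordering I have in mind is something like this -
just a sketch, with launch_time and the drain window standing in for
whatever the node objects and config actually provide:<br>
<pre>
# Sketch: fill the most recently booted nodes first, and stop placing
# work on nodes that are close to their hourly billing boundary.
# node.launch_time is a placeholder for the real attribute.
import datetime

def uptime_minutes(node, now=None):
    now = now or datetime.datetime.utcnow()
    return (now - node.launch_time).total_seconds() / 60.0

def placement_order(nodes, drain_window=10):
    """Order nodes for job placement: nodes within drain_window minutes
    of their next hourly boundary go last (so they drain and can be
    taken down); otherwise the most recently booted nodes go first."""
    def key(node):
        up = uptime_minutes(node)
        near_end = (up % 60.0) >= (60.0 - drain_window)
        return (near_end, up)    # False sorts before True; newest first
    return sorted(nodes, key=key)
</pre>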
<br>
<br>
Tony<div><div class="h5"><br>
<br>
On 25/03/16 19:56, Tony Robinson wrote:<br>
</div></div></div><div><div class="h5">
<blockquote type="cite">
<div>Hi Rajat,<br>
<br>
The main issue that I have with the load balancer is that sometimes
bringing up or taking down a node fails, and this causes the
loadbalancer to fall over. This is almost certainly an issue with
boto - I just haven't looked into it enough.<br>
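<br>
A band-aid I've been considering (only a sketch - the add/remove node
calls and the choice of boto exception to catch are placeholders) is
to wrap the grow/shrink calls so a failed EC2 request gets logged and
retried on the next polling cycle rather than killing the balancer:<br>
<pre>
# Sketch: don't let one failed EC2 call crash the whole balancer loop.
import logging
from boto.exception import BotoServerError

log = logging.getLogger("loadbalancer")

def safe_call(action, *args, **kwargs):
    """Run an add-node/remove-node action; on an EC2-side error, log it
    and let the next polling cycle retry instead of falling over."""
    try:
        return action(*args, **kwargs)
    except BotoServerError as e:
        log.warning("EC2 call failed, will retry next cycle: %s", e)
        return None
</pre>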
<br>
I'm working on the loadbalancer right now. I'm running a few
different sorts of jobs: some take half a minute, some take five
minutes. It takes me about five minutes to bring a node up, so load
balancing is quite a hard task, and certainly what's there at the
moment isn't optimal.<br>
<br>
In your master's thesis you had a go at anticipating the future load
based on the queue, although I see no trace of this in the current
code. What seems like the most obvious approach to me is to look at
what's running and what's in the queue and check whether it will all
complete within some specified period. If it will, then fine; if
not, assume you are going to bring n nodes up (starting at n=1),
check again whether it will all complete, and if not increment n.<br>
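<br>
In code, roughly this - a sketch only, where the per-job durations
would come from avg_job_duration() and I'm assuming perfect packing
of jobs onto slots:<br>
<pre>
# Sketch of the "increment n until everything finishes in time" idea.
def nodes_needed(queued_secs, running_secs, current_nodes,
                 slots_per_node, lookahead_secs, max_extra=20):
    """Smallest number of extra nodes so that all queued plus running
    work should complete within lookahead_secs (optimistic packing)."""
    work = sum(queued_secs) + sum(running_secs)   # seconds of work left
    for extra in range(max_extra + 1):
        slots = (current_nodes + extra) * slots_per_node
        if slots and work / float(slots) <= lookahead_secs:
            return extra
    return max_extra
</pre>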
<br>
I've got a version of this running, but it isn't complete because
avg_job_duration() consistently under-reports. I'm doing some
debugging, and it seems that jobstats[] has a bug: I have three types
of job (a start, a middle and an end), and as they are all run in
sequence, jobstats[] should have equal numbers of each. It doesn't.<br>
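<br>
(The check itself is nothing more than counting jobstats entries per
jobname - sketch below, assuming jobstats is a list of per-job dicts
with unused slots left empty:)<br>
<pre>
# Sketch: with start/middle/end jobs run in sequence, the per-jobname
# counts in jobstats should come out equal.
from collections import Counter

def jobname_counts(jobstats):
    return Counter(j['jobname'] for j in jobstats if j)
</pre>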
<br>
This is a weekend activity for me (with unreliable time). If you or
anyone else wants to help with:<br>
<br>
a) getting avg_job_duration() working, which probably means fixing
jobstats[]<br>
b) getting a clean simple predictive load balancer working<br>
<br>
then please contact me.<br>
<br>
<br>
Tony<br>
<br>
On 25/03/16 17:17, Rajat Banerjee wrote:<br>
</div>
<blockquote type="cite">I'll fix any issues with the load balancer if they
come up.</blockquote>
<br>
<br>
<div>-- <br>
Speechmatics is a trading name of Cantab Research Limited<br>
We are hiring: <a href="https://www.speechmatics.com/careers" target="_blank">www.speechmatics.com/careers</a><br>
Dr A J Robinson, Founder, Cantab Research Ltd<br>
Phone direct: 01223 794096, office: 01223 794497<br>
Company reg no GB 05697423, VAT reg no 925606030<br>
51 Canterbury Street, Cambridge, CB4 3QG, UK</div>
</blockquote>
<br>
<br>
<div>-- <br>
Speechmatics is a trading name of Cantab Research Limited<br>
We are hiring: <a href="https://www.speechmatics.com/careers" target="_blank">www.speechmatics.com/careers</a><br>
Dr A J Robinson, Founder, Cantab Research Ltd<br>
Phone direct: 01223 794096, office: 01223 794497<br>
Company reg no GB 05697423, VAT reg no 925606030<br>
51 Canterbury Street, Cambridge, CB4 3QG, UK</div>
</div></div></div>
<br>_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" rel="noreferrer" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
<br></blockquote></div><br></div>