[StarCluster] Is StarCluster still under active development?

Rajat Banerjee rajatb at post.harvard.edu
Fri Apr 1 10:44:14 EDT 2016


I see what you are saying, but I did not test the case with a range of
task IDs, so I did not see the problem you mentioned. The cache may have
been a premature optimization to avoid doing large pulls from jobstat once
every 30-60 seconds. When Justin and I were designing it, it seemed wise to
cache some amount of SGE's output instead of doing a full pull every time,
since it got very slow when there were >100 jobs.

Raj

On Sat, Mar 26, 2016 at 3:29 PM, Tony Robinson <tonyr at speechmatics.com>
wrote:

> Following up (and hopefully not talking to myself), I've found at least
> one problem with jobstats[].   The code says:
>
>     if l.find('jobnumber') != -1:
>         job_id = int(l[13:len(l)])
>     ...
>     hash = {'jobname': jobname, 'queued': qd, 'start': start,
>             'end': end}
>     self.jobstats[job_id % self.jobstat_cachesize] = hash
>
> So it doesn't take into account array jobs, which have a range of task
> IDs; it just counts one instance.
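
To make the collision concrete, here is a minimal sketch, not the StarCluster code itself. The record layout, CACHE_SIZE, and both function names are assumptions made for illustration; only the `job_id % cachesize` keying comes from the snippet quoted above.

```python
CACHE_SIZE = 1000  # stands in for self.jobstat_cachesize (value assumed)


def cache_by_job(records):
    """Key on the job number only, as in the quoted code."""
    cache = {}
    for rec in records:
        cache[rec['jobnumber'] % CACHE_SIZE] = rec
    return cache


def cache_by_task(records):
    """Key on (job number, task id) so each array task keeps its own slot."""
    cache = {}
    for rec in records:
        cache[(rec['jobnumber'], rec['taskid'])] = rec
    return cache


# One array job (number 42) with three tasks.
records = [{'jobnumber': 42, 'taskid': t, 'jobname': 'middle'}
           for t in (1, 2, 3)]

by_job = cache_by_job(records)    # the three tasks overwrite one another
by_task = cache_by_task(records)  # one entry per task
print(len(by_job), len(by_task))  # prints: 1 3
```

The composite key drops the modulo bound on the cache size, so a real fix would need some other way to cap or expire entries.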
>
> That explains why the estimated job duration is wrong.   The most obvious
> solution is just to get rid of the cache.   If compute time is a problem,
> keep ru_wallclock, since at the moment most of the time is spent
> converting time formats.
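
A rough sketch of the ru_wallclock idea, assuming qacct-style text in which records are separated by a line of "=" characters and each field is a "name value" line. The sample text and both function names are made up for illustration; this is not the balancer's actual parser.

```python
def parse_records(text):
    """Split qacct-style text into one dict per record (per task)."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith('====='):
            # Separator line: close the previous record, if any.
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if current:
        records.append(current)
    return records


def avg_wallclock(records):
    """Average ru_wallclock in seconds, with no date parsing at all."""
    secs = [float(r['ru_wallclock'].rstrip('s'))
            for r in records if 'ru_wallclock' in r]
    return sum(secs) / len(secs) if secs else 0.0


# Hypothetical sample: one array job with two tasks.
SAMPLE = """\
==============================================================
jobname      middle
jobnumber    42
taskid       1
ru_wallclock 30
==============================================================
jobname      middle
jobnumber    42
taskid       2
ru_wallclock 90
"""

parsed = parse_records(SAMPLE)
print(avg_wallclock(parsed))  # prints: 60.0
```

Because every record is kept, both tasks of the array job contribute to the average.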
>
> I'm also working on a gridEngine scheduler that works well with
> starcluster, that is, it keeps the most recently booted nodes (and master)
> the most loaded, so when you get to nearly the end of the hour there are
> nodes free that you can take down.   I've just got this going.   Next I'd
> like to distribute the load evenly across all nodes that are up (these are
> vCPUs, and a lightly loaded node runs much faster) unless they are near the
> end of the hour, and in that case make sure the ones nearest the end are
> empty.   I'm happy to go into details, but I fear there aren't that many
> users of starcluster who really care about getting things going efficiently
> for short running jobs (or the above bug would have been fixed), so I'm
> talking to myself.
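
One possible reading of that placement policy, as a sketch only. It assumes each node's billing hour is counted from its boot time, and the function names and the 50-minute drain threshold are invented for illustration; none of this is Tony's scheduler.

```python
def minutes_into_hour(boot_time, now):
    """Minutes elapsed in the node's current billing hour (epoch seconds in)."""
    return ((now - boot_time) % 3600) / 60.0


def placement_order(nodes, now, drain_after=50):
    """Return node names in the order new jobs should prefer them.

    nodes maps name -> boot time in epoch seconds. Nodes that are more than
    drain_after minutes into their billing hour are left out, so they can
    empty and be taken down before the next hour starts.
    """
    ranked = sorted((minutes_into_hour(boot, now), name)
                    for name, boot in nodes.items())
    return [name for minutes, name in ranked if minutes < drain_after]


# Hypothetical cluster state at now = 10000 seconds.
nodes = {'master': 0, 'node001': 6500, 'node002': 9400}
order = placement_order(nodes, now=10000)
print(order)  # prints: ['node002', 'master']
```

Here node001 is about 58 minutes into its hour, so it receives no new work, while node002, which started its hour most recently, is filled first.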
>
>
> Tony
>
>
> On 25/03/16 19:56, Tony Robinson wrote:
>
> Hi Rajat,
>
> The main issue that I have with the load balancer is that sometimes
> bringing up a node or taking down a node fails, and this causes the load
> balancer to fall over.   This is almost certainly an issue with boto - I
> just haven't looked into it enough.
>
> I'm working on the load balancer right now.   I'm running a few different
> sorts of jobs; some take half a minute, some take five minutes.   It takes
> me about five minutes to bring a node up, so load balancing is quite a hard
> task, and certainly what's there at the moment isn't optimal.
>
> In your master's thesis you had a go at anticipating the future load based
> on the queue, although I see no trace of this in the current code.   What
> seems like the most obvious approach to me is to look at what's running and
> in the queue and see if it's all going to complete within some specified
> period.   If it is, then fine; if not, assume you are going to bring n nodes
> up (start at n=1) and then see if it'll complete; if not, then increment n.
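
The loop described above could be sketched as follows. The capacity model is a deliberate simplification (total slot-seconds, ignoring how jobs pack onto slots), and the function name, parameters, and figures are all assumptions for illustration.

```python
def nodes_to_add(work_seconds, slots_now, slots_per_node,
                 period, boot_delay, max_new_nodes=20):
    """Smallest n for which the estimated work fits within the period.

    work_seconds: estimated remaining run time summed over running and
                  queued jobs (e.g. job count * average job duration).
    period:       target completion window in seconds.
    boot_delay:   seconds before a newly requested node can run jobs.
    """
    for n in range(max_new_nodes + 1):
        # Existing slots work for the whole period; new nodes only
        # contribute after they have booted.
        capacity = (slots_now * period
                    + n * slots_per_node * max(0, period - boot_delay))
        if capacity >= work_seconds:
            return n
    return max_new_nodes


# 100 queued jobs of about 5 minutes each, 8 slots up now, 8 slots per
# new node, a 15-minute target and a 5-minute boot time.
extra = nodes_to_add(work_seconds=100 * 300, slots_now=8, slots_per_node=8,
                     period=900, boot_delay=300)
print(extra)  # prints: 5
```

The n=0 case covers "if it is, then fine"; the estimate is only as good as the average job duration fed into work_seconds.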
>
> I've got a version of this running, but it isn't complete because
> avg_job_duration() consistently under-reports.   I'm doing some debugging,
> and it seems that jobstats[] has a bug: I have three types of job, a start,
> middle and end, and as they are all run in sequence jobstats[] should
> have equal numbers of each.   It doesn't.
>
> This is a weekend (with unreliable time) activity for me.   If you or
> anyone else wants to help:
>
> a) getting avg_job_duration() working, which probably means fixing
> jobstats[]
> b) getting a clean simple predictive load balancer working
>
> then please contact me.
>
>
> Tony
>
> On 25/03/16 17:17, Rajat Banerjee wrote:
>
> I'll fix any issues with the load balancer if they come up.
>
>
>
> --
> Speechmatics is a trading name of Cantab Research Limited
> We are hiring: www.speechmatics.com/careers
> Dr A J Robinson, Founder, Cantab Research Ltd
> Phone direct: 01223 794096, office: 01223 794497
> Company reg no GB 05697423, VAT reg no 925606030
> 51 Canterbury Street, Cambridge, CB4 3QG, UK
>
>