[StarCluster] Is StarCluster still under active development?
Tony Robinson
tonyr at speechmatics.com
Mon Apr 4 03:08:26 EDT 2016
Sorry Raj, I don't mean to seem "cranky that the ELB isn't meeting your
predictive load balancing needs". It's clear that the released code
doesn't do predictive load balancing; I'm just trying to get it working
so that it can answer the question "Are there so many jobs that the
queue would have items after 5 minutes?" (your master's thesis, figure 9).
Thanks for your help so far, and of course for writing and releasing the
code in the first place. I do hope I can get this finished and that
users will find it useful even when job durations have a high variance
(I'm relying on having lots of jobs running to bring the variance under
control).
Tony
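
The question above ("would the queue still have items after 5 minutes?") can be made concrete with a rough throughput estimate. This is a minimal sketch, not the StarCluster load balancer's code: the function name queue_nonempty_after, its parameters, and the assumptions that each slot completes one job per mean job duration and that no new jobs arrive are all hypothetical.

```python
def queue_nonempty_after(queued_jobs, running_slots, mean_job_secs, horizon_secs=300):
    """Estimate whether queued jobs would remain after horizon_secs.

    Hypothetical helper: assumes each slot finishes one job every
    mean_job_secs on average and that no new jobs arrive meanwhile.
    """
    if mean_job_secs <= 0:
        raise ValueError("mean_job_secs must be positive")
    if running_slots <= 0:
        # With no slots, any queued job is still waiting at the horizon.
        return queued_jobs > 0
    # Expected number of queued jobs the current slots can start in the horizon.
    capacity = running_slots * horizon_secs / mean_job_secs
    return queued_jobs > capacity
```

With many jobs running, the mean is a more stable estimate of per-slot throughput, which is the effect the message relies on to keep the variance under control.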
On 04/04/16 04:37, Rajat Banerjee wrote:
> Hi Tony,
> Interesting findings. Yes, that does seem like a bug with
> polling_interval. All of the logic to make a load balancing decision
> is done in _eval_add_node and doesn't consider average job duration.
> It was a design decision at the time: either look at the queue and
> current slot count to see if throughput meets the parameter
> longest_allowed_queue_time, or try to predict job durations. We
> surveyed some of the most active users at that time and most said
> their job sizes were varying and unpredictable and not amenable to
> 'prediction'. So we went with the former.
>
> Thanks for sharing your findings, but I'm not sure why you seem cranky
> that the ELB isn't meeting your predictive load balancing needs and
> has some bugs in unused code? It's open source for a reason, so people
> can adapt and improve it.
> best,
> Raj
>
> On Sun, Apr 3, 2016 at 8:41 AM, Tony Robinson <tonyr at speechmatics.com> wrote:
>
> Okay, I've found another bug with the load balancer which explains
> why avg_job_duration() was getting shorter and shorter.
>
> get_qatime() initially loads the whole (3 hours) history, but
> after that sets temp_lookback_window = self.polling_interval
>
> The problem with this is that self.polling_interval has to be much
> shorter than a job duration (it's got to be able to keep up), and
> the -b option to qacct sets "The earliest start time for jobs to
> be summarized", so it only selects jobs that both started recently
> and have already finished (so that they get into qacct); hence they
> must be the very short jobs. The cache is therefore populated quite
> reasonably at first, but after that it is only updated with very
> short jobs; the long ones never get into the cache.
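
The selection effect described in the paragraph above can be reproduced with a small sketch. It does not call qacct or use StarCluster code; jobs_seen_by_lookback and mean_duration are hypothetical stand-ins, and the job times are made-up illustrative values in seconds.

```python
def jobs_seen_by_lookback(jobs, now, lookback_secs):
    """Return jobs that started within the lookback window and have finished.

    jobs is a list of (start_time, end_time) pairs in seconds; end_time is
    None for jobs still running. Hypothetical stand-in for a qacct -b query.
    """
    earliest_start = now - lookback_secs
    return [
        (start, end)
        for start, end in jobs
        if start >= earliest_start and end is not None and end <= now
    ]


def mean_duration(jobs):
    """Mean of end - start over the given jobs, or None if there are none."""
    durations = [end - start for start, end in jobs]
    return sum(durations) / len(durations) if durations else None


# Illustrative data: one long job and two short ones, all finished by t=3600.
jobs = [(0, 3590), (3550, 3580), (3560, 3595)]
full_history = mean_duration(jobs_seen_by_lookback(jobs, now=3600, lookback_secs=3 * 3600))
short_window = mean_duration(jobs_seen_by_lookback(jobs, now=3600, lookback_secs=60))
```

Any job returned for a 60-second window must have both started and finished inside that window, so its duration cannot exceed 60 seconds; the long job only appears when the window reaches back to its start time.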
>
> As I say below, I don't think any of this code is used anyway so
> it doesn't matter too much that it's all broken.
>
> I'll progress with my (weekend and part-time) clean-up and
> implementation of a true predictive load balancer. I have both (a)
> mean and variance for all job types and (b) working code assuming
> that avg_job_duration() is correct, so it's probably only another
> day's work to get solid (or a month or two of elapsed time; I'm
> done for this weekend).
>
>
> Tony
>
>
> On 01/04/16 17:01, Tony Robinson wrote:
>> On 01/04/16 16:22, Rajat Banerjee wrote:
>>> Regarding:
>>> How about we just call qacct every 5 mins, or if the qacct
>>> buffer is empty.
>>> Calling qacct and getting the job stats is the first part of the
>>> load balancer's loop to see what the cluster is up to. I
>>> prioritized knowing the current state, and keeping the LB
>>> running its loop as fast as possible (2-10 seconds), so it
>>> could run in a 1-minute loop and stay roughly on-schedule. It's
>>> easy to run the whole LB loop with 5 minutes between loops with
>>> the command line arg polling_interval, if that suits your
>>> workload better. I do not mean to sound dismissive, but the
>>> command line options (with reasonable defaults) are there so you
>>> can test and tweak to your workload.
>>
>> Ah, I wasn't very clear. What I mean is that we only update the
>> qacct stats every 5 minutes. I run the main loop every 30s.
>>
>> But calling qacct doesn't take any time - we could do it every
>> polling interval:
>>
>> root@master:~# date
>> Fri Apr 1 16:54:31 BST 2016
>> root@master:~# echo qacct -j -b `date +%y%m%d`$((`date +%H` - 3))`date +%m`
>> qacct -j -b 1604011304
>> root@master:~# time qacct -j -b `date +%y%m%d`$((`date +%H` - 3))`date +%m` | wc
>> 99506 224476 3307423
>>
>> real 0m0.588s
>> user 0m0.560s
>> sys 0m0.076s
>> root@master:~#
>>
>>
>> If calling qacct is slow then the update could be run at the end
>> of the loop so it would have all of the loop wait time to
>> complete in.
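
The -b argument in the shell session above is assembled from separate date calls. Its last field prints 04 at 16:54, which matches the month (%m) rather than the minute (%M), and subtracting 3 from the hour would go negative before 03:00 and loses zero padding. A minimal sketch of a sturdier alternative follows, assuming qacct accepts a YYMMDDhhmm start time as the quoted command implies; qacct_begin_time and qacct_command are hypothetical names, not part of StarCluster.

```python
from datetime import datetime, timedelta


def qacct_begin_time(now, lookback=timedelta(hours=3)):
    """Format (now - lookback) as YYMMDDhhmm for use with qacct -b.

    Doing the subtraction on a datetime handles day and month rollover
    and keeps every field zero-padded.
    """
    return (now - lookback).strftime("%y%m%d%H%M")


def qacct_command(now, lookback=timedelta(hours=3)):
    """Build the argument list for a qacct job summary over the lookback."""
    return ["qacct", "-j", "-b", qacct_begin_time(now, lookback)]
```

For example, qacct_command(datetime.now()) gives an argument list that could be passed to subprocess.run on the master node.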
>>
>>> Regarding:
>>> Three sorts of jobs, all of which should occur in the same numbers,
>>> Have you tried testing your call to qacct to see if it's
>>> returning what you want? You could modify it in your source if
>>> it's not representative of your jobs:
>>> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L528
>>> qacct_cmd = 'qacct -j -b ' + qatime
>>
>> Yes, thanks, I'm comparing to running qacct outside of the load
>> balancer.
>>
>>> Obviously one size doesn't fit all here, but if you find a set
>>> of args for qacct that work better for you, let me know.
>>
>> At the moment I don't think that the output of qacct is used at
>> all, is it? I thought it was only used to give job stats; I
>> don't think it's really used to bring nodes up/down.
>>
>>
>> Tony
>>
>> --
>> Speechmatics is a trading name of Cantab Research Limited
>> We are hiring: www.speechmatics.com/careers
>> <https://www.speechmatics.com/careers>
>> Dr A J Robinson, Founder, Cantab Research Ltd
>> Phone direct: 01223 794096, office: 01223 794497
>> Company reg no GB 05697423, VAT reg no 925606030
>> 51 Canterbury Street, Cambridge, CB4 3QG, UK
>
>