[StarCluster] Is StarCluster still under active development?
Tony Robinson
tonyr at speechmatics.com
Mon Jun 13 22:17:28 EDT 2016
I'm mostly there with my load balancer, so I thought that it would be
worth a write-up for those who are interested.
Just to restate my aims: I have a lot of jobs, each lasting about 90s,
and they arrive in very bursty fashion. So I can easily have 400 jobs
queued to run and *not* want to bring up any new nodes. I've somewhat
arbitrarily picked 240s as the maximum time I want any job to wait
before it starts running.
The first thing I needed to do was to reduce the poll time. As my jobs
take about 90s I need to poll much more frequently than that (basic
sampling theorem) - I picked 30s, although I am tempted to reduce this
further.
So the first improvement was to make the polling time a real polling
time, not a wait time. This is just setting a timer at the start of
the polling loop and sleeping for the appropriate time at the end.
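
In outline it's just this (a sketch of the idea, not the actual
balancer code - the names here are made up):

import time

POLL_INTERVAL = 30  # seconds; jobs last ~90s so we must sample faster than that

def run_balancer(poll_once):
    # Call poll_once() every POLL_INTERVAL seconds of wall-clock time,
    # rather than sleeping a full POLL_INTERVAL after each (possibly slow) poll.
    while True:
        start = time.monotonic()
        poll_once()   # inspect the queue, decide whether to add/remove nodes
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, POLL_INTERVAL - elapsed))
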
Also, to speed things up, I eliminated all settle time. We already know
how many nodes are up and how many jobs are running, so it's a simple
matter to assume that the difference (the free slots) will be filled
soon and to ignore that many jobs at the start of the job queue.
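
In other words, something like this (illustrative only - the arguments
are made-up stand-ins for counts the balancer already has):

def jobs_to_consider(queued_jobs, total_slots, running_jobs):
    # One queued job per free slot should start as soon as SGE next
    # schedules, so skip that many from the head of the queue.
    free_slots = max(0, total_slots - running_jobs)
    return queued_jobs[free_slots:]
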
As noted earlier in the thread, the existing code doesn't properly
sample the past job durations, so I simply reload the whole
lookback_window (default 3 hours) every time. It doesn't take long,
and if speed were ever an issue there's lots of code that converts
between date formats which could just be read once as integers.
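
The reload itself is just one qacct call over the whole window each
poll; roughly this (my own helper for illustration - the real code's
date handling and parsing differ):

import subprocess
from datetime import datetime, timedelta

def load_recent_accounting(lookback_hours=3):
    # Ask qacct for every job that started within the lookback window.
    # Parsing the output into per-job records is left out here.
    begin = datetime.now() - timedelta(hours=lookback_hours)
    cmd = ['qacct', '-j', '-b', begin.strftime('%Y%m%d%H%M')]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
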
I also read the job name. This allows me to calculate the mean and
variance of the duration of each job type. I am estimating the job
duration as mean + 0.5 * sqrt(var / njob), so once I have hundreds of
jobs of a type it's pretty much just the mean.
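
As a sketch (my own helper, not part of StarCluster):

import math
from collections import defaultdict

class DurationEstimator:
    # Track run times per job name; estimate() returns
    # mean + 0.5 * sqrt(var / n), which tends to the mean as n grows.
    def __init__(self):
        self.samples = defaultdict(list)

    def add(self, job_name, seconds):
        self.samples[job_name].append(seconds)

    def estimate(self, job_name):
        xs = self.samples[job_name]   # assumes at least one sample for this name
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / n
        return mean + 0.5 * math.sqrt(var / n)
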
With all of this in place I can estimate how long it's going to take to
run the queue. At the moment I'm ignoring the timings of running jobs
and assuming queued jobs come from the same distribution, so the time
taken to run is the job duration (calculated above) divided by the
number of slots available. I know how long each job has been waiting
and how long I expect each job to wait, so I can see whether any job
would exceed my maximum job wait time. If one would, I assume that
I've got another node up instantly (clearly false - they take more than
my 240s to come up) and rerun the calculation until I know how many
nodes I need to add. In practice, it takes so long to boot nodes that
there's no point trying to bring more than 2 or 4 up; this is because
they come up serially, and all of this will change when I update
StarCluster.
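
In simplified form the check is something like this (numbers and names
are illustrative, and it ignores the settle-time shortcut above):

MAX_WAIT = 240   # seconds: the most any job should wait before it starts
MAX_EXTRA = 4    # nodes boot slowly and serially, so never ask for more

def nodes_to_add(queued, slots, estimate, slots_per_node=1):
    # queued: list of (job_name, seconds_waited_so_far), oldest first.
    # estimate: callable giving the expected duration of a job by name.
    # Pretend each extra node is up instantly (optimistic) and keep adding
    # until no job's predicted wait exceeds MAX_WAIT, capped at MAX_EXTRA.
    for extra in range(MAX_EXTRA + 1):
        capacity = slots + extra * slots_per_node
        backlog = 0.0    # predicted queueing time ahead of the next job
        ok = True
        for name, waited in queued:
            if waited + backlog > MAX_WAIT:
                ok = False
                break
            backlog += estimate(name) / capacity
        if ok:
            return extra
    return MAX_EXTRA
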
The other important change is to always load the most loaded machine,
with a slight preference for low node numbers when all are empty (so
that the master gets loaded first). This is really important: with such
variable load you've got to bring nodes down at the end of the hour, so
you need as many nodes as possible to be completely empty (there's a
sketch of that check after the settings below). These are the SGE
scheduler settings I'm using (set with qconf -msconf):
algorithm default
schedule_interval 0:0:05
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula host_rank-load_short*256
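
And on the teardown side, the end-of-hour rule I mentioned above is
conceptually just this (the window here is made up, and the real
balancer uses its own node bookkeeping):

def can_terminate(node_uptime_seconds, node_is_idle,
                  window_start=45 * 60, window_end=58 * 60):
    # Only remove a node if it is running nothing and its uptime modulo
    # one (billed) hour falls near the end of that hour.
    into_hour = node_uptime_seconds % 3600
    return node_is_idle and window_start <= into_hour <= window_end
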
So that's about it. Just recently I had 496 slots running: the above
algorithm brought up just enough nodes to cope with the load and brought
some of them down 15 minutes later when the load decreased.

Overall we've doubled our efficiency using this algorithm; that is, we
can provide the same quality of service at half the variable cost.
The code is not in a state to be shared publicly, but I'm happy to share
it privately. It really needs a StarCluster upgrade (thanks Mich for the
email) so that nodes can be brought up in parallel; we should do this
shortly and then I'll work on this some more.

Tony
On 03/04/16 13:41, Tony Robinson wrote:
> Okay, I've found another bug with the load balancer which explains why
> avg_job_duration() was getting shorter and shorter.
>
> get_qatime() initially loads the whole (3 hours of) history, but after
> that it sets temp_lookback_window = self.polling_interval.
>
> The problem with this is that self.polling_interval has to be much shorter
> than a job duration (it's got to be able to keep up), and the -b option
> to qacct sets "the earliest start time for jobs to be summarized", so
> it only selects jobs that were started recently and have already finished
> (so that they get into qacct) - hence they must be the very short jobs.
> So the cache is originally populated quite reasonably, but afterwards it
> only gets updated with very short jobs; the long ones never get into
> the cache.
>
> As I say below, I don't think any of this code is used anyway so it
> doesn't matter too much that it's all broken.
>
> I'll progress with my (weekend and part-time) clean-up and
> implementation of a true predictive load balancer. I have both (a) the
> mean and variance for all job types and (b) working code that assumes
> avg_job_duration() is correct, so it's probably only another day's work
> to get it solid (or a month or two of elapsed time - I'm done for this
> weekend).
>
>
> Tony
>
> On 01/04/16 17:01, Tony Robinson wrote:
>> On 01/04/16 16:22, Rajat Banerjee wrote:
>>> Regarding:
>>> How about we just call qacct every 5 mins, or if the qacct buffer is
>>> empty.
>>> Calling qacct and getting the job stats is the first part of the
>>> load balancer's loop to see what the cluster is up to. I prioritized
>>> knowing the current state, and keeping the LB running its loop as
>>> fast as possible (2-10 seconds), so it could run in a 1-minute loop
>>> and stay roughly on schedule. It's easy to run the whole LB loop
>>> with 5 minutes between loops with the command-line arg
>>> polling_interval, if that suits your workload better. I do not mean
>>> to sound dismissive, but the command-line options (with reasonable
>>> defaults) are there so you can test and tweak to your workload.
>>
>> Ah, I wasn't very clear. What I mean is that we only update the
>> qacct stats every 5 minutes. I run the main loop every 30s.
>>
>> But calling qacct doesn't take any time - we could do it every
>> polling interval:
>>
>> root@master:~# date
>> Fri Apr 1 16:54:31 BST 2016
>> root@master:~# echo qacct -j -b `date +%y%m%d`$((`date +%H` - 3))`date +%m`
>> qacct -j -b 1604011304
>> root@master:~# time qacct -j -b `date +%y%m%d`$((`date +%H` - 3))`date +%m` | wc
>>   99506  224476 3307423
>>
>> real    0m0.588s
>> user    0m0.560s
>> sys     0m0.076s
>> root@master:~#
>>
>>
>> If calling qacct is slow then the update could be run at the end of
>> the loop, so that it has the whole loop wait time in which to complete.
>>
>>> Regarding:
>>> Three sorts of jobs, all of which should occur in the same numbers,
>>> Have you tried testing your call to qacct to see if it's returning
>>> what you want? You could modify it in your source if it's not
>>> representative of your jobs:
>>> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py#L528
>>> qacct_cmd = 'qacct -j -b ' + qatime
>>
>> Yes, thanks, I'm comparing to running qacct outside of the load balancer.
>>
>>> Obviously one size doesn't fit all here, but if you find a set of
>>> args for qacct that work better for you, let me know.
>>
>> At the moment I don't think that the output of qacct is used at all,
>> is it? I thought it was only used to report job stats; I don't think
>> it's really used to bring nodes up/down.
>>
>>
>> Tony
>>
--
Speechmatics is a trading name of Cantab Research Limited
We are hiring: www.speechmatics.com/careers
<https://www.speechmatics.com/careers>
Dr A J Robinson, Founder, Cantab Research Ltd
Phone direct: 01223 794096, office: 01223 794497
Company reg no GB 05697423, VAT reg no 925606030
51 Canterbury Street, Cambridge, CB4 3QG, UK