[StarCluster] minimal cost with loadbalance

MacMullan, Hugh hughmac at wharton.upenn.edu
Thu May 29 10:29:02 EDT 2014


You can use the method from Stephan's blog as well -- it might make for an easier plugin than the seq_no approach:

http://wiki.gridengine.info/wiki/index.php/StephansBlog

As a proof of concept, I did NOT create a new plugin but modified the existing SGE plugin instead (the sge.py template and the sge.py plugin code) -- probably not a great solution in the long run, but it works as expected. Feel free to create your own plugin from these mods. It would be cool if this (or the seq_no approach) were in StarCluster already, so that users would only need to modify their scheduler config to force this 'fill up' behavior.

$ diff templates/sge.py.dist templates/sge.py
88a89,100
>
> sge_exec_template = """
> hostname              %s
> load_scaling          NONE
> complex_values        slots=%s
> user_lists            NONE
> xuser_lists           NONE
> projects              NONE
> xprojects             NONE
> usage_scaling         NONE
> report_variables      NONE
> """
$ diff plugins/sge.py.dist plugins/sge.py
106a107,111
>             master = self._master
>             execconf = master.ssh.remote_file("/tmp/execconf.txt", "w")
>             execconf.write(sge.sge_exec_template % (node.alias, num_slots))
>             execconf.close()
>             master.ssh.execute('qconf -Me %s' % execconf.name)
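The diff fragments above only show the changed lines; the essential logic is small enough to sketch as a self-contained helper. The function and variable names below are illustrative, not StarCluster's own -- in the real plugin the rendered config is written to a temp file on the master and applied with `qconf -Me`:

```python
# Sketch of the template/rendering step from the diff above.
# The plugin writes this rendered exec host config to a file on the
# master node and applies it via `qconf -Me <file>`.

SGE_EXEC_TEMPLATE = """\
hostname              %s
load_scaling          NONE
complex_values        slots=%s
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
"""

def render_exec_conf(alias, num_slots):
    """Render the exec host configuration for one node.

    Setting complex_values slots=<n> caps the consumable slot count,
    which is what the load_formula 'slots' trick sorts on.
    """
    return SGE_EXEC_TEMPLATE % (alias, num_slots)

def qconf_command(conf_path):
    """Command the plugin runs on the master to apply the config."""
    return "qconf -Me %s" % conf_path

print(render_exec_conf("node001", 8))
print(qconf_command("/tmp/execconf.txt"))
```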

For this to work, the SGE scheduler configuration needs adjusting as well (qconf -msconf). I didn't automate that in StarCluster, as this is just a proof of concept and the master stays up anyway:

algorithm                         default
schedule_interval                 0:2:0
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              NONE
load_adjustment_decay_time        0:0:0
load_formula                      slots
schedd_job_info                   true
flush_submit_sec                  1
flush_finish_sec                  1

Cheers,
-Hugh


From: starcluster-bounces at mit.edu [mailto:starcluster-bounces at mit.edu] On Behalf Of Rayson Ho
Sent: Thursday, May 29, 2014 7:54 AM
To: David Mrva
Cc: starcluster at mit.edu
Subject: Re: [StarCluster] minimal cost with loadbalance

You can set the Grid Engine "queue_sort_method" parameter to "seq_no" in sched_conf:

http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
And for this to work, we need each instance to have a different "seq_no", so a small StarCluster plugin will need to be developed -- i.e., the plugin will assign a new seq_no when an instance gets created.
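Such a plugin could look roughly like the sketch below: derive a distinct seq_no from each node's alias and attach it to the queue with a host-specific entry. The helper names are hypothetical, and the `qconf -aattr` invocation is my assumption about the host-specific list syntax, not tested code; in a real StarCluster plugin the command would be run from `on_add_node()` via `master.ssh.execute()`.

```python
import re

def seq_no_for(alias):
    """Derive a distinct sequence number from a StarCluster node alias
    (master -> 0, node001 -> 1, node002 -> 2, ...)."""
    if alias == "master":
        return 0
    match = re.match(r"node(\d+)$", alias)
    if match is None:
        raise ValueError("unexpected alias: %s" % alias)
    return int(match.group(1))

def qconf_seq_no_command(alias, queue="all.q"):
    """Build the qconf command adding a host-specific seq_no entry to
    the queue (assumes -aattr accepts the [host=value] list syntax)."""
    return 'qconf -aattr queue seq_no "[%s=%d]" %s' % (
        alias, seq_no_for(alias), queue)

# In a plugin's on_add_node() this would be roughly:
#   master.ssh.execute(qconf_seq_no_command(node.alias))
print(qconf_seq_no_command("node002"))
```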

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Thu, May 29, 2014 at 3:10 AM, David Mrva <davidm at cantabresearch.com<mailto:davidm at cantabresearch.com>> wrote:
Hello,

I started using StarCluster with Amazon spot instances. I expect that the
workload of my application will fluctuate a lot and I aim to minimise
the cost of running the spot instances. StarCluster's loadbalancer seems
to go some way in this direction. It adds more spot instances when the
SGE queue is busy and removes unused nodes. The removal of the nodes
interacts with SGE's strategy for assigning jobs to queues. SGE chooses
the node with the lowest load average to assign a job to. If there are
more nodes in the cluster than are necessary to execute the jobs, this
strategy will result in spreading the jobs that need to be executed
across as many nodes as possible. This behaviour reduces the chances of
some of the nodes staying unused and potentially being removed by the
load balancer.

I'd like to configure StarCluster in such a way that SGE jobs go to node
A for as long as there are slots available on it and they go to node B
only if there is no vacant slot on node A. For example, on a cluster
with nodes A and B and 8 slots on each node if there are 4 slots being
used on node A and 4 more jobs arrive to SGE, I'd like all 4 of these
new jobs to go to node A. Using the "orte" parallel environment with
"fill_up" allocation strategy does not achieve this. For the above
example, using the "fill_up" allocation strategy will pick node B
(lowest load average node) and assign all 4 new jobs to it, resulting in
nodes A and B running 4 jobs each instead of A running 8 jobs and B none.

How can I use StarCluster's built-in load balancer to minimise the cost
of running spot instances by minimising the number of unused CPUs in the
way described above?

Many thanks,
David
_______________________________________________
StarCluster mailing list
StarCluster at mit.edu<mailto:StarCluster at mit.edu>
http://mailman.mit.edu/mailman/listinfo/starcluster

