[StarCluster] minimal cost with loadbalance
MacMullan, Hugh
hughmac at wharton.upenn.edu
Thu May 29 10:29:02 EDT 2014
You can use the StephansBlog method as well. It might be an easier plugin to write than the seq_no one:
http://wiki.gridengine.info/wiki/index.php/StephansBlog
As a proof of concept, I did NOT create a new plugin, but modified the sge plugin instead (the sge.py template and the sge.py plugin code). That is probably not a great solution in the long run, but it works as expected. Feel free to create your own plugin from these mods. It would be cool if this (or the seq_no bit) were already in StarCluster, so that users would only need to modify their scheduler config to force this "fill up" behavior.
$ diff templates/sge.py.dist templates/sge.py
88a89,100
>
> sge_exec_template = """
> hostname %s
> load_scaling NONE
> complex_values slots=%s
> user_lists NONE
> xuser_lists NONE
> projects NONE
> xprojects NONE
> usage_scaling NONE
> report_variables NONE
> """
$ diff plugins/sge.py.dist plugins/sge.py
106a107,111
> master = self._master
> execconf = master.ssh.remote_file("/tmp/execconf.txt", "w")
> execconf.write(sge.sge_exec_template % (node.alias, num_slots))
> execconf.close()
> master.ssh.execute('qconf -Me %s' % execconf.name)
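To see what the added plugin lines hand to qconf -Me, the template can be rendered on its own. This is a minimal sketch that reuses the template text from the diff above. The alias node001 and the slot count 8 are illustrative values, and render_exec_conf is a hypothetical helper, not part of StarCluster.

```python
# Template text copied from the sge.py template diff above.
sge_exec_template = """
hostname %s
load_scaling NONE
complex_values slots=%s
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
"""


def render_exec_conf(alias, num_slots):
    """Return the exec host configuration text for one node.

    Hypothetical helper for illustration; the plugin diff writes the
    same string to a remote file on the master instead.
    """
    return sge_exec_template % (alias, num_slots)


if __name__ == "__main__":
    # Illustrative values only.
    print(render_exec_conf("node001", 8))
```

The key line in the rendered text is complex_values slots=N, which makes the slot count a per-host consumable that the scheduler can use in its load formula.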
For this to work, the SGE scheduler configuration needs to be adjusted as well (qconf -msconf). I didn't do that in StarCluster, as this is just a proof of concept and the master stays up anyway:
algorithm default
schedule_interval 0:2:0
maxujobs 0
queue_sort_method load
job_load_adjustments NONE
load_adjustment_decay_time 0:0:0
load_formula slots
schedd_job_info true
flush_submit_sec 1
flush_finish_sec 1
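The effect of these settings can be illustrated with a toy model. This is only a sketch of the ordering idea, not SGE's scheduler. It assumes that with queue_sort_method load and load_formula slots, hosts are tried in ascending order of their remaining free slots, so the fullest host that still has room is picked first. The node names and slot counts are taken from the example in the original question, and place_jobs is a hypothetical function.

```python
def place_jobs(free_slots, num_jobs):
    """Assign single-slot jobs one at a time to the host with the fewest
    free slots that still has room (toy model of load_formula=slots)."""
    free = dict(free_slots)  # copy so the caller's dict is not modified
    placement = []
    for _ in range(num_jobs):
        candidates = [h for h in free if free[h] > 0]
        if not candidates:
            break  # cluster is full; remaining jobs would stay queued
        # Lowest "load" (= remaining slots) wins; ties broken by host name.
        host = min(candidates, key=lambda h: (free[h], h))
        free[host] -= 1
        placement.append(host)
    return placement, free


if __name__ == "__main__":
    # Node A has 4 of its 8 slots in use, node B is idle; 4 new jobs arrive.
    placement, free = place_jobs({"A": 4, "B": 8}, 4)
    print(placement)
    print(free)
```

In this model all four new jobs land on node A, and node B stays empty and remains a candidate for removal by the load balancer.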
Cheers,
-Hugh
From: starcluster-bounces at mit.edu [mailto:starcluster-bounces at mit.edu] On Behalf Of Rayson Ho
Sent: Thursday, May 29, 2014 7:54 AM
To: David Mrva
Cc: starcluster at mit.edu
Subject: Re: [StarCluster] minimal cost with loadbalance
You can set the Grid Engine "queue_sort_method" parameter to "seq_no" in sched_conf:
http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
And for this to work, each instance needs to have a different "seq_no", so a small StarCluster plugin will need to be developed, i.e. the plugin will assign a new seq_no when an instance gets created.
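As a rough illustration of what such a plugin would need to compute, the sketch below derives a distinct sequence number from each node's alias and then picks the host with the lowest number that still has a free slot. It assumes StarCluster's default aliases (master, node001, node002, ...). The names seq_no_for_alias and pick_host are hypothetical, and the code does not call qconf or any StarCluster API.

```python
import re


def seq_no_for_alias(alias):
    """Map 'master' to 0 and 'nodeNNN' to NNN (assumed alias scheme)."""
    if alias == "master":
        return 0
    match = re.fullmatch(r"node(\d+)", alias)
    if match is None:
        raise ValueError("unexpected alias: %r" % alias)
    return int(match.group(1))


def pick_host(free_slots):
    """Return the alias with the lowest seq_no that has a free slot,
    or None if every host is full (toy model of seq_no ordering)."""
    candidates = [a for a, free in free_slots.items() if free > 0]
    if not candidates:
        return None
    return min(candidates, key=seq_no_for_alias)


if __name__ == "__main__":
    # master is full, so the next job goes to node001 rather than node002.
    print(pick_host({"master": 0, "node001": 2, "node002": 8}))
```

A real plugin would additionally have to write each number into the queue configuration when the node is added. That step is left out here because it depends on the queue layout.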
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
On Thu, May 29, 2014 at 3:10 AM, David Mrva <davidm at cantabresearch.com> wrote:
Hello,
I started using StarCluster with Amazon spot instances. I expect that the
workload of my application will fluctuate a lot and I aim to minimise
the cost of running the spot instances. StarCluster's loadbalancer seems
to go some way in this direction. It adds more spot instances when the
SGE queue is busy and removes unused nodes. The removal of the nodes
interacts with SGE's strategy for assigning jobs to queues. SGE chooses
the node with the lowest load average to assign a job to. If there are
more nodes in the cluster than are necessary to execute the jobs, this
strategy will result in spreading the jobs that need to be executed
across as many nodes as possible. This behaviour reduces the chances of
some of the nodes staying unused and potentially being removed by the
load balancer.
I'd like to configure StarCluster in such a way that SGE jobs go to node
A for as long as there are slots available on it and they go to node B
only if there is no vacant slot on node A. For example, on a cluster
with nodes A and B and 8 slots on each node if there are 4 slots being
used on node A and 4 more jobs arrive to SGE, I'd like all 4 of these
new jobs to go to node A. Using the "orte" parallel environment with
"fill_up" allocation strategy does not achieve this. For the above
example, using the "fill_up" allocation strategy will pick node B
(lowest load average node) and assign all 4 new jobs to it, resulting in
nodes A and B running 4 jobs each instead of A running 8 jobs and B none.
How can I use StarCluster's built-in load balancer to minimise the cost
of running spot instances by minimising the number of unused CPUs in the
way described above?
Many thanks,
David