[StarCluster] Configuring number of map jobs per cluster node {Hadoop plugin}
Justin Riley
jtriley at MIT.EDU
Wed Jun 6 11:15:25 EDT 2012
Paul, Rayson,
We can update the Hadoop plugin[1] to allow configuring the maximum
number of map/reduce tasks per TaskTracker. I've created an issue to
track this:
http://web.mit.edu/star/cluster/issues/115
[1] https://github.com/jtriley/StarCluster/blob/develop/starcluster/plugins/hadoop.py
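For reference, here's a sketch of what the knob might look like in the
StarCluster config once it lands (the MAP_TASKS_MAX/REDUCE_TASKS_MAX
option names below are hypothetical placeholders -- see the issue above
for whatever actually gets implemented):

    [plugin hadoop]
    SETUP_CLASS = starcluster.plugins.hadoop.Hadoop
    # Hypothetical options (not yet implemented; tracked in issue 115).
    # One map slot per core is a common starting point, e.g. on c1.xlarge:
    MAP_TASKS_MAX = 8
    REDUCE_TASKS_MAX = 4
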
~Justin
On Fri, Jun 01, 2012 at 02:45:48PM -0400, Rayson Ho wrote:
> Hi Paul,
>
> I've looked into our Grid Engine Hadoop integration... I think you
> will need to set the relevant parameters before the Hadoop
> TaskTrackers start up.
>
> You will need to modify conf/mapred-site.xml, specifically the
> following parameters:
>
> mapred.tasktracker.{map|reduce}.tasks.maximum
>
> You can reference the following page:
>
> http://hadoop.apache.org/common/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
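>
> For example, you would add something like the following to
> conf/mapred-site.xml on each node (the values are illustrative --
> size them to the cores on your instance type):
>
>     <property>
>       <name>mapred.tasktracker.map.tasks.maximum</name>
>       <!-- illustrative value: one map slot per core, e.g. 8 on c1.xlarge -->
>       <value>8</value>
>     </property>
>     <property>
>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>       <!-- illustrative value -->
>       <value>4</value>
>     </property>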
>
>
> I googled around, and people say that each TaskTracker gets this
> number from its local config and then reports it to the JobTracker,
> which in turn calculates the total capacity of the cluster.
>
> I will test this parameter more thoroughly later this year, but if
> you want to change the behavior of your cluster now, you can easily
> set up a small t1.micro cluster (just don't run real compute tasks
> on it, as it will be slow), change the parameter in
> conf/mapred-site.xml, and see if that gives you the behavior you need.
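>
> As a minimal sketch (the "smallcluster" template and "hadooptest"
> cluster names are placeholders, and the Hadoop conf location depends
> on your AMI):
>
>     $ starcluster start -c smallcluster hadooptest
>     $ starcluster sshmaster hadooptest
>     # on the master: edit conf/mapred-site.xml under $HADOOP_HOME,
>     # then restart MapReduce so the TaskTrackers re-read their config
>     $ $HADOOP_HOME/bin/stop-mapred.sh
>     $ $HADOOP_HOME/bin/start-mapred.sh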
>
> Rayson
>
> ================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
> On Fri, Jun 1, 2012 at 10:00 AM, Paul McDonagh <mcdonaghpd at gmail.com> wrote:
> > Hi Rayson,
> >
> > Thanks for the link; I saw that too. However, clicking the links behind
> > a) mapred.tasktracker.map.tasks.maximum or
> > b) mapred.tasktracker.reduce.tasks.maximum
> > to find out how to use the "configuration knobs" leads to broken pages.
> >
> > A little more info: I'm using R and Hadoop together via the rmr package.
> >
> > I've come up short on further searches for how or where to set those
> > parameters. There's general discussion over whether users should even
> > be allowed to set them. It's not clear to me whether they are meant to
> > be set when Hadoop is initialized or per job submission, but it would
> > seem reasonable that, on a heterogeneous cluster, you could assign
> > them per compute node when Hadoop is initialized.
> >
> > In short, it appears there is currently no benefit to having nodes
> > with more than 2 compute cores in a Hadoop cluster, which means I have
> > to instantiate very large clusters, with all the network and I/O
> > latency that comes with many smaller EC2 nodes.
> >
> > Any thoughts?
> >
> > Paul.
> >
> >
> > On May 31, 2012, at 15:38, Rayson Ho wrote:
> >
> >> While integrating some user-contributed Hadoop docs into the Open Grid
> >> Scheduler website, I came across the
> >> "mapred.tasktracker.map.tasks.maximum" parameter. A quick Google
> >> search points me to:
> >>
> >> Q: I see a maximum of 2 maps/reduces spawned concurrently on each
> >> TaskTracker, how do I increase that?
> >> A: Use the configuration knob: mapred.tasktracker.map.tasks.maximum
> >> and mapred.tasktracker.reduce.tasks.maximum to control the number of
> >> maps/reduces spawned simultaneously on a TaskTracker. By default, it
> >> is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a
> >> given instance on a TaskTracker.
> >>
> >> Ref: http://wiki.apache.org/hadoop/FAQ#I_see_a_maximum_of_2_maps.2BAC8-reduces_spawned_concurrently_on_each_TaskTracker.2C_how_do_I_increase_that.3F
> >>
> >> Maybe it is just a matter of setting that parameter?
> >>
> >> Rayson
> >>
> >> ================================
> >> Open Grid Scheduler / Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >>
> >> Scalable Grid Engine Support Program
> >> http://www.scalablelogic.com/
> >>
> >>
> >>
> >> On Wed, May 30, 2012 at 2:07 PM, Paul McDonagh <mcdonaghpd at gmail.com> wrote:
> >>> Thanks for creating StarCluster; it's great. I'm using the Hadoop
> >>> plugin on the c1.xlarge instance type, which has 20 EC2 Compute Units
> >>> (8 virtual cores).
> >>>
> >>> Looking at the job-tracker web pages that come up once the cluster
> >>> is running, I see a limit of 2 map jobs per cluster node. How can I
> >>> alter the number of map (or reduce) jobs a particular compute node
> >>> can run? I can't find how to change this, and I'd like to use much
> >>> more of the compute resources on the larger instance types.
> >>>
> >>> Thanks for your help.
> >>> Paul McDonagh
> >>>
> >>>
> >>>
> >> --
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >
>
>
>
> --
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster