[StarCluster] Configuring number of map jobs per cluster node {Hadoop plugin}
Rayson Ho
raysonlogin at gmail.com
Fri Jun 1 14:45:48 EDT 2012
Hi Paul,
I've looked into our Grid Engine Hadoop integration... I think you
will need to set these parameters before the Hadoop TaskTrackers
start up.
You will need to modify conf/mapred-site.xml, and specifically the
following parameter:
mapred.tasktracker.{map|reduce}.tasks.maximum
You can reference the following page:
http://hadoop.apache.org/common/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
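For reference, a minimal mapred-site.xml fragment might look like the
following. The slot values are only illustrative for an 8-core node such
as c1.xlarge, not StarCluster defaults:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum number of map tasks run concurrently by each TaskTracker.
       8 is an example value for an 8-core node; the Hadoop default is 2. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <!-- Same kind of limit for reduce tasks. -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
```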
From what I could find online, each TaskTracker reads this limit from
its local configuration and reports it to the JobTracker, which in turn
sums the per-node limits to compute the total capacity of the cluster.
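If that is right, the JobTracker's view of total capacity is just the sum
of the per-node limits. A small sketch (node names and slot counts here
are made up, not anything StarCluster sets):

```python
# Illustrative sketch of how the JobTracker's total slot capacity would be
# derived: each TaskTracker reports its locally configured maximums, and
# the JobTracker sums them across the cluster.
tasktracker_slots = {
    "node001": {"map": 8, "reduce": 4},  # e.g. a c1.xlarge with 8 cores
    "node002": {"map": 8, "reduce": 4},
    "node003": {"map": 2, "reduce": 2},  # a smaller node left at the default
}

def cluster_capacity(slots):
    """Total concurrent map/reduce slots across all TaskTrackers."""
    return {
        "map": sum(n["map"] for n in slots.values()),
        "reduce": sum(n["reduce"] for n in slots.values()),
    }

print(cluster_capacity(tasktracker_slots))  # {'map': 18, 'reduce': 10}
```

So a node left at the default of 2 contributes only 2 map slots no matter
how many cores it has, which matches what Paul is seeing.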
I will test this parameter more later this year, but if you want to
change the behavior of your cluster now, you can easily set up a small
t1.micro cluster (don't run real compute tasks on it, as it will be
slow), change the parameter in the conf/mapred-site.xml file, and see
whether that gives you the behavior you need.
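To sketch what such an edit might look like on one node (the file
contents, paths, and values below are examples only, not StarCluster's
actual template):

```shell
# Hypothetical sketch: bump the per-TaskTracker map slot limit in a local
# copy of conf/mapred-site.xml. Values and paths are examples only.
CONF=mapred-site.xml
cat > "$CONF" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
EOF

# Raise the limit from the default 2 to 8 (e.g. one slot per c1.xlarge core).
sed -i 's|<value>2</value>|<value>8</value>|' "$CONF"
grep -A1 'tasks.maximum' "$CONF"
```

Note that the TaskTracker only reads this file at startup, so it has to
be restarted (or the change made before the daemons first come up) for
the new limit to take effect.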
Rayson
================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/
Scalable Grid Engine Support Program
http://www.scalablelogic.com/
On Fri, Jun 1, 2012 at 10:00 AM, Paul McDonagh <mcdonaghpd at gmail.com> wrote:
> Hi Rayson,
>
> Thanks for the link; I saw that too. However, clicking on the links behind
> a) mapred.tasktracker.map.tasks.maximum or
> b) mapred.tasktracker.reduce.tasks.maximum
> to find out how to use the "configuration knobs" takes you to invalid webpages.
>
> A little more info: I'm using R and hadoop together via the rmr package.
>
> I've come up short on further searches for how/where to set those parameters. There's general discussion over whether users should even be allowed to set them. It's not clear to me whether these are parameters that would be set when Hadoop is initialized or on an individual job submission, but it would seem reasonable that you could assign them on a per-compute-node basis at initialization if you had a heterogeneous cluster.
>
> In short, it appears there is currently no benefit to having nodes in a hadoop cluster with more than 2 compute cores, which means I have to instantiate very large clusters, with all the network and I/O latency that comes with the smaller EC2 node types.
>
> Any thoughts?
>
> Paul.
>
>
> On May 31, 2012, at 15:38, Rayson Ho wrote:
>
>> While integrating some user contributed Hadoop docs into the Open Grid
>> Scheduler website, I came across the
>> "mapred.tasktracker.map.tasks.maximum" parameter - a quick Google
>> search points me to:
>>
>> Q: I see a maximum of 2 maps/reduces spawned concurrently on each
>> TaskTracker, how do I increase that?
>> A: Use the configuration knob: mapred.tasktracker.map.tasks.maximum
>> and mapred.tasktracker.reduce.tasks.maximum to control the number of
>> maps/reduces spawned simultaneously on a TaskTracker. By default, it
>> is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a
>> given instance on a TaskTracker.
>>
>> Ref: http://wiki.apache.org/hadoop/FAQ#I_see_a_maximum_of_2_maps.2BAC8-reduces_spawned_concurrently_on_each_TaskTracker.2C_how_do_I_increase_that.3F
>>
>> Maybe it is just a matter of setting that parameter?
>>
>> Rayson
>>
>> ================================
>> Open Grid Scheduler / Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>>
>> On Wed, May 30, 2012 at 2:07 PM, Paul McDonagh <mcdonaghpd at gmail.com> wrote:
>>> Thanks for creating StarCluster, it's great. I'm using the Hadoop plugin with the c1.xlarge instance type, which has 20 EC2 Compute Units on 8 virtual cores.
>>>
>>> Looking at the job-tracking webpages that are set up after the cluster is initiated and running, there is a limit of 2 map jobs per cluster node. How can I alter the number of map (or reduce) jobs a particular compute node can run? I can't seem to find how to change this. I'd like to be able to use much more of the compute resources on the larger instance types.
>>>
>>> Thanks for your help.
>>> Paul McDonagh
>>>
>>>
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster at mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
>>
>> --
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>
--
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/