Setting the number of map tasks and reduce tasks
Asked Answered
M

15

41

I am currently running a job in which I fixed the number of map tasks to 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong? I am using this command:

hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0

Output:

11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient:   Job Counters 
11/07/30 19:48:56 INFO mapred.JobClient:     Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient:     Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:     Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient:     Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:   FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:   Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient:     Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:     Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input records=40000000
[hcrc1425n30]s0907855: 
Morning answered 30/7, 2011 at 19:16 Comment(4)
Are you also setting mapred.map.tasks in an XML configuration and/or the main of the class you're running? If so, does changing those settings change the number of tasks being performed? It looks like you are doing this correctly, since properties specified at the command line should have the highest precedence.Ribeiro
It should work, but I am getting more map tasks than specified. And why am I not getting the total time taken to run the job?Morning
I'm not sure about the time not being printed, but a possible source of error for the number of tasks is the spacing in your -D properties. Make sure you spell them either -Dproperty=value (with no spaces) or -D property=value (with a single space after -D), or else they might be parsed wrong.Ribeiro
The number of map tasks is determined by the total size of the input and the block size, i.e. the number of splits. Even though you set the number of map tasks, that is just a hint. The number of reduce tasks can be user-defined; if it is not defined explicitly, the default number of reducers is 1. More information: search-hadoop.com/c/MapReduce:hadoop-mapreduce-client/…Monazite
G
62

The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.

In your example Hadoop has determined there are 24 input splits and will spawn 24 map tasks in total. But you can control how many map tasks can be executed in parallel by each task tracker.

Also, removing the spaces around = (so that it reads -D mapred.reduce.tasks=0) should solve the problem for the reducers.

For more information on the number of map and reduce tasks, please look at the URL below:

https://cwiki.apache.org/confluence/display/HADOOP2/HowManyMapsAndReduces
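
As a rough, hedged illustration (simplified, not the actual Hadoop source; it assumes a single uncompressed input file with a 64 MB block size and takes HDFS_BYTES_READ from the log above as an approximation of the input size), the old mapred FileInputFormat turns the mapred.map.tasks hint into a split count roughly like this:

public class SplitCountSketch {
    static long numSplits(long totalSize, long blockSize, long minSplitSize, int requestedMaps) {
        long goalSize = totalSize / Math.max(1, requestedMaps);            // derived from the mapred.map.tasks hint
        long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));
        return (totalSize + splitSize - 1) / splitSize;                    // ceil(totalSize / splitSize)
    }

    public static void main(String[] args) {
        long totalSize = 1556534680L;          // approximated from HDFS_BYTES_READ in the question's log
        long blockSize = 64L * 1024 * 1024;    // assumed 64 MB block size
        System.out.println(numSplits(totalSize, blockSize, 1, 20));        // the hint of 20 is ignored: prints 24
    }
}

Because the goal size derived from the hint (about 78 MB here) is larger than the block size, the block size wins, which happens to line up with the 24 launched map tasks in the log above.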

Goosestep answered 31/7, 2011 at 2:34 Comment(1)
There is one master node and 10 slave nodes in my Hadoop/YARN cluster. 5 input splits are created for the input sequence file. There is only one MapReduce task spawned on one slave node in YARN and not on five nodes. Any help on how to spawn it on 5 or more nodes?Anticatalyst
A
21

As Praveen mentions above, when using the basic FileInputFormat classes, the number of map tasks is just the number of input splits that constitute the data. The number of reducers is controlled by mapred.reduce.tasks specified in the way you have it: -D mapred.reduce.tasks=10 would specify 10 reducers. Note that the space after -D is required; if you omit the space, the configuration property is passed along to the relevant JVM, not to Hadoop.

Are you specifying 0 because there is no reduce work to do? In that case, if you're having trouble with the run-time parameter, you can also set the value directly in code. Given a JobConf instance job, call

job.setNumReduceTasks(0);

inside, say, your implementation of Tool.run. That should produce output directly from the mappers. If your job actually produces no output whatsoever (because you're using the framework just for side-effects like network calls or image processing, or if the results are entirely accounted for in Counter values), you can disable output by also calling

job.setOutputFormat(NullOutputFormat.class);
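
Putting those pieces together, here is a minimal, hedged sketch of such a driver using the old mapred API (the class name MapOnlyDriver, the IdentityMapper, and the argument handling are illustrative assumptions, not code from the question):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapOnlyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), MapOnlyDriver.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(IdentityMapper.class);        // stand-in; substitute your own mapper
        job.setNumReduceTasks(0);                        // map-only: output comes straight from the mappers
        // job.setOutputFormat(NullOutputFormat.class);  // uncomment if the job should write no files at all
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner is what makes -D options on the command line take effect.
        System.exit(ToolRunner.run(new MapOnlyDriver(), args));
    }
}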
Alexandros answered 6/7, 2012 at 1:12 Comment(0)
C
9

It's important to keep in mind that the MapReduce framework in Hadoop allows us only to

suggest the number of Map tasks for a job

which, as Praveen pointed out above, will correspond to the number of input splits for the task. Unlike its behavior for the number of reducers (which is directly related to the number of files output by the MapReduce job), where we can

demand that it provide n reducers.
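
For example (a hedged illustration; the jar, class, and path names are placeholders, not from the question), demanding exactly 5 reducers can be done on the command line or in the driver:

hadoop jar my-job.jar MyDriver -D mapred.reduce.tasks=5 /input /output
job.setNumReduceTasks(5);   // equivalently, in the driver code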

Cosset answered 15/4, 2013 at 21:34 Comment(1)
With which command do you demand n reducers? Can you also show a small example if possible? I need to have small files on output (just a few MB).Mojica
I
8

To explain it with an example:

Assume your Hadoop input file size is 2 GB and you set the block size to 64 MB, so 32 map tasks are set to run, with each mapper processing one 64 MB block to complete the map part of your Hadoop job.

==> The number of mappers set to run is completely dependent on 1) File Size and 2) Block Size

Assume you are running Hadoop on a cluster of size 4. Assume you set the mapred.map.tasks and mapred.reduce.tasks parameters in the conf file on each node as follows:

Node 1: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 2: mapred.map.tasks = 2 and mapred.reduce.tasks = 2
Node 3: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 4: mapred.map.tasks = 1 and mapred.reduce.tasks = 1

Assume you set the above parameters for the 4 nodes in this cluster. Notice that Node 2 is set to only 2 and 2 respectively because its processing resources might be smaller (e.g. 2 processors, 2 cores), and Node 4 is set even lower, to just 1 and 1, perhaps because its processing resources are 1 processor with 2 cores, so it can't run more than 1 mapper and 1 reducer task.

So when you run the job, Node 1, Node 2, Node 3, and Node 4 are configured to run a maximum total of (4 + 2 + 4 + 1) = 11 map tasks simultaneously out of the 32 map tasks that need to be completed by the job. After each node completes its map tasks, it will take up the remaining map tasks from the 32.

Now coming to reducers: since you set mapred.reduce.tasks = 0, we only get mapper output, in 32 files (1 file per map task), and no reducer output.

Irrelative answered 24/1, 2014 at 7:28 Comment(1)
The statement "the number of mappers set to run is completely dependent on 1) File Size and 2) Block Size" should, I think, be 1) File Size and 2) Split Size; the number of mappers equals the number of splits. Here is a reference: #30549761Somersomers
U
4

In newer versions of Hadoop there are the much more granular mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit, which let you cap how many map and reduce tasks run concurrently, irrespective of the HDFS file split size. This is helpful if you are under a constraint not to take up large resources in the cluster.
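
For illustration (a hedged example; the jar, class, and path names are placeholders), capping a single job to at most 10 simultaneously running mappers and 2 simultaneously running reducers could look like this:

hadoop jar my-job.jar MyDriver \
  -D mapreduce.job.running.map.limit=10 \
  -D mapreduce.job.running.reduce.limit=2 \
  /input /output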

JIRA

Umont answered 20/8, 2015 at 16:4 Comment(2)
My downvote was an error - I actually wanted to upvote!Detach
Those parameters control only "the maximum simultaneously-running tasks", not the total number of mappers/reducers. I am not sure how those parameters are useful; I'd rather let YARN control concurrency across the cluster. More important is the total number of mappers/reducers. Not sure if it's relevant to the question above. Thank you.Anility
L
1

From your log I understand that you have 12 input files, as there are 12 data-local maps generated. Rack-local maps are spawned for the same file if some of the blocks of that file are on some other data node. How many data nodes do you have?

Loraine answered 23/2, 2012 at 7:34 Comment(0)
P
1

In your example, the -D parts are not picked up:

hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0

They should come after the classname part like this:

hadoop jar Test_Parallel_for.jar Test_Parallel_for -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=0 Matrix/test4.txt Result 3

A space after -D is allowed though.

Also note that changing the number of mappers is probably a bad idea as other people have mentioned here.

Pupiparous answered 2/9, 2015 at 12:40 Comment(0)
B
1

The number of map tasks is directly defined by the number of chunks your input is split into. The size of a data chunk (i.e. the HDFS block size) is controllable and can be set for an individual file, a set of files, or a directory (or directories). So, setting a specific number of map tasks in a job is possible but involves setting a corresponding HDFS block size for the job's input data. mapred.map.tasks can be used for that too, but only if its provided value is greater than the number of splits for the job's input data.
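
For example (hedged: dfs.blocksize is the Hadoop 2.x property name, the older name is dfs.block.size, and the file and path names are placeholders), a single file can be uploaded with a smaller block size so that it yields more splits, and therefore more map tasks:

hdfs dfs -D dfs.blocksize=33554432 -put input.txt /user/me/input.txt   # 32 MB blocks for this file only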

Controlling the number of reducers via mapred.reduce.tasks is correct. However, setting it to zero is a rather special case: the job's output is a concatenation of the mappers' outputs (non-sorted). In Matt's answer one can see more ways to set the number of reducers.

Benzidine answered 21/2, 2017 at 13:6 Comment(0)
N
0

One way you can increase the number of mappers is to give your input in the form of split files [you can use the Linux split command]. Hadoop Streaming usually assigns as many mappers as there are input files [if there are a large number of files]; if not, it will try to split the input into equal-sized parts.
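
As a hedged example (file and directory names are placeholders), pre-splitting the input before uploading it might look like:

split -b 64m big_input.csv part_               # cut the input into 64 MB pieces
hadoop fs -mkdir /user/me/input_parts
hadoop fs -put part_* /user/me/input_parts     # each part is then handled by at least one mapper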

Novelette answered 13/8, 2013 at 17:5 Comment(0)
F
0
  • Use -D property=value rather than -D property = value (eliminate the extra whitespace). Thus -D mapred.reduce.tasks=value would work fine.

  • Setting the number of map tasks doesn't always reflect the value you have set, since it depends on the split size and the InputFormat used.

  • Setting the number of reducers will definitely override the number of reducers set in the cluster/client-side configuration.

Fishery answered 28/3, 2014 at 20:41 Comment(0)
M
0

I agree that the number of map tasks depends on the input splits, but in some scenarios I could see it behave a little differently.

Case 1: I created a simple map-only task and it created 2 duplicate output files (the data is the same). The command I gave is below:

bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv112.txt -mapper /home/amitav/workpython/readcsv.py

Case 2: So I restricted the map tasks to 1. The output came out correctly with one output file, but one reducer was also launched in the UI, although I restricted the reducer job. The command is given below.

bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.map.tasks=1 mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv115.txt -mapper /home/amitav/workpython/readcsv.py

Mcanally answered 4/10, 2014 at 22:12 Comment(0)
K
0

The first part has already been answered: "just a suggestion". The second part has also been answered: "remove the extra spaces around =". If neither of these worked, are you sure you have implemented ToolRunner?

Kablesh answered 8/4, 2015 at 18:8 Comment(0)
B
0

The number of map tasks depends on the file size. If you want n map tasks, set the split size to roughly fileSize / n, as follows:

conf.set("mapred.max.split.size", "41943040"); // maximum split file size in bytes
conf.set("mapred.min.split.size", "20971520"); // minimum split file size in bytes
Belshazzar answered 3/3, 2016 at 9:29 Comment(0)
M
-2

Folks, from this theory it seems we cannot run MapReduce jobs in parallel.

Let's say I configured a total of 5 map tasks to run on a particular node. I also want to use this in such a way that JOB1 can use 3 mappers and JOB2 can use 2 mappers, so that the jobs can run in parallel. But the above properties are ignored, so how can I execute jobs in parallel?

Monamonachal answered 29/1, 2014 at 10:29 Comment(0)
I
-2

From what I understand reading the above, it depends on the input files. If there are 100 input files, Hadoop will create 100 map tasks. However, it depends on the node configuration how many can run at one point in time. If a node is configured to run 10 map tasks, only 10 map tasks will run in parallel, picking 10 different input files out of the 100 available. Map tasks will continue to fetch more files as they finish processing each file.

Imparipinnate answered 12/2, 2014 at 11:58 Comment(0)
