hadoop: difference between 0 reducer and identity reducer?

Asked 17/5, 2012 at 5:44 Answered 11/2, 2014 at 7:6

I am just trying to confirm my understanding of the difference between 0 reducers and the identity reducer.

0 reducers means the reduce step will be skipped and the mapper output will be the final output.
Identity reducer means shuffling/sorting will still take place?

Finley answered 17/5, 2012 at 5:44 Comment(0)

Your understanding is correct. I would define it as follows: If you do not need the map results sorted, you set 0 reducers, and the job is called map-only.
If you need to sort the map results but do not need any aggregation, you choose the identity reducer.
And to complete the picture, there is a third case: if you do need aggregation, you need a real reducer.
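
To make the three cases concrete, here is a minimal sketch in plain Python that simulates them in memory. It is not Hadoop code: `run_job`, `mapper`, `identity_reducer` and `sum_reducer` are hypothetical names invented for illustration, and a single reducer is assumed.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Emit (word, 1) for every word, as a word-count mapper would.
    return [(word, 1) for word in line.split()]


def run_job(lines, reducer=None):
    # Map phase: output stays in input order.
    mapped = [pair for line in lines for pair in mapper(line)]
    if reducer is None:
        # Zero reducers: no shuffle/sort, map output is the final output.
        return mapped
    # Simulated shuffle/sort: bring equal keys together, in key order.
    mapped.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, [value for _, value in group]))
    return out


def identity_reducer(key, values):
    # Pass every record through unchanged.
    return [(key, value) for value in values]


def sum_reducer(key, values):
    # Aggregate all values for one key.
    return [(key, sum(values))]


lines = ["b a", "a c"]
map_only = run_job(lines)                    # unsorted, unaggregated
identity = run_job(lines, identity_reducer)  # sorted, unaggregated
summed = run_job(lines, sum_reducer)         # sorted and aggregated
```

With this input, `map_only` keeps the records in the order the mapper emitted them, `identity` holds the same records sorted by key, and `summed` has one record per key.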

Hembree answered 17/5, 2012 at 8:35 Comment(0)

Another use-case for using the Identity Reducer is to combine all the results into <# of reducers> output files. This can be handy if you are using Amazon Web Services to write to S3 directly, especially if the mapper output is small (e.g. a grep/search for a record), and you have a lot of mappers (e.g. 1000's).
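
A rough sketch of the file-count effect, again as a plain Python simulation and not Hadoop code. The function names are hypothetical, and `zlib.crc32` is only a deterministic stand-in for the key hash; Hadoop's default HashPartitioner uses the key's `hashCode()` modulo the number of reduce tasks.

```python
import zlib


def partition(key, num_reducers):
    # Stand-in for the partitioner: hash the key into one of X buckets.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_only_files(mapper_outputs):
    # Map-only job: every mapper writes its own output file.
    return [list(records) for records in mapper_outputs]


def identity_reduce_files(mapper_outputs, num_reducers):
    # Identity reducers: records are bucketed by key hash, sorted by key
    # within each bucket, and written as one file per reducer.
    buckets = [[] for _ in range(num_reducers)]
    for records in mapper_outputs:
        for key, value in records:
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket, key=lambda kv: kv[0]) for bucket in buckets]


# 1000 mappers, each emitting a single small record (e.g. one grep match).
mapper_outputs = [[(f"match-{i}", i)] for i in range(1000)]

many_small_files = map_only_files(mapper_outputs)
few_larger_files = identity_reduce_files(mapper_outputs, num_reducers=4)
```

Here the map-only run leaves 1000 one-record files, while the identity-reducer run leaves 4 files holding the same 1000 records, at the cost of the extra partition and sort step.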

Lonilonier answered 5/7, 2012 at 21:51 Comment(2)

Hi Dolan, could you elaborate a bit about using Identity Reducer to combine results into fewer files? I was facing similar problems -- having lots of small files generated by map-only jobs. Would it be less efficient compared to map-only jobs? – Floorboard 19/9, 2014 at 18:58

Yitong -- there is additional overhead when using the Identity Reducers over none at all because the Mapper outputs need to be hashed into X buckets and then sent to the X reducers (i.e. where X is your desired number of output files), sorted, and then saved to the output directory on HDFS/S3/etc. If you have a ton of data, then you'll need to be careful with this additional overhead because it can be significant in some cases. Alternatively, if saving to HDFS, you can use hadoop fs -cat (or hadoop fs -getmerge) to stream all the files' output into one location. I don't know if S3 has a similar stream-reading mechanism. – Lonilonier 20/9, 2014 at 11:10

The main difference between "no reducer" (mapred.reduce.tasks=0) and the "standard reducer", which is IdentityReducer (mapred.reduce.tasks=1 or more), is that with no reducer there is no partitioning and shuffling after the map stage. Therefore, in this case you get the 'pure' output from your mappers without any further processing. This is useful for development and debugging, among other things.

Caravaggio answered 11/2, 2014 at 7:6 Comment(0)

It depends on your business requirements. If you are doing a word count, you should reduce your map output to get a total result. If you just want to change the words to upper case, you don't need a reduce step.

Pika answered 17/5, 2012 at 8:17 Comment(0)
