I was under the impression that combiners are just like reducers that act on the local map task, That is it aggregates the results of individual Map task in order to reduce the network bandwidth for output transfer.
And from reading Hadoop- The definitive guide 3rd edition
, my understanding seems correct.
From chapter 2 (page 34)
Combiner Functions Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner func- tion’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
So I tried the following on the wordcount problem:
job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);
job.setNumReduceTasks(0);
Here is the counters:
14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient: File System Counters
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient: FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient: Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient: Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient: Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient: Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient: Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient: Total committed heap usage (bytes)=85000192
and here is part-m-00000
:
hello 1
world 1
Hadoop 1
programming 1
mapreduce 1
wordcount 1
lets 1
see 1
if 1
this 1
works 1
12345678 1
hello 1
world 1
mapreduce 1
wordcount 1
so clearly no combiner is applied. I understand that Hadoop does not guarantee if a combiner will be called at all. But when I turn on the reduce phase, the combiner gets called.
WHY IS THIS BEHAVIOR?
Now when I read chapter 6 (page 208) on how MapReduce works
. I see this paragraph described in the Reduce side
.
The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
My inferences from this paragraph are : 1) Combiner is ALSO run during the reduce phase.