Is the output of the map phase of a MapReduce job always sorted?

I am a bit confused with the output I get from Mapper.

For example, when I run a simple wordcount program, with this input text:

hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount

this is the output that I get:

12345678    1
Hadoop  1
hello   1
hello   1
if  1
lets    1
mapreduce   1
mapreduce   1
programming 1
see 1
this    1
wordcount   1
wordcount   1
works   1
world   1
world   1

As you can see, the output from the mapper is already sorted. I did not run a Reducer at all. But I find in a different project that the output from the mapper is not sorted, so I am not totally clear about this.
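
For reference, the mapper here is essentially the standard word count mapper, something like the sketch below (the class and field names are placeholders):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Standard word count mapper: emits (word, 1) for every token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // the map output key is what any later sorting is based on
        }
    }
}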

My questions are:

  1. Is the mapper's output always sorted?
  2. Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
  3. Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?
Belia answered 16/7, 2014 at 1:54 Comment(0)

Is the mapper's output always sorted?

No. It is not sorted if you use no reducer. If you do use a reducer, there is a pre-sorting step before the mapper's output is written to disk, and the data is then fully sorted in the Reduce phase. What is happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, translates into using the Identity Reducer (see this answer and its comments). The Identity Reducer just outputs its input. To verify that, check the default Reducer counters (there should be some reduce tasks, reduce input records & groups, reduce output records, and so on).
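
As a rough sketch of that check (not part of the original answer; it assumes the new org.apache.hadoop.mapreduce API and a driver that calls job.waitForCompletion(true)), you could print the built-in reduce counters from the driver; non-zero values mean a reduce phase, identity or not, actually ran:

// Inside the driver, after job.waitForCompletion(true) has returned.
// Counters and TaskCounter are from the org.apache.hadoop.mapreduce package.
Counters counters = job.getCounters();
long reduceInputRecords  = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
long reduceInputGroups   = counters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue();
long reduceOutputRecords = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
System.out.println("reduce input records = " + reduceInputRecords
        + ", reduce input groups = " + reduceInputGroups
        + ", reduce output records = " + reduceOutputRecords);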

Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?

As I explained for the previous question, if you use no reducers, the mapper does not sort the data. If you do use reducers, the data starts getting sorted in the map phase and is then merge-sorted in the reduce phase.

Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?

Again, shuffling and sorting are parts of the Reduce phase. An Identity Reducer will do what you want. If you want to output one key-value pair per key, with the value being a concatenation of the iterables, just store the iterables in memory (e.g. in a StringBuffer) and then output that concatenation as the value; a rough sketch of such a reducer is shown below. If you want the map output to go straight to the program's output, without going through a reduce phase, then set the number of reduce tasks to zero in the driver class, like this:

job.setNumReduceTasks(0);

This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.
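
For the concatenation idea mentioned above, a reducer could look roughly like the sketch below (not from the original answer; ConcatReducer is a made-up name, and a StringBuilder is used where the text says StringBuffer, which serves the same purpose here):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits one (key, concatenated-values) pair per key, so the grouped input that the
// reducer receives is persisted more or less as-is.
public class ConcatReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder concatenated = new StringBuilder();
        for (IntWritable value : values) {
            if (concatenated.length() > 0) {
                concatenated.append(',');
            }
            concatenated.append(value.get());
        }
        context.write(key, new Text(concatenated.toString()));
    }
}

If you use something like this, the driver also needs job.setMapOutputValueClass(IntWritable.class) and job.setOutputValueClass(Text.class), since the map and reduce output value types differ.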

Headsman answered 17/7, 2014 at 8:22 Comment(9)
So both the answers below are misleading. If the data is sorted in the Reducer (or the Identity Reducer in the new API), how can I just persist the data from the mapper (without it going through the Identity Reducer)?Belia
By setting the number of reduce tasks to 0 in the Driver class: job.setNumReduceTasks(0); This way, of course, your output will not be sorted. I updated my answer to include this optionHeadsman
Thanks! So one more question: are shuffling and partitioning the same? I.e., the unsorted key-value output pairs from the Mapper get sent to their respective reducer based on the hashcode of the key, and the reducer combines the values of the same key before they are passed to the reduce method. Can we say that sorting and grouping happen in the Reducer, but not shuffling/partitioning?Belia
One more question: I set job.setNumReduceTasks(0), but I also turned on job.setCombinerClass(ReducerClass). I did not see any combining happening. Any reason why that is?Belia
Shuffling is actually the process in which the reducer copies the data that it should process. Partitioning is deciding to which reducer we will send our data.Headsman
Concerning the other question, I don't know if you can have combiners without reducers. Please add these questions as new ones, or see this post: #22174288Headsman
Judging from your new question #24831428, I guess that this question is solved?Headsman
This is not totally solved. I read the Hadoop book and it says that the output is sorted on the Mapper side. I am thinking of posting another question, but I will accept this one after I post that. Thanks a lot.Belia
@brainstorm With a 3-year delay, I noticed that I hadn't notified you about my updated answer (just for the record) :) You were right.Headsman

Point 1: the output from the mapper is always sorted, but based on the key; i.e., if the map method does context.write(outKey, outValue); then the result will be sorted by outKey.
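
As an illustration of "sorted based on outKey" (not from the original answer, just a sketch): the ordering used for that sort comes from the key's comparator, so you can change it by registering your own comparator in the driver via job.setSortComparatorClass(...), e.g. to reverse the default ordering of Text keys:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts Text map output keys in descending instead of the default ascending order.
public class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // true: create Text instances for the object-level compare below
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }
}

and, in the driver: job.setSortComparatorClass(ReverseTextComparator.class);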

Whortleberry answered 16/7, 2014 at 7:4 Comment(0)

The following are some explanations for your questions:

  • Is the output from the mapper always sorted?

    Already answered by @SurJanSR

  • Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?

    In a MapReduce job, as you know, the Mapper runs on individual splits of the data, across the nodes where the data resides. The result of the Mapper is written TEMPORARILY before it is passed on to the next phase.

  • In the case of a reduce operation, the TEMPORARILY stored Mapper output is sorted and shuffled, according to what the partitioner needs, before it is moved on to the reduce operation.

  • In the case of a map-only job, as in your case, the temporarily stored Mapper output is sorted based on the key and written to the final output folder (as specified in your arguments for the job).

  • Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?

    Not sure what your requirement is. Using an IdentityReducer would just persist the output. I'm not sure if this answers your question.

Mourner answered 16/7, 2014 at 19:44 Comment(1)
I think the TEMPORARILY stored map output file is sorted already (the sorting happens in memory, and spills to disk occur if it exceeds memory)Belia

I support the answer of vefthym. Usually the Mapper output is sorted before being stored locally on the node. But when you explicitly set numReduceTasks to 0 in the job configuration, the mapper output will not be sorted and is written directly to HDFS. So we cannot say that the Mapper output is always sorted!

Hiles answered 22/5, 2016 at 0:27 Comment(0)

1. Is the mapper's output always sorted?

2. Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?

From Apache MapReduceTutorial:

( Under Mapper Section )

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.

( Under Reducer Section )

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
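
To make that concrete, a minimal map-only driver might look like the sketch below (not from the quoted documentation; MapOnlyWordCount and WordCountMapper are hypothetical names, the latter being the mapper sketched under the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: with zero reduce tasks there is no shuffle, sort or reduce phase,
// and the mapper output goes straight to the output path.
public class MapOnlyWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only wordcount");
        job.setJarByClass(MapOnlyWordCount.class);
        job.setMapperClass(WordCountMapper.class); // the (assumed) mapper from the question
        job.setNumReduceTasks(0);                  // map output is written directly, unsorted
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}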

3. Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?

I don't think so. From the Apache documentation on Reducer:

Reducer has 3 primary phases:

Shuffle:

The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

Reduce:

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.

As per the documentation, the shuffle and sort phases are driven by the framework.

If you want to persist the data, set the number of reducers to zero, which persists the map output to HDFS, but it won't sort the data.

Have a look at related SE question:

hadoop: difference between 0 reducer and identity reducer?

I did not find IdentityReducer in the Hadoop 2.x API:

identityreducer in the new Hadoop API


Sunshade answered 22/5, 2016 at 8:53 Comment(0)
