I am a bit confused by the output I get from the Mapper.
For example, when I run a simple wordcount program, with this input text:
hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount
This is the output that I get:
12345678 1
Hadoop 1
hello 1
hello 1
if 1
lets 1
mapreduce 1
mapreduce 1
programming 1
see 1
this 1
wordcount 1
wordcount 1
works 1
world 1
world 1
As you can see, the mapper's output is already sorted, even though I did not run a Reducer at all.
But in a different project I find that the mapper's output is not sorted.
So I am not totally clear about this.
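To make concrete what I mean by "sorted", here is a small plain-Java sketch with no Hadoop dependency. The class MapOutputOrder and its methods are made up for illustration and are not my actual job. It tokenizes the same input, lists the keys in the order a word count mapper would emit them, and then sorts them by key. I am assuming that a plain String sort behaves like the byte-wise comparison Hadoop uses for Text keys, which should hold for ASCII input like this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a word count mapper; not Hadoop code.
public class MapOutputOrder {
    // The input text from above, one element per line.
    static final String[] INPUT = {
        "hello world",
        "Hadoop programming",
        "mapreduce wordcount",
        "lets see if this works",
        "12345678",
        "hello world",
        "mapreduce wordcount"
    };

    // Return the keys (words) in the order a mapper would emit them,
    // i.e. the order in which the tokens appear in the input.
    static List<String> map(String[] lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    keys.add(word);
                }
            }
        }
        return keys;
    }

    // Return a copy of the keys in natural String order
    // (digits, then uppercase, then lowercase for ASCII).
    static List<String> sortByKey(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> emitted = map(INPUT);
        System.out.println("Emission order:");
        for (String key : emitted) {
            System.out.println(key + " 1");
        }
        System.out.println("Sorted by key:");
        for (String key : sortByKey(emitted)) {
            System.out.println(key + " 1");
        }
    }
}
```

The "Sorted by key" list is in the same order as the output I posted above, while the emission order starts with hello, world, Hadoop. So what I see looks like the mapper's records after a sort by key, not the order in which they were emitted.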
My questions are:
- Is the mapper's output always sorted?
- Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
- Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of the values for that key. Is there a way I could persist this data?