Here is an example for grouping. Consider a composite key `(a, b)` and its value `v`. And let's assume that after sorting you end up, among others, with the following group of (key, value) pairs:
(a1, b11) -> v1
(a1, b12) -> v2
(a1, b13) -> v3
With the default group comparator the framework will call the reduce function 3 times, once per (key, value) pair, since all the keys are different. However, if you provide your own custom group comparator and define it so that it depends only on `a`, ignoring `b`, then the framework concludes that all the keys in this group are equal and calls the reduce function only once, with the following key and list of values:
(a1, b11) -> <v1, v2, v3>
Note that only the first composite key is used, and that b12 and b13 are "lost", i.e., not passed to the reducer.
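This behavior can be simulated outside Hadoop. The sketch below is a minimal Python illustration (not Hadoop API): the sorted (key, value) pairs are the three from above, and a group comparator that looks only at `a` is modeled by grouping consecutive pairs on the first key component.

```python
from itertools import groupby

# The sorted group of composite keys (a, b) and values from the example above.
sorted_pairs = [
    (("a1", "b11"), "v1"),
    (("a1", "b12"), "v2"),
    (("a1", "b13"), "v3"),
]

def reduce_calls(pairs):
    """Simulate reduce invocations under a group comparator that
    compares only the `a` part of the composite key."""
    calls = []
    for _, group in groupby(pairs, key=lambda kv: kv[0][0]):
        group = list(group)
        first_key = group[0][0]        # only the first composite key survives
        values = [v for _, v in group] # b12, b13 are "lost"; v2, v3 are not
        calls.append((first_key, values))
    return calls

print(reduce_calls(sorted_pairs))
# one call: (('a1', 'b11'), ['v1', 'v2', 'v3'])
```

With the default comparator every key would start its own group, which is exactly the 3-calls-vs-1-call difference described above.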
In the well-known example from the "Hadoop" book, computing the max temperature by year, `a` is the year and the `b`'s are temperatures sorted in descending order, so b11 is the desired max temperature and you don't care about the other `b`'s. The reduce function just writes the received (a1, b11) as the solution for that year.
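As a concrete illustration of that pattern, here is a small Python sketch (the years and temperatures are made-up sample data, not from the book): keys are (year, temperature) pairs already sorted by year ascending and temperature descending, and grouping on year alone means the first key in each group carries the maximum.

```python
from itertools import groupby

# Hypothetical sorted map output: (year, temperature) composite keys,
# sorted by year ascending, then temperature descending.
sorted_pairs = [
    ((1949, 111), None),
    ((1949, 78), None),
    ((1950, 22), None),
    ((1950, 0), None),
    ((1950, -11), None),
]

def max_temperature(pairs):
    """Simulate a reducer whose group comparator depends only on the year:
    one reduce call per year, and the first composite key holds the max."""
    result = {}
    for year, group in groupby(pairs, key=lambda kv: kv[0][0]):
        first_key = next(group)[0]
        result[year] = first_key[1]  # b11: the max, thanks to the sort order
    return result

print(max_temperature(sorted_pairs))  # {1949: 111, 1950: 22}
```

The reducer never iterates the values at all; the secondary sort has already done the work.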
In your example from "bigdataspeak.com" all the `b`'s are required in the reducer, but they are available as parts of the respective values (objects) `v`.
In this way, by including your value (or part of it) in the key, you can use Hadoop to sort not only your keys, but also your values.
Hope this helps.