Here is an example for grouping. Consider a composite key `(a, b)` and its value `v`. And let's assume that after sorting you end up, among others, with the following group of (key, value) pairs:
(a1, b11) -> v1
(a1, b12) -> v2
(a1, b13) -> v3
With the default group comparator the framework will call the reduce function 3 times, once per (key, value) pair, since all the keys are different. However, if you provide your own custom group comparator and define it so that it depends only on `a`, ignoring `b`, then the framework concludes that all the keys in this group are equal and calls the reduce function only once, with the following key and list of values:
(a1, b11) -> <v1, v2, v3>
Note that only the first composite key is used, and that b12 and b13 are "lost", i.e., not passed to the reducer.
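This behavior can be simulated outside Hadoop. The sketch below is a minimal Python illustration (not Hadoop API): the sorted (key, value) pairs are the three from above, and a group comparator that looks only at `a` is modeled by grouping consecutive pairs on the first key component.

```python
from itertools import groupby

# The sorted group of composite keys (a, b) and values from the example above.
sorted_pairs = [
    (("a1", "b11"), "v1"),
    (("a1", "b12"), "v2"),
    (("a1", "b13"), "v3"),
]

def reduce_calls(pairs):
    """Simulate reduce invocations under a group comparator that
    compares only the `a` part of the composite key."""
    calls = []
    for _, group in groupby(pairs, key=lambda kv: kv[0][0]):
        group = list(group)
        first_key = group[0][0]        # only the first composite key survives
        values = [v for _, v in group] # b12, b13 are "lost"; v2, v3 are not
        calls.append((first_key, values))
    return calls

print(reduce_calls(sorted_pairs))
# one call: (('a1', 'b11'), ['v1', 'v2', 'v3'])
```

With the default comparator every key would start its own group, which is exactly the 3-calls-vs-1-call difference described above.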
In the well-known example from the "Hadoop" book, computing the max temperature by year, `a` is the year and the `b`'s are temperatures sorted in descending order, so b11 is the desired max temperature and you don't care about the other `b`'s. The reduce function just writes the received (a1, b11) as the solution for that year.
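As a concrete illustration of that pattern, here is a small Python sketch (the years and temperatures are made-up sample data, not from the book): keys are (year, temperature) pairs already sorted by year ascending and temperature descending, and grouping on year alone means the first key in each group carries the maximum.

```python
from itertools import groupby

# Hypothetical sorted map output: (year, temperature) composite keys,
# sorted by year ascending, then temperature descending.
sorted_pairs = [
    ((1949, 111), None),
    ((1949, 78), None),
    ((1950, 22), None),
    ((1950, 0), None),
    ((1950, -11), None),
]

def max_temperature(pairs):
    """Simulate a reducer whose group comparator depends only on the year:
    one reduce call per year, and the first composite key holds the max."""
    result = {}
    for year, group in groupby(pairs, key=lambda kv: kv[0][0]):
        first_key = next(group)[0]
        result[year] = first_key[1]  # b11: the max, thanks to the sort order
    return result

print(max_temperature(sorted_pairs))  # {1949: 111, 1950: 22}
```

The reducer never iterates the values at all; the secondary sort has already done the work.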
In your example from "bigdataspeak.com" all the `b`'s are required in the reducer, but they are available as parts of the respective values (objects) `v`.
In this way, by including your value (or part of it) in the key, you can use Hadoop to sort not only your keys, but also your values.
Hope this helps.