Secondary Sort in Hadoop
I am working on a Hadoop project and, after many visits to various blogs and reading the documentation, I realized I need to use the secondary sort feature provided by the Hadoop framework.

My input format is of the form:

DESC(String) Price(Integer) and some other Text

I want the values in the reducer to be in descending order of Price. Also, while comparing DESC I have a method that takes two strings and a percentage; if the similarity between the two strings equals or exceeds the percentage, I should consider them equal.

The problem is that after the reduce job is finished, I can see some DESC values that are similar to other strings and yet they end up in different groups.

Here is the compareTo method of my composite key:

public int compareTo(VendorKey o) {
    int result = compare(token, o.token, ":") >= percentage ? 0 : 1;
    if (result == 0) {
        return pid > o.pid ? -1 : pid < o.pid ? 1 : 0;
    }
    return result;
}

and here is the compare method of my grouping comparator:

public int compare(WritableComparable a, WritableComparable b) {
    VendorKey one = (VendorKey) a;
    VendorKey two = (VendorKey) b;
    int result = ClusterUtil.compare(one.getToken(), two.getToken(), ":") >= one.getPercentage() ? 0 : 1;
    // if (result != 0)
    // return two.getToken().compareTo(one.getToken());
    return result;
}
Similarity answered 4/8, 2016 at 16:54 Comment(1)
Did fixing the compareTo method work for you? – Madian

It seems that your compareTo method violates the general contract, which requires sgn(x.compareTo(y)) to be equal to -sgn(y.compareTo(x)).
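One way to keep that contract is to compare a canonical form of the natural key instead of a pairwise similarity score, so that "similar" keys compare as equal in both directions. This is only a sketch: the canonicalize() helper and its prefix rule are assumptions for illustration, not from the question.

```java
// Sketch: comparing a canonical form of the natural key keeps compareTo
// symmetric and transitive; a pairwise similarity threshold does not.
// canonicalize() is a hypothetical stand-in for whatever normalization
// defines a cluster (here: the text before the first ':', lowercased).
public class VendorKey implements Comparable<VendorKey> {
    final String token;
    final long pid; // price

    public VendorKey(String token, long pid) {
        this.token = token;
        this.pid = pid;
    }

    String canonicalize() {
        return token.split(":", 2)[0].toLowerCase();
    }

    @Override
    public int compareTo(VendorKey o) {
        int byToken = canonicalize().compareTo(o.canonicalize());
        if (byToken != 0) {
            return byToken;
        }
        // Within a group, the higher price sorts first (descending order).
        return Long.compare(o.pid, this.pid);
    }
}
```

With this shape, x.compareTo(y) and y.compareTo(x) always have opposite signs, which the similarity-threshold version cannot guarantee.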

Madian answered 6/8, 2016 at 17:32 Comment(0)

After your custom Writable, provide a basic partitioner with a composite key and a NullWritable value. For example:

public class SecondarySortBasicPartitioner extends
    Partitioner<CompositeKeyWritable, NullWritable> {

    @Override
    public int getPartition(CompositeKeyWritable key, NullWritable value,
            int numReduceTasks) {
        // Mask the sign bit: hashCode() may be negative, and a negative
        // partition number is rejected by the framework.
        return (key.DEPT().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

After this, specify the key sort comparator and a grouping comparator that compares two CompositeKeyWritable values, and the grouping will be done.
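Putting the pieces together in the driver might look like the following. This is a sketch of the job-configuration wiring only; the comparator and partitioner class names are assumptions based on the naming above, not code from the answer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort");
        job.setJarByClass(SecondarySortDriver.class);

        // Partition on the natural key so related records reach the same reducer.
        job.setPartitionerClass(SecondarySortBasicPartitioner.class);
        // The sort comparator orders the full composite key
        // (natural key first, then price descending).
        job.setSortComparatorClass(SecondarySortKeyComparator.class);
        // The grouping comparator groups reducer input by the natural key only.
        job.setGroupingComparatorClass(SecondarySortGroupingComparator.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```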

Herriott answered 22/4, 2017 at 15:54 Comment(0)
S
0

There are three procedures during the shuffle: partitioning, sorting, and grouping. I guess that you have multiple reducers, and your similar results were processed by different reducers because they fell into different partitions.

You can set the number of reducers to 1, or set a custom Partitioner which extends org.apache.hadoop.mapreduce.Partitioner for your job.
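To see why similar (but not byte-identical) keys can land on different reducers under default hash partitioning, here is a small pure-Java sketch; the example strings are made up for illustration:

```java
// Demonstrates that two similar-but-unequal strings usually hash to
// different partitions under hashCode-based partitioning, which is why
// a custom Partitioner (or a canonicalized natural key) is needed to
// co-locate them on one reducer.
public class PartitionDemo {
    static int partitionFor(String key, int numReduceTasks) {
        // Same formula as Hadoop's default HashPartitioner:
        // mask the sign bit, then take the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Only byte-identical keys are guaranteed the same partition; "similar" keys under a fuzzy comparator generally are not, so the similarity logic must be pushed into the key itself (or the partitioner) before the shuffle.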

Supposal answered 10/1, 2018 at 9:51 Comment(0)
