How to perform a DISTINCT in Pig Latin on a subset of columns?

Asked 25/9, 2013 at 22:39 Answered 7/4, 2020 at 1:39

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).

It is simple to perform a DISTINCT operation on all of the columns:

A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;

Lets say that I am interested in performing the distinct across a1, a2, and a3. Can anyone provide an example showing how to perform this operation with a nested foreach as suggested in the documentation?

Here's an example of input and expected output:

A = LOAD 'data' AS(a1,a2,a3,a4);
DUMP A;

(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)

-- insert DISTINCT operation on a1,a2,a3 here:
-- ...

DUMP A_unique;

(1 2 3 4)
(1 2 4 4)

Incondite answered 25/9, 2013 at 22:39 Comment(1)

Could you provide example input and expected output? – Mandler 25/9, 2013 at 22:56

Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:

A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };

Theiss answered 25/9, 2013 at 23:34 Comment(2)

Very nice example! Thanks for the help. This is really similar to my answer using distinct, but is much more concise. – Incondite 25/9, 2013 at 23:47

how can we enable parallel clause on it. I tried using DEFAULT_PARALLEL 512 in my script but that doesn't work and #of reducers are not limited to 512. Any recommendations ? – Salivation 23/6, 2017 at 21:25

The accepted answer is one great solution but, in case you want to reorder the fields in the output (something I had to do recently) this might not work. Here's an alternative:

A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
OUTPUT = FOREACH GP GENERATE 
    group.f1, group.f2, f4, f5, group.f3 ;

When you group on certain fields, the selection would have unique values for the group in a each tuple.

Kassandra answered 13/2, 2015 at 10:45 Comment(0)

For your specified input/output, the following works. You might update your test vectors to clarify what you need that is different than this.

A_unique = DISTINCT A;

Cycle answered 18/1, 2014 at 13:41 Comment(1)

Yeah that might help :). I updated the expected output in the original question. – Incondite 20/1, 2014 at 23:14

Here are 2 possible solutions, are there any other good approaches?

Solution 1 (using LIMIT 1):

A = LOAD 'test_data' AS (a1,a2,a3,a4);

-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4

-- Group by the combined column
grouped_by_a4 = GROUP A2 BY combined;

grouped_and_distinct = FOREACH grouped_by_a4 {
        single = LIMIT A2 1;
        GENERATE FLATTEN(single);
};

Solution 2 (using DISTINCT):

A = LOAD 'test_data' AS (a1,a2,a3,a4);

-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4

-- Group by the other columns (those I don't want the distinct applied to)
grouped_by_a4 = GROUP A2 BY a4;

-- Perform the distinct on a projection of combined and flatten 
grouped_and_distinct = FOREACH grouped_by_a4 {
        combined_unique = DISTINCT A2.combined;
        GENERATE FLATTEN(combined_unique);
};

Incondite answered 25/9, 2013 at 23:34 Comment(1)

These solutions are not equivalent. The first one will only give you two tuples, as there are only two unique combinations of a1, a2, and a3, and the value for a4 is not predictable. The second will give output consistent with your sample output. (It's not as concise as it could be, though.) – Theiss 25/9, 2013 at 23:44

unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
    limit_a = LIMIT A 1;
    GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
};

Mishmash answered 24/2, 2016 at 1:17 Comment(0)

I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:

A = LOAD 'data' AS(a1,a2,a3,a4);
interested_fields = FOREACH A GENERATE a1,a2,a3;
distinct_fields= DISTINCT interested_fields;
final_answer = FOREACH distinct_fields GENERATE FLATTEN($0);

I know it's not an example of how to perform a nested foreach as suggested in the documentation; but it's a way of doing a distinct over a subset of fields. Hope It helps to anyone who gets here just like I did.

Vince answered 7/4, 2020 at 1:39 Comment(0)

Recommended topics

Hot tags