In Apache Pig, select DISTINCT rows based on a single column

Asked 27/5, 2014 at 23:54 Answered 14/3, 2017 at 11:43

Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://example.com/beth?extra=blah
003    http://example.com/charlie

I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://example.com/charlie

The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).

The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.

For my purposes, I do not care which of the rows with ID 002 are returned.

Respond answered 27/5, 2014 at 23:54 Comment(0)

I found one way to do this, using the GROUP BY and the TOP operators:

my_table = LOAD 'my_table_file' AS (A, B);

my_table_grouped = GROUP my_table BY A;

my_table_distinct = FOREACH my_table_grouped {

    -- For each group $0 refers to the group name, (A)
    -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
    -- Here, we take only the first (top 1) row in the bag:

    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);

}

DUMP my_table_distinct;

This results in one distinct row per ID column:

(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)

I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)

Respond answered 28/5, 2014 at 0:51 Comment(0)

I have found that you can do this with a nested grouping and using LIMIT So using Arel's example:

my_table = LOAD 'my_table_file' AS (A, B);

-- Nested foreach grouping generates bags with same A,
-- limit bags to 1

my_table_distinct = FOREACH (GROUP my_table BY A) {
  result = LIMIT my_table 1;
  GENERATE FLATTEN(result);
}

DUMP my_table_distinct;

Surinam answered 16/10, 2015 at 20:21 Comment(0)

You can use

Apache DataFu™ (incubating)

FirstTupleFrom Bag

register datafu-pig-incubating-1.3.1.jar
define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
my_table_grouped = GROUP my_table BY A;
my_table_grouped_first_tuple = foreach my_table_grouped generate flatten(FirstTupleFromBag(my_table,null));

Delmore answered 14/3, 2017 at 11:43 Comment(0)

Recommended topics

Hot tags