Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://example.com/beth?extra=blah
003 http://example.com/charlie
I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID
should return something like the following:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://example.com/charlie
The Pig GROUP BY
operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
The Pig DISTINCT
operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
For my purposes, I do not care which of the rows with ID 002
are returned.