Removing duplicates using PigLatin
I'm using PigLatin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

The script should remove the duplicates for each user and keep one of these records, something like the uniq command in Linux.

The output should be:

User1 8 NYC 
User2 4 NYC

Any suggestions?

Planetstruck answered 18/7, 2012 at 3:50 Comment(0)
F
20

For your particular example, DISTINCT will not work well because your output contains all of the input columns ($0, $1, $2). You can apply DISTINCT only to a projection that has columns ($0, $2) or ($0), and in that case you lose $1.
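To make the projection point concrete, here is a minimal sketch. The file name `users.tsv`, the tab delimiter, and the field names `user`, `score`, and `city` are assumptions for illustration; none of them appear in the question.

```pig
-- Hypothetical input: tab-separated rows such as "User1<TAB>8<TAB>NYC".
inpt = LOAD 'users.tsv' AS (user:chararray, score:int, city:chararray);

-- Project away the score ($1) so that rows differing only in score collapse.
user_city = FOREACH inpt GENERATE user, city;

-- DISTINCT compares whole tuples, so it now deduplicates on (user, city).
uniq = DISTINCT user_city;
DUMP uniq;
```

On the sample data this should merge the two User1/NYC rows into one, but User1 still appears twice (NYC and LA), which is why DISTINCT alone does not give one record per user.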

To select one record per user (any record), you could use a GROUP BY and a nested FOREACH with LIMIT. For example:

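-- Load the input; the path and loader options are elided here.
inpt = load '......' ......;
-- Group all records by the first field (the user).
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
      -- Keep a single, arbitrary record from each user's bag.
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};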

This approach gives you records that are unique on a subset of fields, and it also limits the number of output records per user, which you can control through the LIMIT value.
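If you need to control which record survives rather than taking an arbitrary one, a nested ORDER before the LIMIT is one option. The sketch below assumes a hypothetical tab-separated file `users.tsv` and the field names `user`, `score`, and `city`, and it assumes you want the highest score per user, which the question does not actually require.

```pig
-- Hypothetical input and schema; adjust the path and loader to your data.
inpt = LOAD 'users.tsv' AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

filtered = FOREACH user_grp {
      -- Sort each user's bag so the record to keep comes first.
      sorted = ORDER inpt BY score DESC;
      top_rec = LIMIT sorted 1;
      GENERATE FLATTEN(top_rec);
};
DUMP filtered;
```

On the sample data this would keep (User1, 9, NYC) and (User2, 4, NYC), so it differs from the output in the question, which keeps the first record seen for User1.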

Fibroma answered 19/7, 2012 at 8:30 Comment(0)
O
0

Pig provides the DISTINCT operator to select unique data. If you want to apply DISTINCT to specific fields, use DISTINCT inside a nested FOREACH block.
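As a rough sketch of what a nested DISTINCT can look like, the script below counts the distinct cities per user. The file name `users.tsv` and the field names `user`, `score`, and `city` are assumptions for illustration only.

```pig
-- Hypothetical input and schema; adjust the path and loader to your data.
inpt = LOAD 'users.tsv' AS (user:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user;

per_user = FOREACH user_grp {
      -- Project one field out of the user's bag, then deduplicate it.
      cities = inpt.city;
      uniq_cities = DISTINCT cities;
      GENERATE group AS user, COUNT(uniq_cities) AS n_cities;
};
DUMP per_user;
```

Note that the nested DISTINCT works on the projected field only, so the other columns of the original record are not carried along.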

Older answered 19/7, 2012 at 5:0 Comment(1)
Be careful when using DISTINCT. The drawback of the DISTINCT keyword is that you cannot be sure which record will be removed. Buttock
