Pig: Get top n values per group
Asked Answered
C

2

17

I have data that's already grouped and aggregated, it looks like so:

user    value      count
----    --------  ------
Alice   third      5
Alice   first      11
Alice   second     10
Alice   fourth     2
...
Bob     second     20
Bob     third      18
Bob     first      21
Bob     fourth     8
...

For every user (Alice and Bob), I want retrieve their top n values (let's say 2), sorted terms of 'count'. So the desired output I want is this:

Alice first 11
Alice second 10
Bob first 21
Bob second 20

How can I accomplish that?

Clamp answered 15/7, 2013 at 13:56 Comment(0)
H
29

One approach is

records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;

top3 = foreach grpd {
        sorted = order records by counter desc;
        top    = limit sorted 2;
        generate group, flatten(top);
};

Input is:

Alice   third   5 
Alice   first   11 
Alice   second  10
Alice   fourth  2
Bob second  20
Bob third   18
Bob first   21
Bob fourth  8

Output is:

(Alice,Alice,first,11)
(Alice,Alice,second,10
(Bob,Bob,first,21)
(Bob,Bob,second,20)
Hawes answered 15/7, 2013 at 14:33 Comment(2)
limit does not guarantee that you will get the top records.Beltz
The docs say that doing a limit after you've ordered the collection does guarantee you get the top recordsWyckoff
D
6

I have just made an observation that

top    = limit sorted 2;

top is an inbuilt function and may throw an error so the only thing which I did was changed the name of the relation in this case and instead of

generate group, flatten(top);

which was giving the output

(Alice,Alice,first,11)
(Alice,Alice,second,10
(Bob,Bob,first,21)
(Bob,Bob,second,20)

Amended that as shown below -

records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
        sorted = order records by count desc;
        top1    = limit sorted 2;
        generate flatten(top1);
};

which gave me the desired output as required by you -

(Alice,first,11)
(Alice,second,10)
(Bob,first,21)
(Bob,second,20)

Hope this helps.

Domingo answered 14/4, 2014 at 21:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.