Group by multiple fields and output tuple
Asked Answered
C

1

9

I have a feed in the following format:

Hour Key  ID  Value
   1  K1 001      3
   1  K1 002      2
   2  K1 005      4
   1  K2 002      1
   2  K2 003      5
   2  K2 004      6

and I want to group the feed by (Hour, Key) then sum the Value but keep ID as a tuple:

({1, K1}, {001, 002}, 5)
({2, K1}, {005}, 4)
({1, K2}, {002}, 1)
({2, K2}, {003, 004}, 11)

I know how to use FLATTEN to generate the sum of the Value but don't know how to output ID as a tuple. This is what I have so far:

A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
C = FOREACH B GENERATE
    FLATTEN(group) AS (Hour, Key),
    SUM(A.Value) AS Value
;

Will you explain how to do this? Appreciate it!

Canoewood answered 19/6, 2013 at 8:33 Comment(2)
Can you provide the final state / output you are looking for - not sure i understand what you mean by "output ID as a tuple"Reannareap
@Chris, the final state would be somehow like the piece I have above: (Hour, Key) as the key, all its associated ID(s) as in a tuple (like {001, 002}, etc.) plus the sum of Value. Please let me know if it makes sense. Thanks!Canoewood
M
14

You just need to use the bag projection operator, .. This will create a new bag where the tuples have just the element(s) you specify. In your case, use A.ID. In fact, you are already using this operator to provide the input to SUM -- the input to sum is a bag of single-element tuples, which you create by projecting the Value field.

A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
C = FOREACH B GENERATE
    FLATTEN(group) AS (Hour, Key),
    A.ID,
    SUM(A.Value) AS Value
;
Malady answered 19/6, 2013 at 15:40 Comment(1)
Thank you very much, Winnie. This is great. Guess I still need a better grasp how bag works in Pig. Thanks!Canoewood

© 2022 - 2024 — McMap. All rights reserved.