How to flatten a group into a single tuple in Pig?

From this:

(1, {(1,2), (1,3), (1,4)} )
(2, {(2,5), (2,6), (2,7)} )

...How could we generate this?

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

...And how could we generate this?

(1, 2, 3, 4)
(2, 5, 6, 7)

I know how to do this for a single row. The problem is doing it when I have to iterate over many rows and manipulate the inner bags at the same time.

Interstitial answered 31/8, 2013 at 4:48 Comment(0)

For your question, I prepared the following file:

1,2
1,3
1,4
2,5
2,6
2,7

First, I used the following script to build the relation r3 that you described in your question:

r1 = load 'test_file' using PigStorage(',') as (a:int, b:int);
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
describe r3;
-- r3: {a: int,b: {(a: int,b: int)}}
-- r3 is like (1, {(1,2), (1,3), (1,4)} )

If we want to generate the following content,

(1, 2, 3, 4)
(2, 5, 6, 7)

we can use the following script:

r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
dump r4;
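
To make the three steps in that one-liner easier to follow, here is a plain-Python model of them. It is not Pig code, and the helper names (project, bag_to_tuple, flatten_row) are made up for illustration. It assumes the bag is iterated in file order, which Pig does not guarantee.

```python
# Plain-Python model of: r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
# The function names are hypothetical; they only mimic the Pig operators.

def project(bag, index):
    """Model of b.b: keep one field from every tuple in the bag."""
    return [(t[index],) for t in bag]

def bag_to_tuple(bag):
    """Model of BagToTuple: concatenate the fields of all tuples in the bag."""
    return tuple(field for t in bag for field in t)

def flatten_row(a, tup):
    """Model of FLATTEN on a tuple: splice its fields into the output row."""
    return (a,) + tup

# r3 as described in the question; a list stands in for the (unordered) bag.
r3 = [
    (1, [(1, 2), (1, 3), (1, 4)]),
    (2, [(2, 5), (2, 6), (2, 7)]),
]

r4 = [flatten_row(a, bag_to_tuple(project(b, 1))) for a, b in r3]

for row in r4:
    print(row)
# (1, 2, 3, 4)
# (2, 5, 6, 7)
```

The projection is what drops the repeated group key: without it, every tuple would contribute both of its fields to the result.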

For the following content,

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

I could not find a built-in function that does this directly. You may need to write your own variant of BagToTuple. Here is the source code of the built-in BagToTuple: http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.pig/pig/0.11.1/org/apache/pig/builtin/BagToTuple.java#BagToTuple.getOuputTupleSize%28org.apache.pig.data.DataBag%29

Mcneill answered 31/8, 2013 at 8:6 Comment(1)
What if there is more than one field, e.g. the rows 1,2,3 1,3,4 1,4,5 2,5,6 2,6,7 2,7,8, and we want the output (1, 2,3, 3,4, 4,5) (2, 5,6, 6,7, 7,8)? – Mott
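
For the multi-field case, the same idea should carry over: drop the group key from every tuple and concatenate whatever is left. Below is a minimal plain-Python sketch of that logic, using the rows from the comment. The name flatten_group is hypothetical, and this models the transformation rather than showing Pig syntax. In Pig it would presumably correspond to projecting all non-key fields before calling BagToTuple, but I have not run that.

```python
def flatten_group(key, bag):
    """Return (key, non-key fields of tuple 1, non-key fields of tuple 2, ...)."""
    # t[1:] drops the leading group key from each tuple.
    rest = tuple(field for t in bag for field in t[1:])
    return (key,) + rest

# Grouped input; a list stands in for each bag, order assumed as in the file.
groups = {
    1: [(1, 2, 3), (1, 3, 4), (1, 4, 5)],
    2: [(2, 5, 6), (2, 6, 7), (2, 7, 8)],
}

rows = [flatten_group(k, groups[k]) for k in sorted(groups)]

for row in rows:
    print(row)
# (1, 2, 3, 3, 4, 4, 5)
# (2, 5, 6, 6, 7, 7, 8)
```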

To obtain:

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

you can do this:

r4 = foreach r3 {
    Tmp=foreach $1 generate (a,b);
    generate FLATTEN(BagToTuple(Tmp));
};
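
The reason the extra nested foreach matters is that BagToTuple removes one level of nesting: it puts every field of every tuple at the top level. Wrapping each tuple as a single field first means that field survives as a unit. Here is a plain-Python model of the difference; bag_to_tuple is a hypothetical stand-in for Pig's BagToTuple, not its real implementation, and the list order is assumed.

```python
def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one tuple."""
    return tuple(field for t in bag for field in t)

bag = [(1, 2), (1, 3), (1, 4)]  # a list stands in for the bag

# BagToTuple(b): every field of every tuple ends up at the top level.
flat = bag_to_tuple(bag)

# Tmp = foreach $1 generate (a,b): each tuple now has ONE field, itself a tuple.
tmp = [((a, b),) for a, b in bag]

# BagToTuple(Tmp): the single field of each tuple is kept as a unit.
nested = bag_to_tuple(tmp)

print(flat)    # (1, 2, 1, 3, 1, 4)
print(nested)  # ((1, 2), (1, 3), (1, 4))
```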
Zulmazulu answered 24/4, 2014 at 9:52 Comment(1)
Amazing solution! Why doesn't r4 = foreach r3 generate BagToTuple(b) work? It gives me ((1,4,1,3,1,2)) ((2,7,2,6,2,5)), which doesn't look right. – Orfield

There is no builtin way to convert a bag to a tuple. This is because bags are unordered collections of tuples, so Pig doesn't know what order the tuples should be in when the bag is converted into a tuple. This means that you'll have to write a UDF to do this.

I'm not sure how you are creating the (1, 2, 3, 4) tuple, but this is another good candidate for a UDF, even though you could create that schema with just the BagToTuple UDF.

NOTE: You probably shouldn't be turning anything into a tuple unless you know exactly how many fields there are.

myudfs.py

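#!/usr/bin/python

# Jython UDFs for Pig. The outputSchema decorator is provided by Pig when the
# file is registered with USING jython, so it is not imported here.

# The schema hard-codes three two-field tuples, so it only fits bags of three.
@outputSchema('T:(T1:(a1:chararray, a2:chararray), T2:(b1:chararray, b2:chararray), T3:(c1:chararray, c2:chararray))')
def BagToTuple(B):
    # The bag arrives as a list of tuples; wrap them in one outer tuple.
    return tuple(B)

# No output schema is declared here, so Pig treats the result as a bytearray.
def generateContent(B):
    # Group key from the first tuple, then the second field of every tuple.
    foo = [B[0][0]] + [ t[1] for t in B ]
    return tuple(foo)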

myscript.pig

REGISTER 'myudfs.py' USING jython AS myudfs ; 

-- A is (1, {(1,2), (1,3), (1,4)} ) 
-- The schema is (I:int, B:{T:(I1:int, I2:int)})

B = FOREACH A GENERATE myudfs.BagToTuple(B) ;
C = FOREACH A GENERATE myudfs.generateContent(B) ;
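
Since the bag has no defined order, the UDFs above return fields in whatever order Pig hands the tuples over. If a stable order matters, one option is to sort inside the UDF. The following is a minimal sketch of such a variant, written as plain Python so it can be tried outside Pig. The name generate_content_sorted is hypothetical, and registering it as a Jython UDF would additionally need an @outputSchema decorator, which is left out here.

```python
def generate_content_sorted(bag):
    """Return (key, v1, v2, ...) with the values in ascending order.

    `bag` is any iterable of (key, value) tuples that share the same key.
    """
    rows = sorted(bag)   # impose an order; the bag itself has none
    if not rows:
        return ()        # empty bag: nothing to emit
    key = rows[0][0]
    return (key,) + tuple(value for _, value in rows)

print(generate_content_sorted([(1, 4), (1, 2), (1, 3)]))
# (1, 2, 3, 4)
```

The empty-bag check also avoids the IndexError that B[0][0] would raise in generateContent if a group ever arrived with no tuples.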
Heather answered 31/8, 2013 at 13:29 Comment(0)
