Pig approach to pairing data fields in a data set

Asked 2/12, 2012 at 14:9 Answered 3/12, 2012 at 19:26

I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:

Say that I have a dataset representing college classes and students that attend them:

Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry

I want to pair every association between students that took the same class; so the output would include this, showing the explosion of the four 'English' rows into six associations:

Philosophy John,David
English    Mary,Sue
English    Mary,Mark
English    Mary,Larry
English    Sue,Mark
English    Sue,Larry
English    Mark,Larry

This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to produce the cross product. I have tried several approaches and researched this extensively. I would post my attempts, but honestly I'm flailing, and I think they would only confuse the reader. Here's the boilerplate:

s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...

(I believe the problem I'm facing has to do with flatten requiring multiple bags, not multiple fields, and I can't figure out how to get my group'ing to generate multiple bags...)

Thank you for any assistance!

Alit answered 2/12, 2012 at 14:9 Comment(0)

You can use the UnorderedPairs UDF from LinkedIn's DataFu project. Download the package and run the following (tested on Pig v0.10.0):

register '/home/user/datafu/dist/datafu-0.0.4.jar'
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));

When further flattening the result:

D = FOREACH C generate FLATTEN($0) as (class:chararray), 
      FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);

You'll end up having the desired result:

dump D;

(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)
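
For cross-checking the Pig output on a small sample, the same group-then-pair logic can be sketched outside Pig. This is only an illustration in Python, not part of the DataFu solution; the function name `pair_students` and the in-memory `rows` list are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Sample data from the question: (class, student) rows.
rows = [
    ("Philosophy", "John"), ("English", "Mary"), ("English", "Sue"),
    ("History", "Jack"), ("Philosophy", "David"), ("English", "Mark"),
    ("English", "Larry"),
]

def pair_students(rows):
    """Group students by class, then emit every unordered pair per class."""
    by_class = defaultdict(list)
    for cls, student in rows:
        by_class[cls].append(student)
    result = []
    for cls, students in by_class.items():
        # combinations(students, 2) yields each unordered pair exactly once.
        for a, b in combinations(students, 2):
            result.append((cls, a, b))
    return result

pairs = pair_students(rows)
for row in pairs:
    print(row)
```

A class with a single student (History here) yields no pairs, which matches the Pig output above.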

Unicef answered 2/12, 2012 at 23:6 Comment(3)

Thanks Lorand, this looks really good and I'm going to test it. – Alit 5/12, 2012 at 12:39

One follow-up question: I initially resorted to Pig for this problem after trying it in a conventional map-reduce job written in Java. That failed mid-way through, I believe because one of the reducers had to process about 20,000 records -- which explodes to about 1E8 pairs, and this takes so long that the Job gives up waiting to hear from that reducer. I'm about to try this with your solution but am wondering if the result will be any different given that I have cases in which 60K records must be paired. Thanks. – Alit 5/12, 2012 at 12:49

When creating the result bag, this UDF will spill every million pairs to the disk so you hopefully won't run into memory issues. – Unicef 5/12, 2012 at 13:29
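
As a rough sizing aid for the group sizes mentioned in the comments above, the number of unordered pairs in a group of n records is n(n-1)/2. The short Python sketch below just evaluates that formula; `pair_count` is a hypothetical helper name, and the figures say nothing about how long a given cluster would take.

```python
def pair_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

small = pair_count(4)         # the four 'English' rows from the question
at_20k = pair_count(20_000)   # group size that stalled the Java job
at_60k = pair_count(60_000)   # largest group size mentioned

print(small, at_20k, at_60k)
```

By this formula a 20,000-record group produces roughly 2E8 pairs and a 60,000-record group roughly 1.8E9, so output volume grows about ninefold when the group size triples.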

There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.

The first approach is a self join

s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...

The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.

The other option would be to use CROSS nested in a FOREACH after the GROUP:

Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment that I could test this right now). Perhaps someone can confirm.

B = GROUP s BY class;
C = FOREACH B {                          
   DA = CROSS s, s;                       
   GENERATE FLATTEN(DA);
}

Muster answered 2/12, 2012 at 14:59 Comment(0)

This can be done with a self-join and some simple filtering.

-- Load the same file twice so the relation can be joined with itself.
classes1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
classes2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
joined = JOIN classes1 BY class, classes2 BY class;
-- After a join, fields are disambiguated with '::', not '.'.
filtered = FILTER joined BY classes1::student < classes2::student;
pairs = FOREACH filtered GENERATE classes1::student AS student1, classes2::student AS student2;

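Note that filtering on the first student name being less than the second keeps each unordered pair exactly once and drops the rows in which a student is joined with themselves.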
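
To see why the filter is needed, the join-then-filter idea can be modelled in a few lines of Python. This is a sketch of the logic only, not of how Pig executes the join; the variable names are chosen for the example.

```python
# Sample data from the question: (class, student) rows.
rows = [
    ("Philosophy", "John"), ("English", "Mary"), ("English", "Sue"),
    ("History", "Jack"), ("Philosophy", "David"), ("English", "Mark"),
    ("English", "Larry"),
]

# Self-join on class: every row is matched with every row of the same class,
# including itself and including both orderings of each pair.
joined = [(c1, s1, s2) for c1, s1 in rows for c2, s2 in rows if c1 == c2]

# Keep only rows where the first name sorts before the second.
filtered = [(cls, a, b) for cls, a, b in joined if a < b]

print(len(joined), len(filtered))
```

The unfiltered join has n*n rows per class, and the filter reduces that to n(n-1)/2. One caveat of comparing names: two different students who share the same name would be treated as a self-pair and dropped, so a unique student id would be a safer comparison key if one is available.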

Balliett answered 3/12, 2012 at 19:26 Comment(0)
