I'm new to Pig and trying to implement a fairly common pattern: pairing every matching record within a set of records. To distill the question to its simplest form, and to avoid business-specific details, here's a mock problem:
Say that I have a dataset representing college classes and the students who attend them:
Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry
I want to pair up all students who took the same class. The output would include this, with the four 'English' rows exploding into six associations:
Philosophy John,David
English Mary,Sue
English Mary,Mark
English Mary,Larry
English Sue,Mark
English Sue,Larry
English Mark,Larry
This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively. I would post my attempts, but honestly I'm flailing, and I think they would confuse the reader without adding any value. Here's the boilerplate:
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...
(I believe the problem I'm facing has to do with flatten requiring multiple bags rather than multiple fields, and I can't figure out how to get my grouping to generate multiple bags...)
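My best guess at a direction, as an unverified sketch: project the student field out of the grouped bag twice, so that flatten has two bags to cross, then filter out self-pairs and mirrored duplicates. The names student_a, student_b, crossed, and pairs are placeholders I made up, and I'm assuming a plain chararray comparison is an acceptable way to keep just one ordering of each pair.

```pig
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
-- s.student projects the bag of students for each class; flattening it twice
-- in one generate should yield every (student_a, student_b) combination
crossed = foreach grp generate group as class,
                               flatten(s.student) as student_a,
                               flatten(s.student) as student_b;
-- drop self-pairs and keep only one ordering of each pair
pairs = filter crossed by student_a < student_b;
dump pairs;
```

If this is right, the four English rows should come out as six rows, though ordered alphabetically within each pair (Larry,Mark rather than Mark,Larry), and History should produce nothing because Jack has no classmate.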
Thank you for any assistance!