Can I generate nested bags using nested FOREACH statements in Pig Latin?

5

8

Let's say I have a data set of restaurant reviews:

User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5

I want to produce the average rating for each user and city, i.e. this output:

User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75

I could write a Pig script as follows:

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

PerUserCity = GROUP Data BY (user, city);

ResultSet = FOREACH PerUserCity {
    GENERATE group.user, group.city, AVG(Data.rating);
}

However, I'm curious whether I can first group at the higher level (the users) and then sub-group at the next level (the cities) afterwards, i.e.:

PerUser = GROUP Data BY user;

Intermediate = FOREACH PerUser {
    B = GROUP Data BY city;
    GENERATE group AS user, B;
}

I get:

Error during parsing.
Invalid alias: GROUP in {
  group: chararray,
  Data: {
    user: chararray,
    city: chararray,
    restaurant: chararray,
    rating: float
  }
}

Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?

My goal is to do something like:

ResultSet = FOREACH PerUser {
    FOREACH City {
        GENERATE user, city, AVG(City.rating)
    }
}
Swedenborgian answered 8/2, 2011 at 11:53 Comment(0)
9

Currently, the only operations allowed inside a FOREACH block are DISTINCT, FILTER, LIMIT, and ORDER BY. GROUP is not one of them, which is why your script fails to parse.

For now, grouping directly by (user, city), as in your first script, is the way to go.
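
If the nested structure itself is what you are after (one row per user holding a bag of per-city averages), one possible workaround is to aggregate first and then group a second time. This is only a sketch: the aliases CityAvg, Nested, avg_rating and cities are made up for illustration, and it assumes data.txt has no header row.

```pig
-- Load the reviews (same schema as in the question).
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

-- Step 1: one row per (user, city) with the average rating.
PerUserCity = GROUP Data BY (user, city);
CityAvg = FOREACH PerUserCity GENERATE
    group.user AS user,
    group.city AS city,
    AVG(Data.rating) AS avg_rating;

-- Step 2: regroup by user so each user carries a bag of (city, avg_rating).
PerUser = GROUP CityAvg BY user;
Nested = FOREACH PerUser GENERATE
    group AS user,
    CityAvg.(city, avg_rating) AS cities;

DUMP Nested;
```

For the sample data this should give Jim a bag with two tuples (New York and London) and Lisa a bag with one (London). The order of tuples inside a bag is not guaranteed.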

Roughen answered 11/2, 2011 at 18:17 Comment(0)
2

Release notes for Pig version 0.10 suggest that nested FOREACH operations are now supported.
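
As far as I can tell from the Pig documentation, what 0.10 added is a FOREACH ... GENERATE inside the nested block, which projects or transforms the inner bag. GROUP is still not among the operators allowed there. A minimal sketch, assuming Pig 0.10 or later and the data file from the question (the alias CityRatings is made up for illustration):

```pig
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

PerUser = GROUP Data BY user;

Intermediate = FOREACH PerUser {
    -- Inner FOREACH: keep only city and rating from each user's bag.
    CityRatings = FOREACH Data GENERATE city, rating;
    -- Outer GENERATE: emit one row per user with the projected bag.
    GENERATE group AS user, CityRatings;
};
```

This has one GENERATE for the inner FOREACH and one for the outer block. It reshapes each user's bag but cannot compute a per-city average on its own.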

Percolation answered 27/12, 2012 at 15:45 Comment(2)
Thank you. Why are two GENERATEs required in the inner block? - Swedenborgian
Retracting my suggestion. Release notes suggest this can be done, but I can't get it working. - Percolation
1

Try this:

Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate AVG(Records.rating) as average; 
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
Yolk answered 18/5, 2014 at 18:37 Comment(2)
You should add a description of what this code accomplishes and how it does it. - Mylander
This is wrong: group is no longer available after the first foreach. It should be:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate flatten(group), AVG(Records.rating) as average;
dump avgRating_Byuser_perCity;
- Straggle
0
rawdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
data = filter rawdata by user != 'User';

groupbyusercity = group data by (user,city);

--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}

average = foreach groupbyusercity {
    generate group.user,group.city,AVG(data.rating);
}

dump average;
Nally answered 29/4, 2014 at 7:20 Comment(0)
0

Grouping by two keys and then flattening the structure leads to the same result:

Load the data as you did:

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float);

Group by user and city

 ByUserByCity = GROUP Data BY (user, city);

Add the average rating for each group (you can add more aggregates, like COUNT(Data) as count_res), then flatten the group key back into the original user and city fields.

ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;

Results in (the last row comes from the header line of the file, whose rating does not parse as a float):

Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,
Joe answered 21/5, 2014 at 11:7 Comment(1)
I guess this does not answer the question. - Straggle
