Counting elements for each group using Pig

Asked 29/7, 2014 at 9:44 Answered 31/7, 2022 at 0:29

I'm trying to group and count the frequency of terms for each group in PigLatin, but I'm having some troubles to figure it out how to do it.

I have a collection of objects with the following schema:

{cluster_id: bytearray,terms: chararray}

And here is some samples

(10, smerter)
(10, graviditeten)
(10, smerter)
(10, smerter)
(10, udemærket)    
(20, eis feuer)
(20, herunterladen schau)
(20, download gratis)
(20, download gratis) 
(30, anschauen kinofilm)
(30, kauf rechnung)
(30, kauf rechnung)
(30, versandkostenfreie lieferung)
(30, kostenlose)
(30, kostenlose)
(30, kostenlose)

the result I m trying to get is something like this

(10, smerter, 3)
(10, graviditeten, 2)
(10, udemærket, 1)
(20, download gratis, 2)
(20, eis feuer, 1)
(20, herunterladen schau, 1)    
(30, kostenlose, 3)
(30, kauf rechnung, 2)
(30, anschauen kinofilm, 1)
(30, versandkostenfreie lieferung, 1)

What would be the best way to do that? The following code groups by id and count the terms, but I wanted to count the terms for each group.

by_clusters = GROUP sample_data by cluster_id;
by_clusters_terms_count = FOREACH by_clusters GENERATE group as cluster_id, COUNT($1);

I make the grouping like this I end up with an object with the following schema

by_clusters: {group: bytearray,sample_data: {(cluster_id: bytearray,terms: chararray)}}

Now, I get to the point to actually count the terms inside the 'sample_data' tuple. I'm thinking about nested foreach, but I still didn't get it how could I apply it in this case. The code would be something like the following:

result = FOREACH by_clusters {

--count terms here, I don't know how

-- compiler gives me an error here
c = GROUP $1 BY terms; -- 
d = FOREACH c GENERATE COUNT(b), group;

GENERATE cluster_id, d;
}

Error I get:

ERROR 1200: Syntax error, unexpected symbol at or near '$1

Finally, I think I'm close, but I'm unable to solve it. I don't believe I'll have to write an UDF in this case.

Flowery answered 29/7, 2014 at 9:44 Comment(0)

I think that what you want to do is simply group by cluster_id and terms.

You were very close to the result with you first try, just add terms to your group :

by_clusters = GROUP sample_data by (cluster_id, terms);
by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group) as (cluster_id, terms), COUNT($1);

I hope I understood well what you want!

Fistula answered 29/7, 2014 at 12:22 Comment(2)

Question: why does COUNT($1) work as opposed to COUNT(terms)? E.g., does COUNT($1) mean "count the number of instances of each unique pair (cluster_id, terms)"? – Kohlrabi 8/6, 2017 at 19:24

Can someone help me understand how the lowercase "group" in "FLATTEN(group)" is used? I can't seem to find a reference.. all results are pointing to GROUP .. BY :( – Ebonee 5/9, 2020 at 1:39

You can replace COUNT($1) with COUNT(sample_data) as well.

   by_clusters = GROUP sample_data by (cluster_id, terms);
    by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group) as (cluster_id, terms), COUNT(*sample_data*);

michael-xu here is the difference :

FLATTENed result:

by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group) , COUNT($1);

result : (10, smerter,3)

Not FLATTENed result:

by_clusters_terms_count = FOREACH by_clusters GENERATE group , COUNT($1);

result : ((10, smerter),3)

The key field is named “group”. Here is the schema for by_clusters_terms_count

{group:(cluster_id: bytearray,terms: chararray),long}

Dermatitis answered 31/7, 2022 at 0:29 Comment(0)

Recommended topics

Hot tags