Hadoop Pig count number
I am learning how to use Hadoop Pig now.

If I have an input file like this:

a,b,c,true
s,c,v,false
a,s,b,true
...

The last field is the one I need to count, so I want to know how many 'true' and 'false' values are in this file.

I tried:

records = LOAD 'test/input.csv' USING PigStorage(',');
boolean = foreach records generate $3;
groups = group boolean all;

Now I am stuck. I want to use:

count = foreach groups generate count('true');

to get the number of "true" values, but I always get this error:

2013-08-07 16:32:36,677 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Details at logfile: /etc/pig/pig_1375911119028.log

Can anybody tell me where the problem is?

Giana asked 7/8, 2013 at 22:45

Two things. Firstly, count should actually be COUNT. Function names in Pig are case-sensitive, and the built-in is defined as COUNT, which is why Pig could not resolve the lowercase count.

Secondly, COUNT takes a bag and returns the number of tuples in it; it does not count occurrences of a particular value. Therefore, you should group by the true/false field, then COUNT each group:

boolean = FOREACH records GENERATE $3 AS trueORfalse ;
groups = GROUP boolean BY trueORfalse ;
counts = FOREACH groups GENERATE group AS trueORfalse, COUNT(boolean) ;

So now the output of a DUMP for counts will look something like:

(true, 2)
(false, 1)
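
For reference, the grouping approach can be put together as one script. This is only a minimal sketch: it assumes the input path from the question, and the field names f1, f2, f3 and flag, as well as the aliases flags and by_flag, are made up for illustration. The alias is called flags rather than boolean because boolean is also the name of a Pig data type.

```pig
-- Load the CSV with an explicit schema (field names are assumed, not from the question).
records = LOAD 'test/input.csv' USING PigStorage(',')
          AS (f1:chararray, f2:chararray, f3:chararray, flag:chararray);

-- Keep only the last field, then build one group per distinct value.
flags   = FOREACH records GENERATE flag;
by_flag = GROUP flags BY flag;

-- Each group holds a bag named after the grouped alias; COUNT returns the bag's size.
counts  = FOREACH by_flag GENERATE group AS flag, COUNT(flags) AS n;

DUMP counts;
```

Note that COUNT skips tuples whose first field is null, so rows with an empty last field would not be counted here.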

If you want the counts of true and false in their own relations then you can FILTER the output of counts. However, it would probably be better to SPLIT boolean, then do two separate counts:

boolean = FOREACH records GENERATE $3 AS trueORfalse ;
SPLIT boolean INTO alltrue IF trueORfalse == 'true', 
                   allfalse IF trueORfalse == 'false' ;

tcount = FOREACH (GROUP alltrue ALL) GENERATE COUNT(alltrue) ;
fcount = FOREACH (GROUP allfalse ALL) GENERATE COUNT(allfalse) ;
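
The FILTER alternative mentioned above could look roughly like this. It is a sketch that assumes the counts relation from the GROUP BY version earlier in this answer, with trueORfalse as its first field; the aliases true_only and false_only are hypothetical.

```pig
-- Pick out the single row for each value from the grouped counts.
true_only  = FILTER counts BY trueORfalse == 'true';
false_only = FILTER counts BY trueORfalse == 'false';

DUMP true_only;
DUMP false_only;
```

Unlike the SPLIT version, each resulting relation still carries the true/false label alongside the count.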
Spacesuit answered 7/8, 2013 at 23:11
