PIG how to count a number of rows in alias

E

7

56

I did something like this to count the number of rows in an alias in PIG:

logs = LOAD 'log'
logs_w_one = foreach logs generate 1 as one;
logs_group = group logs_w_one all;
logs_count = foreach logs_group generate SUM(logs_w_one.one);
dump logs_count;

This seems to be too inefficient. Please enlighten me if there is a better way!

Etoile answered 28/3, 2012 at 3:29 Comment(0)

G

101

COUNT is part of pig see the manual

LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);

Glorianna answered 28/3, 2012 at 5:2 Comment(4)

Is it possible to count the number of rows in an alias without first grouping? – Wilkie 13/1, 2014 at 19:57

Same question. I read that GROUP x ALL will force a serialization in the pipeline with consequent possible slow down. Is that correct? – Whitesell 14/5, 2014 at 18:29

You have to group before counting. According to pig.apache.org/docs/r0.15.0/func.html#count: "COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts." – Icelander 29/9, 2015 at 10:50

Does this have any sort of problem if LOGS is empty (zero rows)? – Culmiferous 23/9, 2019 at 22:18

S

34

Arnon Rotem-Gal-Oz already answered this question a while ago, but I thought some may like this slightly more concise version.

LOGS = LOAD 'log';
LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);

Seclude answered 27/3, 2014 at 21:21 Comment(0)

A

32

Be careful, with COUNT your first item in the bag must not be null. Else you can use the function COUNT_STAR to count all rows.

Affricate answered 28/3, 2012 at 14:20 Comment(0)

O

10

Basic counting is done as was stated in other answers, and in the pig documentation:

logs = LOAD 'log';
all_logs_in_a_bag = GROUP logs ALL;
log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
dump log_count

You are right that counting is inefficient, even when using pig's builtin COUNT because this will use one reducer. However, I had a revelation today that one of the ways to speed it up would be to reduce the RAM utilization of the relation we're counting.

In other words, when counting a relation, we don't actually care about the data itself so let's use as little RAM as possible. You were on the right track with your first iteration of the count script.

logs = LOAD 'log'
ones = FOREACH logs GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count

This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we don't need to sum anything.

This also doesn't have the same problem as other solutions where null values would impact the count. This will count all the rows, regardless of if the first column is null or not.

Observer answered 13/1, 2016 at 0:24 Comment(0)

L

5

USE COUNT_STAR

LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);

Langobard answered 27/2, 2016 at 8:19 Comment(0)

B

3

Here is a version with optimization. All the solutions above would require pig to read and write full tuple when counting, this script below just write '1'-s

DEFINE row_count(inBag, name) RETURNS result {
    X = FOREACH $inBag generate 1;
    $result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
};

The use it like

xxx = row_count(rows, 'rows_count');

Blumenfeld answered 13/8, 2015 at 20:8 Comment(0)

D

0

What you want is to count all the lines in a relation (dataset in Pig Latin)

This is very easy following the next steps:

logs = LOAD 'log'; --relation called logs, using PigStorage with tab as field delimiter
logs_grouped = GROUP logs ALL;--gives a relation with one row with logs as a bag
number = FOREACH LOGS_GROUP GENERATE COUNT_STAR(logs);--show me the number

I have to say it is important Kevin's point as using COUNT instead of COUNT_STAR we would have only the number of lines which first field is not null.

Also I like Jerome's one line syntax it is more concise but in order to be didactic I prefer to divide it in two and add some comment.

In general I prefer:

numerito = FOREACH (GROUP CARGADOS3 ALL) GENERATE COUNT_STAR(CARGADOS3);

over

name = GROUP CARGADOS3 ALL
number = FOREACH name GENERATE COUNT_STAR(CARGADOS3);

Decent answered 26/2, 2016 at 9:46 Comment(0)

Recommended topics

Hot tags