Calculate count of distinct values of a field using pig script

For a file of the form

A B user1
C D user2
A D user3
A D user1

I want to count the distinct values of field 3, i.e. count(distinct(user1, user2, user3, user1)) = 3.

I am doing this with the following Pig script:

A = load 'myTestData' using PigStorage('\t') as (a1,a2,a3);

user_list = foreach A GENERATE $2;
unique_users = DISTINCT user_list;
unique_users_group = GROUP unique_users ALL;
uu_count = FOREACH unique_users_group GENERATE COUNT(unique_users);
store uu_count into 'output';

Is there a better way to get count of distinct values of a field?

Dipterous answered 15/10, 2012 at 11:25 Comment(0)

A more up-to-date way to do this:

user_data = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
users = FOREACH user_data GENERATE a3;
uniq_users = DISTINCT users;
grouped_users = GROUP uniq_users ALL;
uniq_user_count = FOREACH grouped_users GENERATE COUNT(uniq_users);
DUMP uniq_user_count;

DUMP prints the result, (3), to the console (or the job log) instead of writing it to a file.
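For the four sample rows in the question, the intermediate relations can be traced step by step. The sketch below is the same pipeline with comments added; it assumes the file is tab-separated and declares the fields as chararray, neither of which the question states, and the alias n is only illustrative.

```pig
-- Assumes 'myTestData' is tab-separated and holds the four sample rows.
user_data = LOAD 'myTestData' USING PigStorage('\t')
            AS (a1:chararray, a2:chararray, a3:chararray);

-- Project the third column only: (user1), (user2), (user3), (user1)
users = FOREACH user_data GENERATE a3;

-- Remove duplicates (output order is not guaranteed): (user1), (user2), (user3)
uniq_users = DISTINCT users;

-- Collect everything into a single group: (all, {(user1), (user2), (user3)})
grouped_users = GROUP uniq_users ALL;

-- Count the tuples in the bag: (3)
uniq_user_count = FOREACH grouped_users GENERATE COUNT(uniq_users) AS n;

DUMP uniq_user_count;
```

Note that COUNT skips tuples whose first field is null; if rows with a missing user should be counted as well, COUNT_STAR is the built-in alternative.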

Tenant answered 5/11, 2013 at 0:40 Comment(1)
What is c in GENERATE COUNT(c)? I think it should be COUNT(uniq_users). - Root

Here is a slightly more concise version that does the DISTINCT inside a nested FOREACH. You may want to check which one runs faster.

A = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
unique_users_group = GROUP A ALL;
uu_count = FOREACH unique_users_group {user = A.a3; uniq = DISTINCT user; GENERATE COUNT(uniq);};
STORE uu_count INTO 'output';
Charlenacharlene answered 15/10, 2012 at 15:53 Comment(1)
This isn't right. You need to rename A to unique_users or vice-versa and be consistent. Your use of distinct also isn't correct: It should be uniq = DISTINCT user and won't parse in its current condition. - Tenant
