Finding mean using Pig or Hadoop
I have a huge text file of the form below. The data is saved in the directory data/ as data1.txt, data2.txt, and so on:

    merchant_id, user_id, amount
    1234, 9123, 299.2
    1233, 9199, 203.2
    1234, 0124, 230

and so on..

What I want to do is, for each merchant, find the average amount.

So basically, in the end I want to save the output to a file, something like:

    merchant_id, average_amount
    1234, avg_amt_1234

and so on.

How do I calculate the standard deviation as well?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

Coincide answered 26/9, 2012 at 1:56 Comment(1)
A good answer to this question would have to cover the basics of Hadoop as well as the algorithms needed to calculate the various metrics. I would re-edit this question, or perhaps ask another, and state up front what you do know about how to solve this with Hadoop or Pig, being more specific about the one thing that's holding you up.Bullring
Apache Pig is well suited to such tasks. Here is an example:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};

Pay special attention to the data type of the amnt column, as it determines which implementation of the SUM function Pig will invoke.

Pig can also do something that SQL cannot: it can attach the mean to each input row without any inner joins. That is useful when calculating z-scores using the standard deviation.

mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick: you now have access to each original amount that contributed to the group's average, sum, and count.
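For readers who want to see the same attach-the-mean-to-each-row idea outside Pig, here is a minimal plain-Python sketch; the three sample rows are made up to match the question's format:

```python
from collections import defaultdict

# Hypothetical (merchant_id, amount) rows in the question's format.
rows = [("1234", 299.2), ("1233", 203.2), ("1234", 230.0)]

# Pass 1: per-merchant sum and count (what GROUP + SUM + COUNT do in Pig).
sums, counts = defaultdict(float), defaultdict(int)
for mid, amt in rows:
    sums[mid] += amt
    counts[mid] += 1
means = {mid: sums[mid] / counts[mid] for mid in sums}

# Pass 2: emit each original row together with its group mean,
# which is what FLATTEN(inpt) achieves in the Pig script.
flattened = [(mid, amt, means[mid]) for mid, amt in rows]
```

From `flattened` you can compute a per-row quantity such as a z-score, since every row now carries its group's mean alongside the original amount.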

UPDATE 1:

Calculating variance and standard deviation:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
        sum = SUM(inpt.amnt);
        count = COUNT(inpt);
        generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum; 
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

It will use 2 MapReduce jobs. I have not figured out how to do it in one; I need to spend more time on it.
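The two-job flow above is easier to check against a small in-memory sketch. Here is the same two-pass variance and standard deviation in plain Python, on hypothetical rows matching the question's format:

```python
import math
from collections import defaultdict

# Hypothetical (merchant_id, amount) rows.
rows = [("1234", 299.2), ("1233", 203.2), ("1234", 230.0)]

# Pass 1 (first Pig job): per-group mean.
sums, counts = defaultdict(float), defaultdict(int)
for mid, amt in rows:
    sums[mid] += amt
    counts[mid] += 1
means = {m: sums[m] / counts[m] for m in sums}

# Pass 2 (second Pig job): sum of squared deviations per group,
# then population variance and standard deviation.
sq = defaultdict(float)
for mid, amt in rows:
    sq[mid] += (amt - means[mid]) ** 2
variance = {m: sq[m] / counts[m] for m in sq}
stddev = {m: math.sqrt(variance[m]) for m in variance}
```

Note this is the population standard deviation (dividing by count), matching the Pig script's `sqr_sum / count`.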

Tegucigalpa answered 27/9, 2012 at 10:8 Comment(3)
How do I calculate the standard deviation?Coincide
Are there any chances of the sum value overflowing? I am trying to implement something like this, but I am concerned about overflow.Heron
Double is 64-bit and, like any data type, it can overflow. COUNT returns a long, so it could potentially overflow. But would it happen with your data? Only if it is some sort of complex scientific calculation.Tegucigalpa
What do you want: running Java code, or the abstract map-reduce process? For the latter:

The map step:

record -> (merchant_id as key, amount as value)

The reduce step:

(merchant_id, amount) -> (merchant_id, aggregate the value you want)

In the reduce step you are provided with a stream of records sharing the same key, so you can compute almost any aggregate you want, including the average and the variance.
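The two steps above are language-agnostic; here is one possible sketch in Python (in the style of a Hadoop Streaming job, with made-up sample lines and function names):

```python
from itertools import groupby

def mapper(lines):
    # Map step: record -> (merchant_id as key, amount as value).
    for line in lines:
        merchant_id, _user_id, amount = (f.strip() for f in line.split(","))
        yield merchant_id, float(amount)

def reducer(pairs):
    # Reduce step: Hadoop delivers pairs grouped and sorted by key,
    # so each group is one merchant's full stream of amounts.
    for merchant_id, group in groupby(pairs, key=lambda kv: kv[0]):
        amounts = [amt for _, amt in group]
        yield merchant_id, sum(amounts) / len(amounts)

# Simulate the shuffle phase with a sort, as Hadoop does between map and reduce.
lines = ["1234, 9123, 299.2", "1233, 9199, 203.2", "1234, 0124, 230"]
averages = dict(reducer(sorted(mapper(lines))))
```

In a real Streaming job the mapper and reducer would read stdin and write tab-separated lines to stdout; the in-memory `sorted(...)` here just stands in for the framework's shuffle.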

Indeliberate answered 26/9, 2012 at 2:27 Comment(0)
You can calculate the variance, and hence the standard deviation, in a single pass using the formula

var = E(x^2) - (E x)^2

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    -- Pig has no ** operator; square amnt with a nested foreach instead.
    sq = foreach inpt generate amnt * amnt as amnt2;
    sum2 = SUM(sq.amnt2);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count, SQRT(sum2/count - (sum/count) * (sum/count)) as std;
};

that's it!
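The identity var = E(x^2) - (E x)^2 is easy to verify against the two-pass definition in a few lines of Python (sample amounts are made up):

```python
import math

amounts = [299.2, 230.0, 180.4]  # hypothetical sample data
n = len(amounts)

# One pass: accumulate the sum and the sum of squares simultaneously.
s = s2 = 0.0
for x in amounts:
    s += x
    s2 += x * x
var_one_pass = s2 / n - (s / n) ** 2

# Two-pass check: compute the mean first, then the squared deviations.
mean = s / n
var_two_pass = sum((x - mean) ** 2 for x in amounts) / n

std = math.sqrt(var_one_pass)
```

One caveat worth knowing: in floating point this one-pass form can lose precision when the mean is large relative to the spread of the values (the two large terms nearly cancel), so the two-pass version is numerically safer on such data.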

Maricamarice answered 1/10, 2015 at 7:15 Comment(1)
Please consider improving the format of your answer, along with some description.Aquino
I calculated all the stats (min, max, mean, and standard deviation) in a single FOREACH. FILTER_DATA contains the data set.

GROUP_SYMBOL_YEAR = GROUP FILTER_DATA BY (SYMBOL, SUBSTRING(TIMESTAMP,0,4));
STATS_ALL = FOREACH GROUP_SYMBOL_YEAR {
    MINIMUM = MIN(FILTER_DATA.CLOSE);
    MAXIMUM = MAX(FILTER_DATA.CLOSE);
    MEAN = AVG(FILTER_DATA.CLOSE);
    CNT = COUNT(FILTER_DATA.CLOSE);
    CSQ = FOREACH FILTER_DATA GENERATE CLOSE * CLOSE AS (CC:DOUBLE);
    GENERATE group.$0 AS (SYMBOL:CHARARRAY), MINIMUM AS (MIN:DOUBLE),
             MAXIMUM AS (MAX:DOUBLE), ROUND_TO(MEAN,6) AS (MEAN:DOUBLE),
             ROUND_TO(SQRT(SUM(CSQ.CC) / (CNT * 1.0) - (MEAN * MEAN)),6) AS (STDDEV:DOUBLE),
             group.$1 AS (YEAR:INT);
};
Balmacaan answered 10/11, 2020 at 9:33 Comment(0)
