Pig 0.11.1 - Count groups in a time range

Asked 1/8, 2013 at 20:38 Answered 1/8, 2013 at 21:52

Solved hadoop mapreduce range apache-pig

I have a dataset, A, that has timestamp, visitor, URL:

(2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com) 
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com) 
(2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com)

I want to measure number of visits per user per URL in a time window of say, 10 minutes, but as a rolling window that increments by the minute. Output would be:

(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)

To make the arithmetic easy, I change the timestamp to minute of the day, as:

(840, joe, hxxp://www.aaa.com) /* 840 = 14:00 hrs x 60 + 00 mins) */

To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:

(0)
(1)
(2)
.
.
.
.
(1440)

Ideally, I want to do something like:

A = load 'dataset1' AS (ts, visitor, uri)
B = load 'dataset2' as (minute)

foreach B {
C = filter A by ts > minute AND ts < minute + 10;
D = GROUP C BY (visitor, uri);
foreach D GENERATE group, count(C) as mycnt;
}

DUMP B;

I know "GROUP" isn't allowed inside a "FOREACH" loop but is there a workaround to achieve the same result?

Thanks!

Plascencia answered 1/8, 2013 at 20:38 Comment(0)

Maybe you can do something like this?

NOTE: This is dependent on the minutes you create for the logs being integers. If they are not then you can round to the nearest minute.

myudf.py

#!/usr/bin/python

@outputSchema('expanded: {(num:int)}')
def expand(start, end):
        return [ (x) for x in range(start, end) ]

myscript.pig

register 'myudf.py' using jython as myudf ;

-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.

B = FOREACH A1 GENERATE minute, 
                        FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.

-- Now we join on the minute in the second column of B with the 
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name)) 
            GENERATE FLATTEN(group), COUNT(C) as count ;

I'm a little worried about speed for larger logs, but it should work. Let me know if you need me to explain anything.

Mascara answered 1/8, 2013 at 21:52 Comment(2)

Thanks a bunch. That worked just fine. Sorry, I don't have enough reputation points to bump this solution up. I am not sure this is slow or fast since I don't have equivalent logic to compare with on my cluster :) Anyways, the admins are screaming at me yet so it is good. Btw, /return [ (x) for x in range(start + 1, end) ]/ should be /return [ (x) for x in range(start, end) ]/ since minutes of the day start from 0. – Plascencia 6/8, 2013 at 21:14

@JoeNate Good catch on the start of range. Glad to help. :) – Mascara 7/8, 2013 at 2:14

A = load 'dataSet1' as (ts, visitor, uri);
houred = FOREACH A GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, uri;
hour_frequency1 = GROUP houred BY (hour, user);

Something like this should help ExtractHour is a UDF, you could create something similar for your required Duration. Then grouping by Hour and then User Your can use the GENERATE to do a count.

http://pig.apache.org/docs/r0.7.0/tutorial.html

Courante answered 1/8, 2013 at 21:6 Comment(1)

I think that will just give me frequency per hour per user. It won't give me frequency per variable period of time (say, every 10 minutes) and isn't a rolling frequency (say, by minute). – Plascencia 1/8, 2013 at 21:24

myudf.py

myscript.pig

Recommended topics

Hot tags