I have a dataset, A, with the fields timestamp, visitor, and URL:
(2012-07-21T14:00:00.000Z, joe, hxxp://www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp://www.aaa.com)
I want to count the number of visits per user per URL in a time window of, say, 10 minutes, where the window rolls forward one minute at a time. The output would look like:
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)
To make the arithmetic easy, I convert the timestamp to the minute of the day:
(840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
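For reference, the conversion I have in mind looks roughly like the Python sketch below. It is only an illustration of the arithmetic, not Pig code; the function name minute_of_day is made up for this example, and it assumes every timestamp uses the exact format shown above.

```python
from datetime import datetime


def minute_of_day(ts: str) -> int:
    """Convert a timestamp like 2012-07-21T14:00:00.000Z to minutes since midnight."""
    # Assumes the exact layout used in the sample data (UTC, millisecond precision).
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.hour * 60 + parsed.minute


print(minute_of_day("2012-07-21T14:00:00.000Z"))  # 840
```

The date part is dropped, so this only works while all records fall on the same day.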
To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:
(0)
(1)
(2)
...
(1439)
Ideally, I want to do something like:
A = LOAD 'dataset1' AS (ts, visitor, uri);
B = LOAD 'dataset2' AS (minute);
FOREACH B {
    -- window is [minute, minute + 10), so a visit at the start minute is counted
    C = FILTER A BY ts >= minute AND ts < minute + 10;
    D = GROUP C BY (visitor, uri);
    FOREACH D GENERATE group, COUNT(C) AS mycnt;
}
DUMP B;
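To pin down the result I am after, here is a minimal Python sketch of the same counting. It is not a Pig solution, just a statement of the expected semantics. It assumes each window includes its start minute and excludes its end minute (which is what the expected output above implies), and it ignores windows that would start before midnight. The names rolling_counts and WINDOW are made up for this example.

```python
from collections import Counter

WINDOW = 10  # window length in minutes


def rolling_counts(visits, window=WINDOW):
    """Count visits per (window_start, visitor, uri).

    A visit at minute ts belongs to every window that starts at
    ts - window + 1 .. ts, so each record is expanded into those
    windows and then counted.
    """
    counts = Counter()
    for ts, visitor, uri in visits:
        # Clamp at 0: windows starting on the previous day are ignored here.
        for start in range(max(0, ts - window + 1), ts + 1):
            counts[(start, visitor, uri)] += 1
    return counts


visits = [
    (840, "joe", "hxxp://www.aaa.com"),
    (841, "mary", "hxxp://www.bbb.com"),
    (842, "joe", "hxxp://www.aaa.com"),
]
counts = rolling_counts(visits)
print(counts[(840, "joe", "hxxp://www.aaa.com")])  # 2
print(counts[(841, "joe", "hxxp://www.aaa.com")])  # 1
```

Note that this expands each record into the windows it belongs to rather than looping over every minute of the day, so no grouping has to happen inside a loop.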
I know "GROUP" isn't allowed inside a nested "FOREACH" block, but is there a workaround to achieve the same result?
Thanks!