Trying to count cumulative distinct entities using Redshift SQL
Asked Answered
F

6

8

I'm trying to get a cumulative count of distinct objects in Redshift over a time series. The straightforward thing would be to use COUNT(DISTINCT myfield) OVER (ORDER BY timefield DESC ROWS UNBOUNDED PRECEDING), but Redshift gives a "Window definition is not supported" error.

For example, the code below is trying to find the cumulative distinct users for every week from the first week to the present. However, I get the "Window function not supported" error.

SELECT user_time.weeks_ago, 
       COUNT(distinct user_time.user_id) OVER
            (ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as count
FROM   (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago,
               ev.user_id as user_id
        FROM events as ev
        WHERE ev.action='some_user_action') as user_time

The goal is to build a cumulative time series of unique users who have performed an action. Any ideas on how to do this?

Firenew answered 26/11, 2013 at 7:26 Comment(0)
P
5

Here's how to apply it to an example cited here, plus I've added another row duplicating 'table' for '2015-01-01' to demonstrate how this counts distincts.

The author of the example is wrong about the solution, but I'm just using his example.

create table public.test
(
  "date" date,
  item varchar(8),
  measure int
)

insert into public.test
    values
      ('2015-01-01', 'table',   12),
      ('2015-01-01', 'table',   120),
      ('2015-01-01', 'chair',   51),
      ('2015-01-01', 'lamp',    8),
      ('2015-01-02', 'table',   17),
      ('2015-01-02', 'chair',   72),
      ('2015-01-02', 'lamp',    23),
      ('2015-01-02', 'bed',     1),
      ('2015-01-02', 'dresser', 2),
      ('2015-01-03', 'bed',     1);

WITH x AS (
    SELECT
      *,
      DENSE_RANK()
      OVER (PARTITION BY date
        ORDER BY item) AS dense_rank
    FROM public.test
)
SELECT
  "date",
  item,
  measure,
  max(dense_rank)
  OVER (PARTITION BY "date")
FROM x
ORDER BY 1;

The CTE gets you the dense rank of each item per date, then the main query gets you the max of that dense rank per date, i.e., the distinct count of items per date.

You need the dense rank rather than straight rank to count distincts.

Pyrogallate answered 1/9, 2016 at 22:37 Comment(2)
I saw the same linked example that doesn't work. But this helped. Thanks.Hedwig
what do you do when you don't want to return every row with a select *? I have a case where I want to count distinct customers over a month interval, but when I order by customer_id in the partition the return set gives every rank value, even though I only want the maximum for the month.Purity
F
3

Figured out the answer. The trick turned out to be a set of nested subqueries, the inner one calculates the time of each user's first action. The middle subquery counts the total actions per time period, and the final outer query performs the cumulative sums over the time series:

(SELECT engaged_per_week.week as week,
       SUM(engaged_per_week.total) over (order by engaged_per_week.week DESC ROWS UNBOUNDED PRECEDING) as total
 FROM 
    -- COUNT OF FIRST TIME ENGAGEMENTS PER WEEK
    (SELECT engaged.first_week AS week,
            count(engaged.first_week) AS total
    FROM
       -- WEEK OF FIRST ENGAGEMENT FOR EACH USER
       (SELECT  MAX(FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7)) as first_week
        FROM     events ev
        WHERE    ev.name='some_user_action'
        GROUP BY ev.user_id) AS engaged

    GROUP BY week) as engaged_per_week
ORDER BY week DESC) as cumulative_engaged
Firenew answered 27/11, 2013 at 8:48 Comment(0)
B
1

You should use DENSE_RANK instead of count (distinct):

DENSE_RANK() OVER(PARTITION BY weeks_ago ORDER BY user_time.user_id)
Breaker answered 2/3, 2017 at 3:4 Comment(0)
T
1

It seems to be working when you use count distinct inside a sum like this:

 SELECT user_time.weeks_ago, 
   SUM(COUNT(distinct user_time.user_id)) OVER
        (ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as test
        FROM   (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago
            ,ev.user_id as user_id
    FROM events as ev
    WHERE ev.action='some_user_action'
    ) user_time
GROUP BY user_time.weeks_ago
Theressathereto answered 19/4, 2018 at 6:21 Comment(0)
V
1

None of the above solutions worked for me. Here is the one that did. he way to think about this problem is as follows:

  • if someone has made certain action in the first action - count them in that week
  • for any consecutive week count only additional users - those not present in the previous weeks

Therefore we just need to find the first date period when each user made an appearance, then cumulative sum of those based ordered by date and then group by date and find max(cumsum) value.

with first_date as (SELECT user_id,
                           min(ev.date) as first_entry_date
                    FROM events 
                    WHERE certain_condition
                    GROUP by 1
               ),
     ranked as (SELECT count(*) OVER (ORDER BY first_entry_date rows unbounded preceding) as counts,
                        first_entry_date 
                 FROM deduped
              )
SELECT first_entry_date as day, 
       max(counts) as users_cum_sum 
FROM ranked 
GROUP BY 1
Vashti answered 9/9, 2021 at 12:13 Comment(1)
This is the only answer that correctly keeps track of users that have entries for separate parameters such as entry date without fail.Lubricant
O
0

I'm faced the same issue but I've applied this with DENSE_RANK() and MAX() over(partition by) as below Code, hope it'll be helpful if anyone still struggling with this issue:

-- IN NZ

select 
    id,NAME,count(distinct name) OVER (
        PARTITION BY id)
        from
edw.admin.test;

/*
create table edw.admin.test 
as 
(       
select 1 as id,'Anne' as name,500.0 as amt,'iv' as IID
    union ALL
select 1,'Jeni',550.0,'is'
    union ALL
select 1,'Arna',250.0,'is'
    union ALL
select 2,'Raj',290.0,'is'
    union ALL
select 1,'Anne',350.0,'ir'
    union ALL
select 1,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir');

Output in NZ:
-------------------------
ID  NAME    COUNT
1   NULL    3
1   Anne    3
1   Anne    3
1   Arna    3
1   Jeni    3
2   Raj     1
3   NULL    0
3   NULL    0
*/


-- IN AWS RS



select id, name, max(DENSE_COUNT) over(partition by id)
from(
select 
    id,name,CASE WHEN name IS NULL THEN 0 ELSE DENSE_RANK() OVER (
        PARTITION BY id
        order by name) END AS DENSE_COUNT
        from
(       
select 1 as id,'Anne' as name,500.0 as amt,'iv' as IID
    union ALL
select 1,'Jeni',550.0,'is'
    union ALL
select 1,'Arna',250.0,'is'
    union ALL
select 2,'Raj',290.0,'is'
    union ALL
select 1,'Anne',350.0,'ir'
    union ALL
select 1,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir'));

/*
Output in RS:
-------------------------
id  name    max
1   Anne    3
1   Anne    3
1   Arna    3
1   Jeni    3
1   NULL    3
2   Raj     1
3   NULL    0
3   NULL    0
*/
Occasional answered 11/10, 2018 at 15:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.