redshift: count distinct customers over window partition

Redshift doesn't support DISTINCT aggregates in its window functions. The AWS documentation for COUNT states this explicitly, and DISTINCT isn't supported for any of the other window functions either.

My use case: count customers over varying time intervals and traffic channels

I want monthly and YTD unique customer counts for the current year, both split by traffic channel and totalled across all channels. Since a customer can visit more than once I need to count only distinct customers, so the Redshift window aggregates won't help.

  • I can count distinct customers using count(distinct customer_id) ... group by, but that yields only one of the four results needed.
  • I don't want to run a full query for each desired count and pile them up between a bunch of UNION ALLs. I hope this is not the only solution.

This is what I would write in Oracle (note that Postgres, like Redshift, rejects DISTINCT in window functions):

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* orders_traffic_channels is a table of dated customer transactions with traffic channel and month of order */

where to_char(order_month, 'YYYY') = '2017'

How can I solve this in Redshift?

The solution needs to run on a Redshift cluster. Furthermore, this is a simplified version of the problem: the actual desired result also includes product category and customer type, which multiplies the number of partitions needed, so a stack of UNION ALL rollups is not a workable solution.
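To make the objection concrete, here is a hedged sketch of what the ruled-out UNION ALL approach looks like for just two of the four grains, reusing the table and filter from the query above (it assumes order_month is a date); every extra dimension doubles the number of branches:

/* one SELECT ... GROUP BY per grain, glued together with UNION ALL */
select order_month, traffic_channel, count(distinct customer_id) as customers
from orders_traffic_channels
where to_char(order_month, 'YYYY') = '2017'
group by order_month, traffic_channel

union all

/* null stands in for "all months" in this branch */
select cast(null as date) as order_month, traffic_channel, count(distinct customer_id)
from orders_traffic_channels
where to_char(order_month, 'YYYY') = '2017'
group by traffic_channel
;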

Bellringer asked 10/12, 2017 at 6:48 Comment(0)

A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you Mark D. Adams. There is strangely little else to be found on the web, so I'm sharing my (tested) solution.

The key insight is that dense_rank(), ordered by the item in question, gives identical items the same rank, so the highest rank is also the count of unique items. Simply swapping in the following for each partition I want produces a horrible mess, though, because each row carries its own rank rather than the count:

dense_rank() over(partition by order_month, traffic_channel order by customer_id)

Since you need the highest rank, you have to compute every ranking in a subquery and select the max of each ranking in an outer query. It's important to match the partitions in the outer query to the corresponding partitions in the subquery.

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) ranked

order by order_month, traffic_channel
;

notes

  • partitions of max() and dense_rank() must match
  • dense_rank() will rank null values (all at the same rank, the max). If you don't want to count null values you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls; see the sketch below.
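A minimal sketch of the null-safe variant for one grain, assuming the same table as above and Redshift's default of ranking nulls last in ascending order:

/* the case expression leaves null customer_ids unranked, so max() sees
   only the ranks of real customers */
select distinct
       order_month
       , max(mth_rnk) over(partition by order_month) as monthly_customers_all_channels

from (
       select order_month
              , case when customer_id is not null
                     then dense_rank() over(partition by order_month order by customer_id)
                end as mth_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) ranked
;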

Update 2022

Count distinct over window partitions in Redshift is still not implemented.

I've concluded that this workaround is reasonable, provided you take care when incorporating it into production pipelines, keeping these points in mind:

  • It creates a lot of code, which can hurt readability and maintenance.
  • Isolate this group-counting step into its own transform stage rather than mixing it with other logic in the same query.
  • Using subqueries and non-partitioned GROUP BYs with count(distinct ...) to get each of your distinct counts is even messier and less readable.

However, the better way is to use a dataframe engine that supports grouped rollups, such as Spark or Pandas. Spark rollups by group are compact and readable; the tradeoff is bringing another execution environment and language into your flows.
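For comparison, a hedged sketch of the grouped-rollup approach in Spark SQL (Spark 3 syntax, assuming the table is registered as a view; note that it returns one row per grouping set rather than extra columns on every row):

-- all four grains in one pass; count(distinct ...) is allowed per group
select order_month
       , traffic_channel
       , count(distinct customer_id) as customers

from orders_traffic_channels

where year(order_month) = 2017

group by grouping sets (
       (order_month, traffic_channel)   -- customers_by_channel_and_month
       , (traffic_channel)              -- ytd_customers_by_channel
       , (order_month)                  -- monthly_customers_all_channels
       , ()                             -- ytd_total_customers
)
;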

Bellringer answered 10/12, 2017 at 6:48 Comment(3)
Suppose I need to get the sum(sales) together with the count of customers; how would I do that? Say I use sum() over(partition by ...) to get a running sum of sales along with the dense_rank(); how do I extract the sum corresponding to the max(dense_rank)? – Rhona
FWIW, Mark D Adams' solution was incomplete (see #20211402), but I totally agree with your answer, which completes Mark's idea. – Rottenstone
@redwolf_po7 By 'running sum' I think you mean a cumulative sum as each customer is ordered into the partition; usually that is done when ordering over time, and I don't think these fit together. But if you just want sums over all customers per partition, that should work fine; give it a test. – Bellringer

While Redshift doesn't support DISTINCT aggregates in its window functions, it does have a listaggdistinct function. So you can do this:

regexp_count(
   listaggdistinct(customer_id, ',') over (partition by field2), 
   ','
) + 1

Of course, if commas occur naturally in your customer_id strings, you'll have to find a safe delimiter.
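A sketch of how this looks against the question's table, for one grain; hedged because listaggdistinct is undocumented (see the comments below) and the concatenated list must stay within the 65535-byte VARCHAR limit:

/* count distinct customers per channel by counting delimiters in the
   de-duplicated list; only viable while the list fits in a VARCHAR */
select distinct
       traffic_channel
       , regexp_count(
             listaggdistinct(customer_id, ',') over (partition by traffic_channel)
             , ','
         ) + 1 as ytd_customers_by_channel

from orders_traffic_channels

where to_char(order_month, 'YYYY') = '2017'
;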

Rottenstone answered 7/5, 2020 at 23:35 Comment(3)
I suspect that something will fail when counting 10MM records from a table. – Bellringer
If the result set is larger than the maximum VARCHAR size (64K – 1, or 65535 bytes), then LISTAGG returns the following error: Invalid operation: Result size exceeds LISTAGG limit. See doc. – Wistrup
listaggdistinct is not an official Redshift function... It works (sometimes) but it is not documented. – Snowdrop

Another approach is to use row_number(). In an inner select, compute:

row_number() over (partition by customer_id, order_month, traffic_channel) as row_n_month_channel

and in the outer select, count each customer only once per group:

sum(case when row_n_month_channel = 1 then 1 else 0 end)
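Put together, a minimal sketch for the month-by-channel grain, assuming the question's table; each additional grain needs its own row_number() with a matching partition:

/* each customer gets row_number 1 exactly once per (month, channel),
   so summing those 1s yields the distinct customer count */
select order_month
       , traffic_channel
       , sum(case when row_n_month_channel = 1 then 1 else 0 end) as customers_by_channel_and_month

from (
       select order_month
              , traffic_channel
              , row_number() over (partition by customer_id, order_month, traffic_channel) as row_n_month_channel

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) numbered

group by order_month, traffic_channel
;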
Brownley answered 7/4, 2022 at 6:16 Comment(0)
