BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)
Asked Answered
T

4

16

I'm trying to compute a 28 day moving sum in BigQuery using the LAG function.

The top answer to this question

Bigquery SQL for sliding window aggregate

from Felipe Hoffa indicates that that you can use the LAG function. An example of this would be:

SELECT
    spend + spend_lagged_1day + spend_lagged_2day + spend_lagged_3day + ... +  spend_lagged_27day as spend_28_day_sum,
    user,
    date
FROM (
  SELECT spend,
         LAG(spend, 1) OVER (PARTITION BY user ORDER BY date) spend_lagged_1day,
         LAG(spend, 2) OVER (PARTITION BY user ORDER BY date) spend_lagged_2day,
         LAG(spend, 3) OVER (PARTITION BY user ORDER BY date) spend_lagged_3day,
         ...
         LAG(spend, 28) OVER (PARTITION BY user ORDER BY date) spend_lagged_day,
         user,
         date
  FROM user_spend
)

Is there a way to do this without having to write out 28 lines of SQL!

Tovatovar answered 19/12, 2014 at 0:8 Comment(0)
D
54

The BigQuery documentation doesn't do a good job of explaining the complexity of window functions that the tool supports because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL 2003 standard for window functions, which you can find documented other places on the web, such as here.

That means you can get the effect you want with a single window function. The range is 27 because it's how many rows before the current one to include in the sum.

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW),
       user,
       date
FROM user_spend;

A RANGE bound can also be extremely useful. If your table was missing dates for some user, then 27 PRECEDING rows would go back more than 27 days, but RANGE will produce a window based on the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP and the range is specified in microseconds. I'd advise that whenever you do date math like this in BigQuery, you test it thoroughly to make sure it's giving you the expected answer.

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW),
       user,
       date
FROM user_spend;
Dugald answered 19/12, 2014 at 21:58 Comment(1)
The documentation has probably evolved since the time of the original post: ``` Tip: If you want to use a range with a date, use ORDER BY with the UNIX_DATE() function. If you want to use a range with a timestamp, use the UNIX_SECONDS(), UNIX_MILLIS(), or UNIX_MICROS() function. ``` Ref: cloud.google.com/bigquery/docs/reference/standard-sql/…Antiparallel
A
22

Bigquery: How to get a rolling time range in a window clause.....

This is an old post, but I spend a long time searching for a solution, and this post came up so maybe this will help someone.

IF your partition of your window clause does not have a record for every day, you need to use the RANGE clause to accurately get a rolling time range, (ROWS would search the number records, which would to go too far back since you don't have a record for every day in your PARTITION BY). The problem is that in Bigquery RANGE clause does not support Dates.

From BigQuery's documentation:

numeric_expression must have numeric type. DATE and TIMESTAMP are not currently supported. In addition, the numeric_expression must be a constant, non-negative integer or a parameter.

The workaround I found was to use the UNIX_DATE(date_expression) in the ORDER BY clause along with a RANGE clause:

SUM(value) OVER (PARTITION BY Column1 ORDER BY UNIX_DATE(Date) RANGE BETWEEN 5 PRECEDING AND CURRENT ROW

Anthocyanin answered 25/10, 2019 at 3:45 Comment(2)
So helpful, thanks! Didn't think about the UNIX_DATE() function to make RANGE queries on DATE fields. I wish BigQuery supported DATE natively in RANGE clauses! I'd like to go back 3 months, I have to approximate for 90 days with UNIX_RANGESubaquatic
UNIX_DATE() is way around for this in big query, thanks!Simpatico
E
3

Here is an alternate take that I found to be flexible and effective:

WITH users AS
 (SELECT 'Isabella' as user, 1 as spend, DATE(2020, 03, 28) as date
  UNION ALL SELECT 'Isabella', 2, DATE(2020, 03, 29)
  UNION ALL SELECT 'Daniel', 3, DATE(2020, 03, 24)
  UNION ALL SELECT 'Andrew', 4, DATE(2020, 03, 23)
  UNION ALL SELECT 'Daniel', 5, DATE(2020, 03, 11)
  UNION ALL SELECT 'Jose', 6, DATE(2020, 03, 17))
SELECT 
user,
max(sum(case date_diff(date(2020,04,15), date, day) between 0 and 28
        when true then spend else 0 end)) over(partition by user) as spend_28_day_sum
FROM users
group by user
+------------------------------+
| user      | spend_28_day_sum |
+------------------------------+
| Andrew    | 4                |
| Daniel    | 3                |
| Isabella  | 3                |
| Jose      | 0                |
+------------------------------+

You could change the specified date for the "window function" to current_date() or cross join with a generated date array to see how users change over time.

Epizoon answered 31/5, 2020 at 1:53 Comment(0)
A
1

I found a clean and elegant way to do this, even if you have missing data in the last days.

    SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY UNIX_DATE(date) RANGE BETWEEN 27 PRECEDING AND CURRENT ROW),
       user,
       date
FROM user_spend;

The UNIX_DATE() returns the number of days since 1970-01-01, so we can easily compute how many days back to go by using it combined with the RANGE() function.

Athens answered 9/9, 2021 at 10:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.