Difference between "ROWS BETWEEN" and "RANGE BETWEEN" in (Presto) window function "OVER" clause
Asked Answered
T

1

16

This question is primarily about older versions of PrestoSQL, which have been resolved in the (now renamed) Trino project as of versions 346. However, Amazon's Athena project is based off of Presto versions 0.217 (Athena Engine 2) and 0.172 (Athena Engine 1), which does have the issues described below. This question was written specifically around Athena Engine 1 / PrestoSQL version 0.172

Questions (tl;dr)

  1. What is the difference between ROWS BETWEEN and RANGE BETWEEN in Presto window Functions?
    • Are these just synonyms for each other, or are there core conceptual differences?
    • If they are just synonyms, why does ROWS BETWEEN allow more options than RANGE BETWEEN?
  2. Is there a query scenario where it's possible to use the exact same parameters on ROWS BETWEEN and RANGE BETWEEN and get different results?
    • If using just unbounded/current row, is there a scenario where you'd use RANGE instead of ROWS (or vice-versa)?
  3. Since ROWS has more options, why isn't it mentioned at all in the documentation? o_O

Comments

The presto documentation is fairly quiet about even RANGE, and doesn't mention ROWS. I haven't found many discussions or examples around window functions in Presto. I'm starting to set through the Presto code-base to try to figure this out. Hopefully someone can save me from that, and we can improve the documentation together.

The Presto code has a parser and test cases for the ROWS variant, but there's no mention in the documentation of ROWS.

The test cases I found with both ROWS and RANGE don't test anything different between the two syntaxes.

They almost look like synonyms, but they do behave differently in my testing, and have different allowed parameters and validation rules.

The following examples can be run with the starburstdata/presto Docker image running Presto 0.213-e-0.1. Typically I run Presto 0.172 through Amazon Athena, and have almost always ended up using ROWS.

RANGE

RANGE seems to be limited to "UNBOUNDED" and "CURRENT ROW". The following returns an error:

range between 1 preceding and 1 following

use tpch.tiny;

select custkey, orderdate,
       array_agg(orderdate) over ( 
           partition by custkey 
           order by orderdate asc 
           range between 1 preceding and 1 following
       ) previous_orders 
from orders where custkey in (419, 320) and orderdate < date('1996-01-01')
order by custkey, orderdate asc;

ERROR: Window frame RANGE PRECEDING is only supported with UNBOUNDED

The following range syntaxes do work fine (with expected differing results). All following examples based on the above query, just changing the range

range between unbounded preceding and current row

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30]
     320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29]
     419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

range between current row and unbounded following

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1992-07-30 | [1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-07-08 | [1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-08-04 | [1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-09-18 | [1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1993-12-29 | [1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1995-01-30]

range between unbounded preceding and unbounded following

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

ROWS

The three working examples for RANGE above all work for ROWS and produce identical output.

rows between unbounded preceding and current row
rows between current row and unbounded following
rows between unbounded preceding and unbounded following

output omitted - identical to above

However, ROWS allows for far more control, since you can also do the syntax above that fails with range:

rows between 1 preceding and 1 following

 custkey | orderdate  |           previous_orders
---------+------------+--------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-07-08 | [1992-07-30, 1994-07-08, 1994-08-04]
     320 | 1994-08-04 | [1994-07-08, 1994-08-04, 1994-09-18]
     320 | 1994-09-18 | [1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1993-12-29, 1995-01-30]

rows between current row and 1 following

 custkey | orderdate  |     previous_orders
---------+------------+--------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30]
     320 | 1992-07-30 | [1992-07-30, 1994-07-08]
     320 | 1994-07-08 | [1994-07-08, 1994-08-04]
     320 | 1994-08-04 | [1994-08-04, 1994-09-18]
     320 | 1994-09-18 | [1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29]
     419 | 1993-12-29 | [1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1995-01-30]

rows between 5 preceding and 2 preceding

 custkey | orderdate  |                 previous_orders
---------+------------+--------------------------------------------------
     320 | 1992-07-10 | NULL
     320 | 1992-07-30 | NULL
     320 | 1994-07-08 | [1992-07-10]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
     419 | 1992-03-16 | NULL
     419 | 1993-12-29 | NULL
     419 | 1995-01-30 | [1992-03-16]
Turnpike answered 19/2, 2020 at 14:17 Comment(5)
range is used to define a window that covers things like the last 6 months regardless of how many rows that includes. But I don't know Presto.Elvieelvin
@a_horse_with_no_name So it sounds like other SQL engines / syntax allow for RANGE based on column values, which Presto does not appear to support. If that's the case, then Presto should be leaning more heavily into documenting the ROWS function, since that's basically all it does support. That does make sense - "UNBOUNDED" would be the same in both scenarios.Turnpike
See here for some explanation: modern-sql.com/blog/2019-02/postgresql-11#overElvieelvin
sqlitetutorial.net/sqlite-window-functions/sqlite-window-frame This explanation from SQLite also quite aligns with Presto usage behavior.Zermatt
Just want to chime in and thank you for the thoughtfulness, diligence, and commitment to clarity of this post. This is one of the "gems" we're all hoping to find when casting about desperately on SO. Kudos!Brakeman
V
18
  • ROWS are literally number of rows before and after that you want to aggregate. So ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING will end up with 3 rows: the curnet row 1 row before and 1 row after, regardless of the value of orderdate.
  • RANGE will look at the values of orderdate and will decide what should be aggregated and what not. So ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING would theoretically take all lines with values of orderdate-1, orderdate and orderdate+1 - this can be more than 3 lines (see more explanations here)

In Presto the ROWS is fully implemented, but the RANGE is somehow only partially implemented, and you can only use in with CURRENT ROW and UNBOUNDED.

NOTE: Recent versions of Trino (formerly known as Presto SQL) have full support for RANGE and GROUPS framing. See this blog post for an explanation of how they work.

The best way in Presto, to be able to see the diff between the two, is to make sure you have same values of the order clause:

WITH
   tt1  (custkey, orderdate, product) AS 
      ( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
                               ('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) )  )

SELECT *, 
       array_agg(product) OVER (partition by custkey) c, 
       array_agg(product) OVER (partition by custkey order by orderdate) c_order,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,
       
       -- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING)  range22,
          -- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN 2 PRECEDING AND 2 FOLLOWING)  rows22

from tt1
order by custkey, orderdate, product

You can run, and see full results, and learn from them..

I'll put here only some interesting columns:

custkey   orderdate     product    range_ubc           rows_ubc
a         10/07/1992    3          [3]                 [3]
a         10/08/1993    4          [3, 4]              [3, 4]
a         13/07/1994    5          [3, 4, 5]           [3, 4, 5]
a         13/09/1995    5          [3, 4, 5, 5, 9]     [3, 4, 5, 5]
a         13/09/1995    9          [3, 4, 5, 5, 9]     [3, 4, 5, 5, 9]
a         13/01/1997    4          [3, 4, 5, 5, 9, 4]  [3, 4, 5, 5, 9, 4]
b         10/07/1992    4          [6, 4]              [6, 4]
b         10/07/1992    6          [6, 4]              [6]
b         13/07/1994    5          [6, 4, 5, 9]        [6, 4, 5]
b         13/07/1994    9          [6, 4, 5, 9]        [6, 4, 5, 9]
b         11/11/1998    9          [6, 4, 5, 9, 9]     [6, 4, 5, 9, 9]

If you look at the 5th line of: orderdate:13/09/1995, product:5 (Note: 13/09/1995 appears twice for custkey:a) you can see that the ROWS indeed took all rows from top till current line. But if you look at the RANGE, you see it includes also the value from the row after as it has the exact same orderdate so it is considered in same window.

Verney answered 5/8, 2020 at 18:50 Comment(3)
Thanks! Very clear :) Is it also possible to only do the calculation if the entire window is full? Or should you wrap it in a case when for that? :/Mangan
Thanks. I'm not aware of any other way except wrapping it in a case when..Verney
Encouraging to see Trino (formerly Presto) has more fully implemented the functionality - thank you for that. For context, Amazon's Athena product is still built off of Presto 0.217 (Athena engine v2) or Presto 0.172 (Athena engine v1), so the functionality is still limited for AWS Athena. Hopefully they take a much bigger jump in versions for the v3 engine and we get to see these enhancements there.Turnpike

© 2022 - 2024 — McMap. All rights reserved.