Use T-SQL window functions to retrieve 5-minute averages from 1-minute data

I have a database table containing one-minute periods of Open, Close, High, Low, Volume values for a security. I'm using SQL Server 2017, but 2019 RC is an option.

I am trying to find an efficient SQL Server query that can aggregate these into 5-minute windows, where:

  • Open = first Open value of the window
  • Close = last Close value of the window
  • High = max High value of the window
  • Low = min Low value of the window
  • Volume = avg Volume across the window

Ideally this query would account for gaps in the data, i.e. be based on date calculations rather than counting preceding / following rows.

For example, say I have these 6 minutes of data:

| Time             | Open | Close | High | Low | Volume |
|------------------|------|-------|------|-----|--------|
| 2019-10-30 09:30 | 5    | 10    | 15   | 1   | 125000 |
| 2019-10-30 09:31 | 10   | 15    | 20   | 5   | 100000 |
| 2019-10-30 09:32 | 15   | 20    | 25   | 10  | 120000 |
| 2019-10-30 09:33 | 20   | 25    | 30   | 15  | 10000  |
| 2019-10-30 09:34 | 20   | 22    | 40   | 2   | 13122  |
| 2019-10-30 09:35 | 22   | 30    | 35   | 4   | 15000  |

(The 09:35 row is not factored in, since it would be the first row of the next 5-minute window.)

I am trying to write a query that would give me the first 5-minute aggregate, like this:

| Time             | Open | Close | High | Low | Volume  |
|------------------|------|-------|------|-----|---------|
| 2019-10-30 09:30 | 5    | 30    | 40   | 1   | 50224.4 |

Any tips? I'm banging my head against the wall with the OVER clause and its PARTITION/RANGE options.
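
For reference, a minimal script to reproduce the sample rows above; the table name (dbo.MinuteBars) and the datatypes are placeholders, not the real schema, so adjust them as needed:

CREATE TABLE dbo.MinuteBars (
    [Time]   datetime      NOT NULL PRIMARY KEY,
    [Open]   decimal(18,4) NOT NULL,
    [Close]  decimal(18,4) NOT NULL,
    [High]   decimal(18,4) NOT NULL,
    [Low]    decimal(18,4) NOT NULL,
    [Volume] int           NOT NULL
);

INSERT INTO dbo.MinuteBars ([Time], [Open], [Close], [High], [Low], [Volume])
VALUES ('2019-10-30 09:30',  5, 10, 15,  1, 125000)
     , ('2019-10-30 09:31', 10, 15, 20,  5, 100000)
     , ('2019-10-30 09:32', 15, 20, 25, 10, 120000)
     , ('2019-10-30 09:33', 20, 25, 30, 15,  10000)
     , ('2019-10-30 09:34', 20, 22, 40,  2,  13122)
     , ('2019-10-30 09:35', 22, 30, 35,  4,  15000);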

Salpingectomy asked 4/11, 2019 at 7:54 Comment(1)
It would be a plus if you shared the script to generate the sample data! – Hauberk

The gist of the problem is rounding datetime values down to a 5-minute boundary, which (assuming the datatype is datetime) can be done using DATEADD(MINUTE, DATEDIFF(MINUTE, 0, time) / 5 * 5, 0). The rest is basic grouping and window functions:

WITH cte AS (
  SELECT clamped_time
       , [Open]
       , [Close]
       , [High]
       , [Low]
       , [Volume]
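       -- rn1 = 1 marks the earliest minute in each 5-minute bucket, rn2 = 1 the latest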
       , rn1 = ROW_NUMBER() OVER (PARTITION BY clamped_time ORDER BY [Time])
       , rn2 = ROW_NUMBER() OVER (PARTITION BY clamped_time ORDER BY [Time] DESC)
  FROM t
  CROSS APPLY (
      SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, time) / 5 * 5, 0)
  ) AS x(clamped_time)
)
SELECT clamped_time
     , MIN(CASE WHEN rn1 = 1 THEN [Open] END) AS [Open]
     , MIN(CASE WHEN rn2 = 1 THEN [Close] END) AS [Close]
     , MAX([High]) AS [High]
     , MIN([Low]) AS [Low]
     , AVG([Volume]) AS [Volume]
FROM cte
GROUP BY clamped_time

Demo on db<>fiddle
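
As a quick sanity check of the rounding expression (the inline VALUES table below is just for illustration, using two of the sample timestamps):

SELECT t.[Time]
     , DATEADD(MINUTE, DATEDIFF(MINUTE, 0, t.[Time]) / 5 * 5, 0) AS clamped_time
FROM (VALUES (CAST('2019-10-30 09:34' AS datetime))
           , (CAST('2019-10-30 09:35' AS datetime))) AS t([Time]);
-- 09:34 clamps down to 09:30; 09:35 starts the next bucket, so it clamps to 09:35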

Habitual answered 4/11, 2019 at 8:26 Comment(3)
Thank you so much! It's so fast! I had gotten this working from a 2015 post (sqlfingers.com/2015/06/…) but it took 3 seconds across a few thousand rows for one security; your solution generates equivalent values in 2 seconds across 500k rows (or 100 ms per security, which is how it will be run). Thanks! – Salpingectomy
Out of curiosity, what was the reasoning for editing it to use rn1/rn2 for calculating Open/Close? I tried shortcutting with this instead of the CASE expressions: LAST_VALUE([Close]) OVER (PARTITION BY clamped_time ORDER BY t.PeriodOpen DESC), but I'm assuming you had a reason for not doing that? – Salpingectomy
@user12319070 It is much simpler with ROW_NUMBER. With LAST_VALUE you need a ROWS BETWEEN clause, which is something I am not familiar with, but you're welcome to use it if it creates a better plan. Re speed: if you want it faster, I suggest creating a persisted computed column to store the 5-minute boundary and creating an index on (x_time, time) include (open, close, ...). – Habitual
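
A sketch of that indexing suggestion, assuming the placeholder dbo.MinuteBars table from the question and the column name x_time from the comment; not part of the original answer:

-- persisted computed column storing the 5-minute boundary (the expression is deterministic)
ALTER TABLE dbo.MinuteBars
    ADD x_time AS DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [Time]) / 5 * 5, 0) PERSISTED;

-- index keyed on the boundary plus the minute, covering the aggregated columns
CREATE INDEX IX_MinuteBars_x_time
    ON dbo.MinuteBars (x_time, [Time])
    INCLUDE ([Open], [Close], [High], [Low], [Volume]);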

You want to analyze the data in 5-minute intervals. You could use window functions with the following partitioning clause:

partition by datepart(year, t.[time]),
    datepart(month, t.[time]),
    datepart(day, t.[time]),
    datepart(hour, t.[time]),
    (datepart(minute, t.[time]) / 5)

Query:

select *
from (
    select  
        t.time,
        row_number() over(
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
            order by [time]
        ) [rn],
        first_value([open]) over(
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
            order by [time]
        ) [open],
        last_value([close]) over(
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
            order by [time]
            -- an explicit frame is needed here: the default frame stops at the
            -- current row, so last_value() would just return that row's [close]
            rows between unbounded preceding and unbounded following
        ) [close],
        max([high]) over (
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
        ) [high],
        min([low]) over (
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
        ) [low],
        avg([volume]) over (
            partition by datepart(year, [time]),
                datepart(month, [time]),
                datepart(day, [time]),
                datepart(hour, [time]),
                (datepart(minute, [time]) / 5)
        ) [volume]
    from mytable t
) t
where rn = 1
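
The comment below notes that the partition definition in the accepted answer is more concise: the five datepart expressions used here are equivalent to that answer's single rounding expression, so the same partitioning could also be written as

partition by dateadd(minute, datediff(minute, 0, [time]) / 5 * 5, 0)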
Comptroller answered 4/11, 2019 at 8:16 Comment(2)
Thank you very much! I tried your version too, and it definitely helped me understand the partitioning logic for the OVER clause. It also generates fine results, though the accepted answer appears to be slightly faster to execute and is shorter. Tough choice, though! – Salpingectomy
I think you made the right choice; I find that the solution by @SalmanA is actually better than mine, because the definition of the partition is more concise. – Comptroller

You can try this:

  SELECT
      MIN([Time]) [Time], 
      Min([Open]) [Open],
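      -- [Close] is approximated here by the next 5-minute group's MIN([Open])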
      LEAD(Min([Open])) OVER (ORDER BY MIN([Time])) AS [Close],
      Max([High]) [High], 
      Min([Low]) [Low], 
      Avg(Volume) Volume
  FROM SampleData
  GROUP BY DATEADD(MINUTE, -1 * DATEPART(MINUTE, [Time]) % 5, [Time])

sql fiddle

Ambriz answered 4/11, 2019 at 8:26 Comment(0)
