At some point soon I'll have to prepare a list of prices of items on days. The granularity is 1 day and on days when there are sales of an item, I'll average the prices to get that day's average. There will be days where no sales are made, and I'm suited that an adequate approximation can be used by pulling the previous and next occurrences of sales, and for each day between them having a price that transitions linearly from one to the other.
Imagine the raw data is:
Item Date Price
Bread 2000-01-01 10
Bread 2000-01-02 9.5
Bread 2000-01-04 9.1
Sugar 2000-01-01 100
Sugar 2000-01-11 150
I can get to here:
Item Date Price
Bread 2000-01-01 10
Bread 2000-01-02 9.5
Bread 2000-01-03 NULL
Bread 2000-01-04 9.1
Sugar 2000-01-01 100
Sugar 2000-01-02 NULL
Sugar 2000-01-03 NULL
Sugar 2000-01-04 NULL
Sugar 2000-01-05 NULL
Sugar 2000-01-06 NULL
Sugar 2000-01-07 NULL
Sugar 2000-01-08 NULL
Sugar 2000-01-09 NULL
Sugar 2000-01-10 NULL
Sugar 2000-01-11 150
Where I want to get to is:
Item Date Price
Bread 2000-01-01 10
Bread 2000-01-02 9.5
Bread 2000-01-03 9.3 --being 9.5 + ((9.1 - 9.5 / 2) * 1)
Bread 2000-01-04 9.1
Sugar 2000-01-01 100
Sugar 2000-01-02 105 --being 100 + (150 - 100 / 10) * 1)
Sugar 2000-01-03 110 --being 100 + (150 - 100 / 10) * 2)
Sugar 2000-01-04 115
Sugar 2000-01-05 120
Sugar 2000-01-06 125
Sugar 2000-01-07 130
Sugar 2000-01-08 135
Sugar 2000-01-09 140
Sugar 2000-01-10 145 --being 100 + (150 - 100 / 10) * 9)
Sugar 2000-01-11 150
What have I tried so far? Thinking only; I'm planning on doing something like:
- Pull the raw data
- Join to a numbers/calendar table to pad out to sparse data
- LAST_VALUE() (or first?) OVER ROWS UNBOUNDED PRECEDING/FOLLOWING (with a nulls-last order clause) to get the first non-null preceding_date, following_date, preceding_price and following_price from the raw data
- DATEDIFF the fake date and the preceding_date to get a number of days (this is effectively how far across the gap we are, gap_progress) and the gap distance (following_date - preceding_date)
- get the next price, previous price and gap distance for the formula (preceding_price + ((next_price - preceding_price)/gap_distance) * gap_progress)
I am, however, wondering if there's a simpler way, because I've got millions of item-days and this doesn't feel like it'll be that efficient..
I find lots of examples of questions where the data from the last or next row is smeared verbatim to fill in the gaps, but I don't recall seeing this situation where some kind of transition is attempted. Perhaps this technique can be doubly applied, by having a smear that runs forwards, replicating the most recent value and alongside it a smear that runs backwards:
Item Date DateFwd DateBak PriceF PriceB
Bread 2000-01-01 2000-01-01 2000-01-01 10 10
Bread 2000-01-02 2000-01-02 2000-01-02 9.5 9.5
Bread 2000-01-03 2000-01-02 2000-01-04 9.5 9.1
Bread 2000-01-04 2000-01-04 2000-01-04 9.1 9.1
Sugar 2000-01-01 2000-01-01 2000-01-01 100 100
Sugar 2000-01-02 2000-01-01 2000-01-11 100 150
Sugar 2000-01-03 2000-01-01 2000-01-11 100 150
Sugar 2000-01-04 2000-01-01 2000-01-11 100 150
Sugar 2000-01-05 2000-01-01 2000-01-11 100 150
Sugar 2000-01-06 2000-01-01 2000-01-11 100 150
Sugar 2000-01-07 2000-01-01 2000-01-11 100 150
Sugar 2000-01-08 2000-01-01 2000-01-11 100 150
Sugar 2000-01-09 2000-01-01 2000-01-11 100 150
Sugar 2000-01-10 2000-01-01 2000-01-11 100 150
Sugar 2000-01-11 2000-01-11 2000-01-11 150 150
These might provide the necessary data for the formula
(preceding_price + ((next_price - preceding_price)/gap_distance) * gap_progress)
:
- gap_distance = DATEDIFF(day, DateFwd, DateBak)
- gap_progress = DATEDIFF(day, Date, DateFwd)
- next_price = PriceB
- preceding_price = PriceF
?
Here's a DDL of the data that I know I can get to (raw data joined with calendar table)
CREATE TABLE Data
([I] varchar(5), [D] date, [P] DECIMAL(10,5))
;
INSERT Data
([I], [D], [P])
VALUES
('Bread', '2000-01-01', 10),
('Bread', '2000-01-02', 9.5),
('Bread', '2000-01-04', 9.1),
('Sugar', '2000-01-01', 100),
('Sugar', '2000-01-11', 150);
CREATE TABLE Cal([D] DATE);
INSERT Cal VALUES
('2000-01-01'),
('2000-01-02'),
('2000-01-03'),
('2000-01-04'),
('2000-01-05'),
('2000-01-06'),
('2000-01-07'),
('2000-01-08'),
('2000-01-09'),
('2000-01-10'),
('2000-01-11');
SELECT d.i as [item], c.d as [date], d.p as [price] FROM
cal c LEFT JOIN data d ON c.d = d.d