Fill in gaps in data, using a value proportional to the gap distance to data from the surrounding rows?
At some point soon I'll have to prepare a list of item prices by day. The granularity is one day, and on days when there are sales of an item I'll average the prices to get that day's figure. There will be days when no sales are made, and I'm satisfied that an adequate approximation can be produced by pulling the previous and next occurrences of sales and, for each day between them, having a price that transitions linearly from one to the other.

Imagine the raw data is:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-11 150

I can get to here:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-03 NULL
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-02 NULL
Sugar  2000-01-03 NULL
Sugar  2000-01-04 NULL
Sugar  2000-01-05 NULL
Sugar  2000-01-06 NULL
Sugar  2000-01-07 NULL
Sugar  2000-01-08 NULL
Sugar  2000-01-09 NULL
Sugar  2000-01-10 NULL
Sugar  2000-01-11 150

Where I want to get to is:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-03 9.3 --being 9.5 + (((9.1 - 9.5) / 2) * 1)
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-02 105 --being 100 + (((150 - 100) / 10) * 1)
Sugar  2000-01-03 110 --being 100 + (((150 - 100) / 10) * 2)
Sugar  2000-01-04 115
Sugar  2000-01-05 120
Sugar  2000-01-06 125
Sugar  2000-01-07 130
Sugar  2000-01-08 135
Sugar  2000-01-09 140
Sugar  2000-01-10 145 --being 100 + (((150 - 100) / 10) * 9)
Sugar  2000-01-11 150
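
As a sanity check on the arithmetic above, the formula can be exercised in a few lines of Python (a throwaway sketch, not part of the SQL solution; the function name is mine):

```python
# price = preceding + (following - preceding) / gap_distance * gap_progress
def interpolate(prev_price, next_price, gap_distance, gap_progress):
    """Linear transition from prev_price to next_price across a gap of
    gap_distance days, evaluated gap_progress days into the gap."""
    return prev_price + (next_price - prev_price) / gap_distance * gap_progress

# Bread 2000-01-03: one day into the two-day gap between 9.5 and 9.1
print(round(interpolate(9.5, 9.1, 2, 1), 1))    # 9.3
# Sugar 2000-01-02..10: steps of 5 across the ten-day gap from 100 to 150
print([interpolate(100, 150, 10, k) for k in range(1, 10)])
```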

What have I tried so far? Thinking only; I'm planning on doing something like:

  • Pull the raw data
  • Join to a numbers/calendar table to pad out the sparse data
  • LAST_VALUE() (or first?) OVER ROWS UNBOUNDED PRECEDING/FOLLOWING (with a nulls-last order clause) to get the first non-null preceding_date, following_date, preceding_price and following_price from the raw data
  • DATEDIFF the padded date and the preceding_date to get a number of days (this is effectively how far across the gap we are, gap_progress), and DATEDIFF preceding_date and following_date to get the gap distance
  • feed preceding_price, following_price, gap_distance and gap_progress into the formula (preceding_price + ((following_price - preceding_price) / gap_distance) * gap_progress)

I am, however, wondering if there's a simpler way, because I've got millions of item-days and this doesn't feel like it'll be that efficient.

I find lots of examples of questions where the data from the last or next row is smeared verbatim to fill in the gaps, but I don't recall seeing this situation, where some kind of transition is attempted. Perhaps that technique can be doubly applied: a smear that runs forwards, replicating the most recent value, and alongside it a smear that runs backwards:

Item   Date       DateFwd    DateBak     PriceF PriceB
Bread  2000-01-01 2000-01-01 2000-01-01  10     10
Bread  2000-01-02 2000-01-02 2000-01-02  9.5    9.5
Bread  2000-01-03 2000-01-02 2000-01-04  9.5    9.1
Bread  2000-01-04 2000-01-04 2000-01-04  9.1    9.1
Sugar  2000-01-01 2000-01-01 2000-01-01  100    100
Sugar  2000-01-02 2000-01-01 2000-01-11  100    150
Sugar  2000-01-03 2000-01-01 2000-01-11  100    150
Sugar  2000-01-04 2000-01-01 2000-01-11  100    150
Sugar  2000-01-05 2000-01-01 2000-01-11  100    150
Sugar  2000-01-06 2000-01-01 2000-01-11  100    150
Sugar  2000-01-07 2000-01-01 2000-01-11  100    150
Sugar  2000-01-08 2000-01-01 2000-01-11  100    150
Sugar  2000-01-09 2000-01-01 2000-01-11  100    150
Sugar  2000-01-10 2000-01-01 2000-01-11  100    150
Sugar  2000-01-11 2000-01-11 2000-01-11  150    150

These might provide the necessary data for the formula (preceding_price + ((next_price - preceding_price)/gap_distance) * gap_progress):

  • gap_distance = DATEDIFF(day, DateFwd, DateBak)
  • gap_progress = DATEDIFF(day, DateFwd, Date)
  • next_price = PriceB
  • preceding_price = PriceF

?
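A rough prototype of this double-smear idea outside SQL (hypothetical names; a sketch of the approach, not the eventual query), assuming the calendar is bounded by each item's first and last sale:

```python
from datetime import date, timedelta

def interpolate_by_smears(prices, calendar):
    """prices: {date: price} of actual sales for one item.
    calendar: sorted list of dates, bounded by the first and last sale.
    Returns {date: price} with gaps filled by linear interpolation."""
    # forward smear: most recent known (date, price) at or before each day
    fwd, last = {}, None
    for d in calendar:
        if d in prices:
            last = (d, prices[d])
        fwd[d] = last
    # backward smear: next known (date, price) at or after each day
    bak, upcoming = {}, None
    for d in reversed(calendar):
        if d in prices:
            upcoming = (d, prices[d])
        bak[d] = upcoming
    # preceding_price + (next_price - preceding_price) / gap_distance * gap_progress
    out = {}
    for d in calendar:
        (date_fwd, price_f), (date_bak, price_b) = fwd[d], bak[d]
        gap_distance = (date_bak - date_fwd).days
        if gap_distance == 0:               # an actual sale on this day
            out[d] = price_f
        else:
            gap_progress = (d - date_fwd).days
            out[d] = price_f + (price_b - price_f) / gap_distance * gap_progress
    return out

cal = [date(2000, 1, 1) + timedelta(days=i) for i in range(11)]
sugar = interpolate_by_smears({date(2000, 1, 1): 100.0, date(2000, 1, 11): 150.0}, cal)
print(sugar[date(2000, 1, 5)])    # 120.0
```

Note that gap_progress here runs from DateFwd to Date, which keeps it non-negative.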

Here's the DDL for the data that I know I can get to (raw data joined with the calendar table):

CREATE TABLE Data
([I] varchar(5), [D] date, [P] DECIMAL(10,5))
;

INSERT Data
([I], [D], [P])
VALUES
('Bread', '2000-01-01', 10),
('Bread', '2000-01-02', 9.5),
('Bread', '2000-01-04', 9.1),
('Sugar', '2000-01-01', 100),
('Sugar', '2000-01-11', 150);

CREATE TABLE Cal([D] DATE);
INSERT Cal VALUES
('2000-01-01'),
('2000-01-02'),
('2000-01-03'),
('2000-01-04'),
('2000-01-05'),
('2000-01-06'),
('2000-01-07'),
('2000-01-08'),
('2000-01-09'),
('2000-01-10'),
('2000-01-11');

SELECT d.I AS [Item], c.D AS [Date], d.P AS [Price]
FROM Cal c LEFT JOIN Data d ON c.D = d.D;
Lewallen answered 18/9, 2018 at 13:20 Comment(4)
Can you share ddl and sample data along with what you have done so far? It's a whole lot easier to start with something instead of having to create tables and data to recreate what you have done already.Rode
Sample DDL: sqlfiddle.com/#!18/d8e0a - no SQL so far, only thinking - explained in the questionLewallen
sql fiddle seems to be down. Can't get there. Would be MUCH better posted in your question.Rode
Was working for me, apologies. I've copied the DDL from SQFiddle to the questionLewallen
It is easier to generate the missing gaps together with the prices in one go.

So I start off with your original raw data:

CREATE TABLE t
    ([I] varchar(5), [D] date, [P] DECIMAL(10,2))
;

INSERT INTO t
    ([I], [D], [P])
VALUES
    ('Bread', '2000-01-01', 10),
    ('Bread', '2000-01-02', 9.5),
    ('Bread', '2000-01-04', 9.1),
    ('Sugar', '2000-01-01', 100),
    ('Sugar', '2000-01-11', 150);

; with
-- number is a tally table; here a recursive cte generates 100 numbers (0-99).
-- gaps longer than that would need a bigger tally or OPTION (MAXRECURSION 0).
number as
(
    select  n = 0
    union all
    select  n = n + 1
    from    number
    where   n < 99
),
-- a cte to get the price on the next sale date, plus the day gap (cnt)
cte as
(
    select  *, 
            nextP = lead(P) over(partition by I order by D),
            cnt = datediff(day, D, lead(D) over(partition by I order by D)) - 1
    from    t
) 
select  I, 
        D = dateadd(day, n, D), 
        P = coalesce(c.P + (c.nextP - c.P) / ( cnt + 1) * n, c.P)
from    cte c
        cross join number n
where   n.n <= isnull(c.cnt, 0)

drop table t
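The shape of this answer (pair each sale with its LEAD row, then expand each pair with the tally) can be sketched outside SQL too; a rough Python equivalent, with names of my own choosing:

```python
from datetime import date, timedelta

def expand_gaps(rows):
    """rows: sorted list of (date, price) sales for one item.
    Yields one (date, price) per day, like the LEAD + tally query."""
    for (d, p), (nd, nxt_p) in zip(rows, rows[1:]):
        gap = (nd - d).days                 # corresponds to cnt + 1 in the SQL
        for n in range(gap):                # n = 0 .. gap-1
            yield d + timedelta(days=n), p + (nxt_p - p) / gap * n
    yield rows[-1]                          # the last sale has no LEAD row

bread = list(expand_gaps([(date(2000, 1, 1), 10.0),
                          (date(2000, 1, 2), 9.5),
                          (date(2000, 1, 4), 9.1)]))
print(len(bread))    # 4 daily rows, 2000-01-01 .. 2000-01-04
```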
Octal answered 18/9, 2018 at 13:58 Comment(1)
That's a good point; padding out the data first makes it harder, when analytics like LEAD can just give us the next date right away, before the data is padded outLewallen
You can use OUTER APPLY to get the previous and next row with a price that is not null:

select
  d.item,
  d.date,
  case when d.price is null then
    prev.price + ( (next.price - prev.price) /
                   datediff(day, prev.date, next.date) *
                   datediff(day, prev.date, d.date)
                 )
  else
    d.price
  end as price
from data d
outer apply
(
    select top(1) *
    from data d2
    where d2.item = d.item and d2.date < d.date and d2.price is not null
    order by d2.date desc
) prev
outer apply
(
    select top(1) *
    from data d2
    where d2.item = d.item and d2.date > d.date and d2.price is not null
    order by d2.date
) next;

Rextester demo: http://rextester.com/QBL7472
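Conceptually, each OUTER APPLY fetches the nearest earlier and later rows that do have a price; the per-row work looks roughly like this in Python (hypothetical names; assumes each gap has a real price on both sides):

```python
from datetime import date

def fill_row(rows, i):
    """rows: list of (date, price-or-None) for one item, sorted by date.
    Returns the actual or interpolated price for rows[i]."""
    d, p = rows[i]
    if p is not None:
        return p
    # nearest earlier and later rows with a non-null price
    prev = next(r for r in reversed(rows[:i]) if r[1] is not None)
    nxt = next(r for r in rows[i + 1:] if r[1] is not None)
    gap_distance = (nxt[0] - prev[0]).days
    gap_progress = (d - prev[0]).days
    return prev[1] + (nxt[1] - prev[1]) / gap_distance * gap_progress

rows = ([(date(2000, 1, 1), 100.0)]
        + [(date(2000, 1, k), None) for k in range(2, 11)]
        + [(date(2000, 1, 11), 150.0)])
print(fill_row(rows, 6))    # 2000-01-07 -> 130.0
```

Scanning outward from each row like this is linear work per lookup, which matches the performance caveat below.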

UPDATE: This is probably slow. Perhaps it helps to add "and d.price is null" to the where clauses in the subqueries, to show the DBMS that it doesn't have to actually look for other records when the price is not null. Just check the execution plans to see if that helps.

Kesterson answered 18/9, 2018 at 13:46 Comment(0)
I would put your formula 100 + ((150 - 100) / 10) * 9 etc. into a scalar UDF and use it in a persisted computed column.

Nephology answered 18/9, 2018 at 13:38 Comment(4)
Plus the formula is dynamic based on the number of rows. I don't think a computed column will work well here.Rode
My thought is that there is not going to be a SELECT-time solution that will be performant, so make this a one-time operation. If you can't make any DDL changes at all, then I would push back and say that this calculation needs to be performed in the application layer, if you want good performance at SELECT-time. Another option would be to perform this calculation in a job, if "real-time" can be compromised.Nephology
@SeanLange That's why I suggest a UDF, to encapsulate multi-statement logic.Nephology
The idea not to do this ad-hoc in a query is good. Maybe have another column estimated_price that gets filled in an after statement trigger.Kesterson
This will work for SQL Server 2012+. Test table:

DECLARE @t table
(Item char(5), Date date, Price decimal(9,1))

INSERT @t values
('Bread','2000-01-01', 10),
('Bread','2000-01-02',  9.5),
('Bread','2000-01-04',  9.1),
('Sugar','2000-01-01',  100),
('Sugar','2000-01-11',  150)

Query

;WITH CTE as
(
  SELECT
    Item, Date, Price,
    lead(price) over(partition by Item order by Date) nextprice,
    lead(Date) over(partition by Item order by Date) nextDate
  FROM @t
), N(N) as
(
  SELECT 1 FROM(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))M(N)
), tally(N) as
(
  SELECT ROW_NUMBER()OVER(ORDER BY N.N)FROM N,N a,N b,N c,N d,N e,N f
)
SELECT 
  dateadd(d, coalesce(r, 0), Date) Date,
  Item, 
  CAST(price + coalesce((nextprice-price) * r 
    / datediff(d, date, nextdate), 0) as decimal(10,1)) Price
FROM CTE
OUTER APPLY
(
  SELECT top(coalesce(datediff(d, date, nextdate), 0)) 
    row_number() over (order by (select 1))-1 r
  FROM N
) z
ORDER BY item, date

Result:

Date    Item    Price
2000-01-01  Bread   10.0
2000-01-02  Bread   9.5
2000-01-03  Bread   9.3
2000-01-04  Bread   9.1
2000-01-01  Sugar   100.0
2000-01-02  Sugar   105.0
2000-01-03  Sugar   110.0
2000-01-04  Sugar   115.0
2000-01-05  Sugar   120.0
2000-01-06  Sugar   125.0
2000-01-07  Sugar   130.0
2000-01-08  Sugar   135.0
2000-01-09  Sugar   140.0
2000-01-10  Sugar   145.0
2000-01-11  Sugar   150.0
Infrequency answered 18/9, 2018 at 14:34 Comment(0)
