Extrapolate daily historical values from a table that only records when a value changes (Postgresql 9.3)
Asked Answered
R

4

11

I have a table that records a row for each time a score for a location has changed.

score_history:

  • id int PK (uuid auto incrementing int)
  • happened_at timestamp (when the score changed)
  • location_id int FK (the location that the value is for)
  • score float (the new score)

This was done with an eye on efficiency and being able to simply retrieve a list of changes for a given location and serves that purpose nicely.

I'm trying to output the data in a very redundant format to help load it into a rigid external system. The external system expects a row for each location * every date. The goal is to represent the last score value for each location for each date. So if the score changed 3 times in a given date only the score closest to midnight would be considered that locations closing score for the day. I imagine this is similar to the challenge of creating a close of business inventory level fact table.

I have a handy star schema style date dimension table which has a row for every date fully covering this sample period and well into the future.

That table looks like

dw_dim_date:

  • date date PK
  • a bunch of other columns like week number, is_us_holiday etc.

So, if I had only 3 records in the score_history table...

1, 2019-01-01:10:13:01, 100, 5.0
2, 2019-01-05:20:00:01, 100, 5.8
3, 2019-01-05:23:01:22, 100, 6.2

The desired output would be:

2019-01-01, 100, 5.0 
2019-01-02, 100, 5.0 
2019-01-03, 100, 5.0
2019-01-04, 100, 5.0 
2019-01-05, 100, 6.2

3 Requirements:

  1. One row per day per location even if there are no score records for that day.
  2. If there are score records for that day the last one before midnight should be the score value for the row. In the event of a tie the greater of the two should "win".
  3. If there are zero score records for that day the score should be the most recent previous score.

I've been chasing my tail through subqueries and window functions.

Because I'm hesitant to post something without something I tried I'll share this trainwreck which produces output but of no meaning...

SELECT dw_dim_date.date,
       (SELECT score 
        FROM score_history 
        WHERE score_history.happened_at::DATE < dw_dim_date.date 
           OR score_history.happened_at::DATE = dw_dim_date.date 
        ORDER BY score_history.id desc limit 1) as last_score
FROM dw_dim_date
WHERE dw_dim_date.date > '2019-06-01'

Grateful for guidance or pointers to other questions to read.

Riddick answered 3/7, 2019 at 15:48 Comment(2)
I'm not looking for the MAX value within a day. I'm looking for the last value for a day. And many days will have no values at all.Riddick
1) One row per day per location even if there are no score records for that day. 2) If there are score records for that day the last one before midnight should be the score value for the row 3) If there are zero score records for that day the score should be the most recent previous score.Riddick
R
5

You could achieve it with usage of correlated subqueries and LATERAL:

SELECT sub.date, sub.location_id, score
FROM (SELECT * FROM dw_dim_date
      CROSS JOIN (SELECT DISTINCT location_id FROM score_history) s
      WHERE date >= '2019-01-01'::date) sub
,LATERAL(SELECT score FROM score_history sc 
         WHERE sc.happened_at::date <= sub.date
           AND sc.location_id = sub.location_id
         ORDER BY happened_at DESC LIMIT 1) l
,LATERAL(SELECT MIN(happened_at::date) m1, MAX(happened_at::date) m2 
         FROM score_history sc
         WHERE sc.location_id = sub.location_id) lm
WHERE sub.date BETWEEN lm.m1 AND lm.m2
ORDER BY location_id, date;

db<>fiddle demo

How it works:

1) s (it is cross join of all dates per location_id)

2) l (selecting score per location)

3) lm (selecting min/max date per location for filtering)

4) WHERE filter dates on range that is available, it could be relaxed if needed

Rowboat answered 6/7, 2019 at 8:23 Comment(2)
Thank you this was a very interesting introduction to LATERAL and also seems to be pretty performant with a largish data set.Riddick
@Riddick Great to hear that :)Rowboat
H
2

I think you can try something like this. The main things I changed are wrapping things in DATE() and using another SO answer for the date finder:

SELECT
  dw_dim_date.date,
  (
    SELECT
      score
    FROM
      score_history
    WHERE
      DATE(score_history.happened_at) <= dw_dim_date.date
    ORDER BY
      score_history.happened_at DESC
    LIMIT
      1
  ) as last_score
FROM
  dw_dim_date
WHERE
  dw_dim_date.date >= DATE('2019-01-01')

This uses the SQL method from here to find the nearest past data to the one requested: PostgreSQL return exact or closest date to queried date

Hardily answered 6/7, 2019 at 4:12 Comment(3)
You may have meant >= DATE('2019-01-01') in your last line?Harriette
I'm not sure. In the original question it is without the = and has 2019-06-01 so I original stuck with it. Updating it now.Hardily
Thank you but this will work for the last score across all location_ids for a day but not for each location_id. Each date needs 1 entry for every location_id. Again the scenario is capturing something like a restaurant rating at the end of every day.Riddick
H
0
WITH
max_per_day_location AS (
SELECT
    SH.happened_at::DATE as day,
    SH.location_id,
    max(SH.happened_at) as happened_at
FROM
    score_history SH
GROUP BY
    SH.happened_at::DATE,
    SH.location_id
),
date_location AS (
SELECT DISTINCT
    DD."date",
    SH.location_id
FROM
    dw_dim_date DD,
    max_per_day_location SH
),
value_partition AS (
SELECT
    DD."date",
    DD.location_id,
    SH.score,
    SH.happened_at,
    MPD.happened_at as hap2,
    sum(case when score is null then 0 else 1 end) OVER
    (PARTITION BY DD.location_id ORDER BY "date", SH.happened_at desc) AS value_partition
FROM
    date_location DD
    LEFT JOIN score_history SH
    ON DD."date" = SH.happened_at::DATE
    AND DD.location_id = SH.location_id
    LEFT join max_per_day_location MPD
    ON SH.happened_at = MPD.happened_at
WHERE NOT (MPD.happened_at IS NULL
           AND
           SH.happened_at IS NOT NULL)
ORDER BY
    DD."date"
),
final AS (
SELECT
    "date",
    location_id,
    first_value(score) over w
FROM
    value_partition
WINDOW w AS (PARTITION BY location_id, value_partition
             ORDER BY happened_at rows between unbounded preceding and unbounded following)
order by "date"
)
SELECT DISTINCT * FROM final ORDER BY location_id, date
;

I'm sure there are less verbose ways to do this.

I've got a SQLFiddle with some test data here: http://sqlfiddle.com/#!17/9d122/1

The main thing that makes this work is making a "value partition" to access the previous non null value. More here:

The date_location subquery just makes a single row per location_id per day since that is the base "row level" desired in the output.

The max_per_day_location subquery is used to filter out the earlier entries for location/day combos that have multiple scores and only keep the last one for that day.

Harriette answered 6/7, 2019 at 6:43 Comment(0)
P
0

The simplest solution would probably be:

    select dw_dim_date.date, location_id, score
    from dw_dim_date, score_history S1
    where happened_at::date  <= dw_dim_date.date and 
          not exists (select * 
                      from score_history S2 
                      where S2.happened_at::date  <= dw_dim_date.date and 
                            S1.happened_at< S2.happened_at and
                            S1.location_id = S2.location_id)

This computes a Cartesian product between the dates and score histories, and then takes, for each date and location, the score for which there does not exist a later score (within the date period). I would suggest starting with this, since it would probably be the easiest to maintain, and only going with a more complex solution if this is not efficient enough (with appropriate indexes).

As SQL Fiddle on this is at https://dbfiddle.uk/?rdbms=postgres_9.4&fiddle=3c2e4ae49cbc43f7840b942d223be119

Puncheon answered 8/7, 2019 at 6:3 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.