Can window function LAG reference the column which value is being calculated?
Asked Answered
E

3

9

I need to calculate value of some column X based on some other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement query in the form

SELECT <some fields>, 
  <some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...) AS X
FROM <table>

This is not possible because only existing columns can be used in window function so I'm looking way how to overcome this.

Here is an example. I have a table with events. Each event has type and time_stamp.

create table event (id serial, type integer, time_stamp integer);

I wan't to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events for given type by time_stamp ascending. Then

  1. the first event is not a duplicate
  2. all events that follow non duplicate and are within some time frame after it (that is their time_stamp is not greater then time_stamp of the previous non duplicate plus some constant TIMEFRAME) are duplicates
  3. the next event which time_stamp is greater than previous non duplicate by more than TIMEFRAME is not duplicate
  4. and so on

For this data

insert into event (type, time_stamp) 
 values 
  (1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10), 
  (1,15), (1, 21), (2,13), 
  (1, 40);

and TIMEFRAME=10 result should be

time_stamp | type | duplicate
-----------------------------
        1  |    1 | false
        2  |    1 | true     
        3  |    1 | true 
       10  |    1 | true 
       15  |    1 | false 
       21  |    1 | true
       40  |    1 | false
        2  |    2 | false
       10  |    2 | true
       13  |    2 | false

I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:

WITH evt AS (
  SELECT 
    time_stamp, 
    CASE WHEN 
      time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
    THEN 
      time_stamp
    ELSE
      LAG(current_non_dupl_time_stamp) OVER w
    END AS current_non_dupl_time_stamp
  FROM event
  WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate

But this does not work because the field which is calculated cannot be referenced in LAG:

ERROR:  column "current_non_dupl_time_stamp" does not exist.

So the question: can I rewrite this query to achieve the effect I need?

Erst answered 17/12, 2015 at 16:1 Comment(4)
I couldn't understand the timeframe part. specially this part: the next event which time_stamp if greater than previous non duplicate by more than TIMEFRAME is not duplicate. is timeframe a constant, a field, a calculation ?Incurvate
TIMEFRAME is some constant. The rationale is I want to skip event if it comes within given time frame after previous event which was not skipped.Erst
Your desired output contains timestamp 40, but your example data set does not? Could you clarify?Nahshun
You are right, it was a mistake.Erst
E
1

An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.

State transition function:

create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
    if st is null or st[1] + timeframe <= time_stamp
    then 
        st[1] := time_stamp;
    end if;
    st[2] := time_stamp;
    return st;
end $$;

Final function:

create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
    select st[1] <> st[2];
$$;

Aggregate:

create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
    sfunc = is_duplicate,
    stype = int[],
    finalfunc = is_duplicate_final
);

Query:

select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;

 id | type | time_stamp | is_duplicate_agg 
----+------+------------+------------------
  1 |    1 |          1 | f
  2 |    1 |          2 | t
  4 |    1 |          3 | t
  5 |    1 |         10 | t
  7 |    1 |         15 | f
  8 |    1 |         21 | t
 10 |    1 |         40 | f
  3 |    2 |          2 | f
  6 |    2 |         10 | t
  9 |    2 |         13 | f
(10 rows)   

Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.

Economic answered 9/6, 2018 at 11:59 Comment(0)
S
2

Naive recursive chain knitter:


        -- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
        SELECT e.type,e.time_stamp
        , ROW_NUMBER() OVER www as rn                   -- number the records
        , FIRST_VALUE(e.time_stamp) OVER www as fst     -- the "group leader"
        , EXISTS (SELECT * FROM event x
                WHERE x.type = e.type
                AND x.time_stamp < e.time_stamp) AS is_dup
        FROM event e
        WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
        ;

WITH RECURSIVE ttt AS (
        SELECT d0.*
        FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
    UNION ALL
        SELECT d1.type, d1.time_stamp, d1.rn
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
                 ELSE ttt.fst END AS fst   -- new "group leader"
          , CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
                 ELSE True END AS is_dup
        FROM drag d1
        JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
        )
SELECT * FROM ttt
ORDER BY type, time_stamp
        ;

Results:


CREATE TABLE
INSERT 0 10
CREATE VIEW
 type | time_stamp | rn | fst | is_dup 
------+------------+----+-----+--------
    1 |          1 |  1 |   1 | f
    1 |          2 |  2 |   1 | t
    1 |          3 |  3 |   1 | t
    1 |         10 |  4 |   1 | t
    1 |         15 |  5 |   1 | t
    1 |         21 |  6 |   1 | t
    1 |         40 |  7 |  40 | f
    2 |          2 |  1 |   2 | f
    2 |         10 |  2 |   2 | t
    2 |         13 |  3 |   2 | t
(10 rows)
Strikebound answered 7/1, 2017 at 19:22 Comment(0)
A
1

This feels more like a recursive problem than windowing function. The following query obtained the desired results:

WITH RECURSIVE base(type, time_stamp) AS (

  -- 3. base of recursive query
  SELECT x.type, x.time_stamp, y.next_time_stamp
    FROM 
         -- 1. start with the initial records of each type   
         ( SELECT type, min(time_stamp) AS time_stamp
             FROM event
             GROUP BY type
         ) x
         LEFT JOIN LATERAL
         -- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
         ( SELECT MIN(time_stamp) next_time_stamp
             FROM event
             WHERE type = x.type
               AND time_stamp > (x.time_stamp + 10)
         ) y ON true

  UNION ALL

  -- 4. recursive join, same logic as base
  SELECT e.type, e.time_stamp, z.next_time_stamp
    FROM event e
    JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
    LEFT JOIN LATERAL
    ( SELECT MIN(time_stamp) next_time_stamp
       FROM event
       WHERE type = e.type
         AND time_stamp > (e.time_stamp + 10)
    ) z ON true

)

-- The actual query:

-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false
  FROM base

UNION

-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
  FROM event
  WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base) 

ORDER BY type, time_stamp

There are a lot of caveats with this. It assumes no duplicate time_stamp for a given type. Really the joins should be based on a unique id rather than type and time_stamp. I didn't test this much, but it may at least suggest an approach.

This is my first time to try a LATERAL join. So there may be a way to simplify that moe. Really what I wanted to do was a recursive CTE with the recursive part using MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in CTEs in that manner. But it seems the lateral join can be used in the CTE.

Alkanet answered 7/1, 2017 at 17:28 Comment(0)
E
1

An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.

State transition function:

create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
    if st is null or st[1] + timeframe <= time_stamp
    then 
        st[1] := time_stamp;
    end if;
    st[2] := time_stamp;
    return st;
end $$;

Final function:

create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
    select st[1] <> st[2];
$$;

Aggregate:

create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
    sfunc = is_duplicate,
    stype = int[],
    finalfunc = is_duplicate_final
);

Query:

select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;

 id | type | time_stamp | is_duplicate_agg 
----+------+------------+------------------
  1 |    1 |          1 | f
  2 |    1 |          2 | t
  4 |    1 |          3 | t
  5 |    1 |         10 | t
  7 |    1 |         15 | f
  8 |    1 |         21 | t
 10 |    1 |         40 | f
  3 |    2 |          2 | f
  6 |    2 |         10 | t
  9 |    2 |         13 | f
(10 rows)   

Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.

Economic answered 9/6, 2018 at 11:59 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.