PostgreSQL - column value changed - select query optimization
Asked Answered
T

5

10

Say we have a table:

CREATE TABLE p
(
   id serial NOT NULL, 
   val boolean NOT NULL, 
   PRIMARY KEY (id)
);

Populated with some rows:

insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID  VAL
1   1
2   0
3   0
4   1
5   1
6   1
7   0

I want to determine when the value has been changed. So the result of my query should be:

ID  VAL
2   0
4   1
7   0

I have a solution with joins and subqueries:

select min(id) id, val from
(
  select p1.id, p1.val, max(p2.id) last_prev
  from p p1
  join p p2
    on p2.id < p1.id and p2.val != p1.val
  group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;

But it is very inefficient and will work extremely slow for tables with many rows.
I believe there could be more efficient solution using PostgreSQL window functions?

SQL Fiddle

Trichloromethane answered 7/6, 2014 at 15:54 Comment(1)
Would you consider the value in the first row to be "changed" from previously "unknown" or "nothing"?Heisenberg
S
10

This is how I would do it with an analytic:

SELECT id, val
  FROM ( SELECT id, val
           ,LAG(val) OVER (ORDER BY id) AS prev_val
       FROM p ) x
  WHERE val <> COALESCE(prev_val, val)
  ORDER BY id

Update (some explanation):

Analytic functions operate as a post-processing step. The query result is broken into groupings (partition by) and the analytic function is applied within the context of a grouping.

In this case, the query is a selection from p. The analytic function being applied is LAG. Since there is no partition by clause, there is only one grouping: the entire result set. This grouping is ordered by id. LAG returns the value of the previous row in the grouping using the specified order. The result is each row having an additional column (aliased prev_val) which is the val of the preceding row. That is the subquery.

Then we look for rows where the val does not match the val of the previous row (prev_val). The COALESCE handles the special case of the first row which does not have a previous value.

Analytic functions may seem a bit strange at first, but a search on analytic functions finds a lot of examples walking through how they work. For example: http://www.cs.utexas.edu/~cannata/dbms/Analytic%20Functions%20in%20Oracle%208i%20and%209i.htm Just remember that it is a post-processing step. You won't be able to perform filtering, etc on the value of an analytic function unless you subquery it.

Sparrowgrass answered 7/6, 2014 at 16:24 Comment(5)
For the benefit of future readers with less familiarity with window functions, could you please explain why this works/what it's doing?Sidky
@Sidky Sure, some explanation has been added.Sparrowgrass
It works fine without COALESCE or am I missing something? sqlfiddle.com/#!15/30044/8Trichloromethane
@Nailgun: COALESCE would only be useful if your column could be NULL. In this case, COALESCE only kicks in for the first row - where it doesn't change anything. Neither val <> val nor val <> NULL evaluate to TRUE - which is the only result that matters in a WHERE clause. So, you can just remove COALESCE here. I wrote more in my answer.Heisenberg
@Glenn: Postgres never uses the term "analytic functions" for window functions - a rather odd term for what this class of functions does, at least to my ears. You probably come from an Oracle background.Heisenberg
H
8

Window function

Instead of calling COALESCE, you can provide a default from the window function lag() directly. A minor detail in this case since all columns are defined NOT NULL. But this may be essential to distinguish "no previous row" from "NULL in previous row".

SELECT id, val
FROM  (
   SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.

If you consider the first row to be "changed" (unlike your demo output suggests), you need to observe NULL values - even though your columns are defined NOT NULL. Basic lag() returns NULL in case there is no previous row:

SELECT id, val
FROM  (
   SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Or employ the additional parameters of lag() once again:

SELECT id, val
FROM  (
   SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Recursive CTE

As proof of concept. :) Performance won't keep up with posted alternatives.

WITH RECURSIVE cte AS (
   SELECT id, val
   FROM   p
   WHERE  NOT EXISTS (
      SELECT 1
      FROM   p p0
      WHERE  p0.id < p.id
      )
  
   UNION ALL
   SELECT p.id, p.val
   FROM   cte
   JOIN   p ON p.id   > cte.id
           AND p.val <> cte.val
   WHERE NOT EXISTS (
     SELECT 1
     FROM   p p0
     WHERE  p0.id   > cte.id
     AND    p0.val <> cte.val
     AND    p0.id   < p.id
     )
  )
SELECT * FROM cte;

With an improvement from @wildplasser.

SQL Fiddle demonstrating all.

Heisenberg answered 8/6, 2014 at 0:42 Comment(6)
You missed my minalimistic approach. (or should I attempt a recursive CTE solution ?-)Paracelsus
Maybe I'll try an attempt for a rCTE solution. I do think that the most simple (in mathematical terms ...) solution should be the preferred one.Paracelsus
@wildplasser: I was focussing on window functions before. Your piece of SQL art can't be improved further AFAICT (+1). Might melt the brain of some unsuspecting users, though. As for the rCTE ... here you go. :)Heisenberg
FROM p WHERE id = 1 -->> FROM p p1 WHERE NOT EXISTS ( SELECT 1 FROM p px WHERE px.id < p1.id)Paracelsus
@wildplasser: Of course. Applied that to make it shine.Heisenberg
How would I go about in a situation where there is a third column such as lets say device_id. I would like to fetch the values that the change happened for all device_id values.Chaqueta
P
3

Can even be done without window functions.

SELECT * FROM p p0
WHERE EXISTS (
        SELECT * FROM p ex
        WHERE ex.id < p0.id
        AND ex.val <> p0.val
        AND NOT EXISTS (
                SELECT * FROM p nx
                WHERE nx.id < p0.id
                AND nx.id > ex.id
                )
        );

UPDATE: Self-joining a non-recursive CTE (could also be a subquery instead of a CTE)

WITH drag AS (
        SELECT id
        , rank() OVER (ORDER BY id) AS rnk
        , val
        FROM p
        )
SELECT d1.*
FROM drag d1
JOIN drag d0 ON d0.rnk = d1.rnk -1
WHERE d1.val <> d0.val
        ;

This nonrecursive CTE approach is surprisingly fast, although it needs an implicit sort.

Paracelsus answered 7/6, 2014 at 16:10 Comment(2)
If I used MySql this would be the accepted answer :)Trichloromethane
Well I did. On my real case which is little bit more complicated and uses PostGis points and geozones instead of boolean values on ~30000 rows the accepted solution performs better. I'm not a PostgreSql expert but in seems the accepted solution has lower cost: sqlfiddle.com/#!15/962ac/5 sqlfiddle.com/#!15/30044/6 Anyway thank you for your version, i't much better than mine.Trichloromethane
A
1

Using 2 row_number() computations: This is also possible to do with usual "islands and gaps" SQL technique (could be useful if you can't use lag() window function for some reason:

with cte1 as (
    select
        *,
        row_number() over(order by id) as rn1,
        row_number() over(partition by val order by id) as rn2
    from p
)
select *, rn1 - rn2 as g
from cte1
order by id

So this query will give you all islands

ID VAL RN1 RN2  G
1   1   1   1   0
2   0   2   1   1
3   0   3   2   1
4   1   4   2   2
5   1   5   3   2
6   1   6   4   2
7   0   7   3   4

You see, how G field could be used to group this islands together:

with cte1 as ( select *, row_number() over(order by id) as rn1, row_number() over(partition by val order by id) as rn2 from p ) select min(id) as id, val from cte1 group by val, rn1 - rn2 order by 1

So you'll get

ID VAL
1   1
2   0
4   1
7   0

The only thing now is you have to remove first record which can be done by getting min(...) over() window function:

with cte1 as (
   ...
), cte2 as (
    select
        min(id) as id,
        val,
        min(min(id)) over() as mid
    from cte1
    group by val, rn1 - rn2
)
select id, val
from cte2
where id <> mid

And results:

ID VAL
2   0
4   1
7   0
Alveolate answered 27/2, 2015 at 9:19 Comment(0)
F
0

A simple inner join can do it. SQL Fiddle

select p2.id, p2.val
from
    p p1
    inner join
    p p2 on p2.id = p1.id + 1
where p2.val != p1.val
Fernanda answered 7/6, 2014 at 18:41 Comment(3)
While the solution is correct, IDs can go not one by one in real life.Trichloromethane
@Trichloromethane Your sample data should reflect the conditions in your environment.Fernanda
For my part, I always assume there are id numbers missing from the real data, regardless of sample data provided. Heck, @Nailgun, I also assume that the values aren't even indicative of the actual insert order, much less the business-relevant order! Ids really only have value due to their links, and the fact that they should be unique in their source tables. Any other use is attempting to assign meaning where none exists.Sidky

© 2022 - 2024 — McMap. All rights reserved.