Get total time interval from multiple rows if sequence not broken

Asked 30/8, 2011 at 13:40 Answered 30/8, 2011 at 14:50

Solved sql postgresql datetime gaps-and-islands

I have Work and Person tables (these are just examples to understand problem).

Structure

`Work` table

id INTEGER
person_id INTEGER
dt_from DATETIME
dt_to DATETIME

`Person` table

person_id INTEGER
name VARCHAR(50)

Data

`Work` table

id | person_id | dt_from    | dt_to
-------------------------------------------------
1  | 1         | 2011-01-01 | 2011-02-02
2  | 1         | 2011-02-02 | 2011-04-04
3  | 1         | 2011-06-06 | 2011-09-09
4  | 2         | 2011-01-01 | 2011-02-02
5  | 2         | 2011-02-02 | 2011-03-03
....etc.

`Person` table

Just person names with person id

Expected output

Person 1 : 2011-01-01 - 2011-04-04
Person 1 : 2011-06-06 - 2011-09-09
Person 2 : 2011-01-01 - 2011-03-03

Interval must be in sequence. It cannot be broken somewhere in middle. Thats why Person 1 have two intervals.

I'm using postgres if it changes something. Have you any thougths? I wanted do it in one query, but if there is no such solution i will do some interval merge in php.

Churrigueresque answered 30/8, 2011 at 13:40 Comment(1)

I've deleted the MySQL tag as you say you are using postgres and added the gaps-and-islands tag. You might find your answer there. – Stereoisomerism 30/8, 2011 at 14:2

There may be a way to do this in one SQL select, but it escapes me. I managed to do it with one stored function though. Here's what I did for my testing:

create table work
(id integer, start_date date, end_date date);

insert into work values (1, '2011-01-01','2011-02-02');
insert into work values (1, '2011-02-02','2011-04-04');
insert into work values (1, '2011-06-06','2011-09-09');
insert into work values (2, '2011-01-01','2011-02-02');
insert into work values (2, '2011-02-02','2011-03-03');

create or replace function get_data() returns setof work as
$body$
declare
    res work%rowtype;
    sd  date := null;
begin
    for res in
        select
            w1.id,
            case when exists (select 1 from work w2 where w1.id=w2.id and w2.end_date=w1.start_date) then null else w1.start_date end,
            case when exists (select 1 from work w2 where w1.id=w2.id and w2.start_date=w1.end_date) then null else w1.end_date end
        from
            work w1
        order by
            id, start_date, end_date
    loop
        if res.start_date is not null and res.end_date is not null then
            return next res;
        elsif res.start_date is not null then
            sd := res.start_date;
        elsif res.end_date is not null then
            res.start_date := sd;
            return next res;
        end if;
    end loop;

    return;
end;$body$
  language 'plpgsql';

Then

select * from get_data() order by id, start_date;

returned this result:

 id | start_date |  end_date
----+------------+------------
  1 | 2011-01-01 | 2011-04-04
  1 | 2011-06-06 | 2011-09-09
  2 | 2011-01-01 | 2011-03-03
(3 rows)

which is, I think, what you're after.

Decadence answered 30/8, 2011 at 14:35 Comment(1)

Thanks! i think i will go with this solution! Just need to rewrite some parts to integrate in my db structure ;) Thanks again! – Churrigueresque 31/8, 2011 at 7:30

You could try postgres's WITH RECURSIVE construct. (after all, a linked list is a kind of tree) Getting the boundary conditions right will be a problem, but at least it would solve the problem without the need for loops.

UPDATE: Added code. The problem with RECURSIVE is that you can only specify only the "tail" boundary condition. To specify the "head" condition, you need to wrap it into a view.

CREATE VIEW collected_time AS (
    WITH RECURSIVE ztree(person_id, dt_from, dt_to)  AS ( 
    -- Terminal part
    SELECT  pr.person_id, pr.dt_from, pr.dt_to
    FROM prikklok pr
    WHERE NOT EXISTS (
            SELECT * FROM prikklok px
            WHERE px.person_id = pr.person_id AND px.dt_from = pr.dt_to
            )
    UNION
    -- Recursive part
    SELECT  p1.person_id AS person_id
    , p1.dt_from AS dt_from
    , p2.dt_to AS dt_to
    FROM prikklok AS p1
    , ztree AS p2
    WHERE p1.person_id = p2.person_id
    AND p1.dt_to = p2.dt_from
    )
SELECT *
FROM ztree zt
WHERE NOT EXISTS (select *
    FROM prikklok p3
    WHERE p3.person_id = zt.person_id
        AND p3.dt_to = zt.dt_from
            )
    );

SELECT * FROM collected_time;

-- now generate some data with gaps

INSERT INTO prikklok
    SELECT serie_n
    , serie_t
    , serie_t + '1 month'::interval
FROM generate_series (1,10) serie_n
    , generate_series ( '1970-01-01 00:00:00' , '2011-09-01 00:00:00' , '1 month' ::interval) serie_t
    ;

DELETE FROM prikklok
WHERE random() <  0.001
    ;

-- a few indexes won't hurt
ALTER TABLE prikklok ADD PRIMARY KEY (person_id,dt_from)
    ;
CREATE UNIQUE INDEX ON prikklok (person_id,dt_to);

The resulting query plan looks perfect:

                                                                              QUERY PLAN                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=1389.73..1469.09 rows=1 width=20) (actual time=13.580..40.920 rows=16 loops=1)
   CTE ztree
     ->  Recursive Union  (cost=0.00..1389.73 rows=11 width=20) (actual time=0.136..27.405 rows=5004 loops=1)
           ->  Merge Anti Join  (cost=0.00..638.92 rows=1 width=20) (actual time=0.130..10.011 rows=16 loops=1)
                 Merge Cond: ((pr.person_id = px.person_id) AND (pr.dt_to = px.dt_from))
                 ->  Index Scan using prikklok_person_id_dt_to_idx on prikklok pr  (cost=0.00..291.31 rows=5004 width=20) (actual time=0.063..2.273 rows=5004 loops=1)
                 ->  Index Scan using prikklok_pkey on prikklok px  (cost=0.00..291.31 rows=5004 width=12) (actual time=0.012..2.204 rows=5004 loops=1)
           ->  Nested Loop  (cost=0.00..75.06 rows=1 width=20) (actual time=0.002..0.027 rows=10 loops=501)
                 ->  WorkTable Scan on ztree p2  (cost=0.00..0.20 rows=10 width=20) (actual time=0.000..0.001 rows=10 loops=501)
                 ->  Index Scan using prikklok_person_id_dt_to_idx on prikklok p1  (cost=0.00..7.47 rows=1 width=20) (actual time=0.002..0.002 rows=1 loops=5004)
                       Index Cond: ((p1.person_id = p2.person_id) AND (p1.dt_to = p2.dt_from))
   ->  CTE Scan on ztree zt  (cost=0.00..0.22 rows=11 width=20) (actual time=0.138..29.887 rows=5004 loops=1)
   ->  Index Scan using prikklok_person_id_dt_to_idx on prikklok p3  (cost=0.00..7.18 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=5004)
         Index Cond: ((p3.person_id = zt.person_id) AND (p3.dt_to = zt.dt_from))
 Total runtime: 41.354 ms
(15 rows)

Cabbagehead answered 30/8, 2011 at 14:50 Comment(2)

i have tryed to do this task with recursive, but somehow cant figure out this one. This recursive function is little bit confusing for me :)) simple examples work fines. but switching to this problem its nightmare ;) thanks for reply – Churrigueresque 31/8, 2011 at 7:29

The RECURSIVE method will break if you need to restrict the date-intervals to a narrower window. (but it can be adapted of course) Note: above timing is for N=5000 records. – Cabbagehead 2/9, 2011 at 13:45

Structure

`Work` table

`Person` table

Data

`Work` table

`Person` table

Expected output

Recommended topics

Hot tags

Structure

Work table

Person table

Data

Work table

Person table

Expected output

Recommended topics

Hot tags

`Work` table

`Person` table

`Work` table

`Person` table