(self) join by time intervals
Asked Answered
N

1

7

I have a table in an oracle database. The schema is

create table PERIODS
( 
  ID NUMBER, 
  STARTTIME TIMESTAMP, 
  ENDTIME TIMESTAMP, 
  TYPE VARCHAR2(100)
)

I have two different TYPE's: TYPEA and TYPEB. The have independent start and end times and they can overlap. What I would like to find are the periods of TYPEB that started, are fully contained or ended within a given period of TYPEA.

Here is what I came up with so far (with some sample data)

WITH mydata 
     AS (SELECT 100                                                    ID, 
                To_timestamp('2015-08-01 11:00', 'YYYY-MM-DD HH24:MI') STARTTIME, 
                To_timestamp('2015-08-01 11:20', 'YYYY-MM-DD HH24:MI') ENDTIME, 
                'TYPEA'                                                TYPE 
         FROM   dual 
         UNION ALL 
         SELECT 110                                                    ID, 
                To_timestamp('2015-08-01 11:30', 'YYYY-MM-DD HH24:MI') STARTTIME, 
                To_timestamp('2015-08-01 11:50', 'YYYY-MM-DD HH24:MI') ENDTIME, 
                'TYPEA'                                                TYPE 
         FROM   dual 
         UNION ALL 
         SELECT 120                                                    ID, 
                To_timestamp('2015-08-01 12:00', 'YYYY-MM-DD HH24:MI') STARTTIME, 
                To_timestamp('2015-08-01 12:20', 'YYYY-MM-DD HH24:MI') ENDTIME, 
                'TYPEA'                                                TYPE 
         FROM   dual 
         UNION ALL 
         SELECT 105                                                    ID, 
                To_timestamp('2015-08-01 10:55', 'YYYY-MM-DD HH24:MI') STARTTIME, 
                To_timestamp('2015-08-01 11:05', 'YYYY-MM-DD HH24:MI') ENDTIME, 
                'TYPEB'                                                TYPE 
         FROM   dual 
         UNION ALL 
         SELECT 108                                                    ID, 
                To_timestamp('2015-08-01 11:05', 'YYYY-MM-DD HH24:MI') STARTTIME, 
                To_timestamp('2015-08-01 11:15', 'YYYY-MM-DD HH24:MI') ENDTIME, 
                'TYPEB'                                                TYPE 
         FROM   dual 
         UNION ALL 
         SELECT 111                                                    ID, 
                To_timestamp('2015-08-01 11:15', 'YYYY-MM-DD HH24:MI') STARTTIME, 
                To_timestamp('2015-08-01 12:25', 'YYYY-MM-DD HH24:MI') ENDTIME, 
                'TYPEB'                                                TYPE 
         FROM   dual), 
     typeas 
     AS (SELECT starttime, 
                endtime 
         FROM   mydata 
         WHERE  TYPE = 'TYPEA'), 
     typebs 
     AS (SELECT id, 
                starttime, 
                endtime 
         FROM   mydata 
         WHERE  TYPE = 'TYPEB') 
SELECT id 
FROM   typebs b 
       join typeas a 
         ON ( b.starttime BETWEEN a.starttime AND a.endtime ) 
             OR ( b.starttime BETWEEN a.starttime AND a.endtime 
                  AND b.endtime BETWEEN a.starttime AND a.endtime ) 
             OR ( b.endtime BETWEEN a.starttime AND a.endtime ) 
ORDER  BY id; 

This seems to work in principle, the result from the query above is

        ID
----------
       105
       108
       111

so it selects the three periods TYPEB that started or ended inside the first TYPEA period.

The problem is that the table has about 200k entries and already at this size the above query is quite slow --- which is very surprising to me as the number of both TYPEA and TYPEB entries is quite low ( 1-2k )

Is there a more efficient way to perform this type of self join? Did I miss something else in my query?

Niacin answered 1/8, 2015 at 20:27 Comment(8)
The table has 200,000 rows, there are only two different type values, but each type only has 1,000-2,000 rows. Does that imply that ~196,000 rows in the table have a NULL for their type? If you're asking a tuning question, can you post the query plan you're getting and what indexes are available?Nomadic
The table has 200k rows, most of which have a type that is not of relevance for me for this query.Niacin
@JustinCave this might be a tuning question, but maybe I am simply oblivious of some other feature that would make this query much quicker without tuning.Niacin
don't see anything wrong in what you are already doing. If about 2k rows have either typeA or typeB, you are already filtering for them in your cte's. So at anytime your query would only scan ~2k rows. But in the join conditions what you are doing is (A union B union A intersection B), which is equivalent to A union B (conditions being referred to as A,B). Try taking that out and see if you get any performance gain.Tolerance
WIthout knowing the query plan and the indexes that you have, it's hard to guess. As a first step, I'd want to make sure that there was an index on (type,starttime,endtime) and that the statistics on type were accurate (which might require a histogram depending on how many different type values they really are and whether typeA and typeB are more or less common than other type values).Nomadic
you can remove one condition from WHERE clause as it is redundant, see sqlfiddle.com/#!4/4d32c/12Synchronize
Don't you also need the id of the period it was contained in? If not the query could be rewritten.Fermentative
can you post the answer for select type, count(*) from periods group by type;?Hotheaded
F
1

Maybe worth a try (also you need to write the most restricting conditions in the end in oracle, don't ask me why or believe me, better do your own performance tests):

SELECT
   p.id
FROM
   periods p
WHERE
   EXISTS(SELECT * FROM periods q WHERE
      (p.startTime BETWEEN q.startTime AND q.endTime
      OR p.endTime BETWEEN q.startTime AND q.endTime
      OR p.startTime < q.startTime AND p.endTime > q.endTime -- overlapping correction, remove if not needed
      ) AND q.type = 'TYPEA'
   ) AND p.type = 'TYPEB'
ORDER BY
   p.id
;
Fermentative answered 1/8, 2015 at 22:34 Comment(2)
AND q.type = 'TYPEA' should be inside subquerySynchronize
It is inside, maybe the braces look odd, but I emphasize more on indentation.Fermentative

© 2022 - 2024 — McMap. All rights reserved.