Merging intervals in one pass in SQL

Let's say I have a table with two columns: start and end, both integers, and the table is ordered by the first, then second column. Each row represents an interval.

What I need is the table of merged intervals: all overlapping or adjacent intervals gobbled up into one.

It can be constructed with a JOIN query, but that is quadratic in the number of rows, which is 4 million rows in my case (I decided to compose this question because the query is still running).

It can also be done in a single pass, by running through each row and keeping track of the maximum end time - but how to do that, or something equivalent, in standard SQL? Is there any O(n) way to do it in SQL? I'm using SQLite right now; a SQLite-specific solution would also help me out this time.
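The single-pass idea is easy to express outside SQL; a minimal sketch in Python, assuming the rows arrive sorted by start, then end, as the question states:

```python
def merge_intervals(rows):
    """Merge overlapping or adjacent [start, end] intervals.

    Assumes rows are already sorted by (start, end).
    Runs in O(n): one pass, tracking the maximum end seen so far.
    """
    merged = []
    for start, end in rows:
        # Adjacent counts as mergeable: start == current maximum end.
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

print(merge_intervals([(1, 9), (5, 8), (8, 11), (11, 13), (17, 25)]))
# [(1, 13), (17, 25)]
```

The question is whether this running-maximum loop can be pushed down into the SQL engine itself.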

From the answers to related questions (1, 2, 3, 4, 5, 6, 7, 8, 9) I can't tell whether it's possible.

Can you?

Homework answered 9/12, 2011 at 21:30 Comment(6)
I can think of ways to accomplish this using Common Table Expressions or recursive queries, but SQLite doesn't support those features. PostgreSQL does though :) – Ozone
Does speed trump all else? Would temp tables or something be OK for the sake of speed? – Underclay
What is the minimum possible "start" and the maximum possible "end"? Or is there no limit at all in your case? Is there a known limit for these values (even if not actually used in the intervals in the table)? – Marable
Using temp tables is fine. The first start and last end are about 4 million units apart (coincidentally), while the largest difference within the same row is usually 1 or 2, but peaks at 1000. – Homework
SQLite lets you create user-defined functions in your host programming language. Since you can create aggregate functions, you could pull it off in a single pass. But I'm not sure how much better this is than just fetching all the data and using a loop in your host language, considering SQLite is embedded. – Symphonious
@chris: I have looked at this, but it's not a regular aggregate function, in that the problem is in determining the groups, not in calculating aggregate values for given groups. – Homework

Well, here is a solution that works in MySQL (I don't know if it will work in SQLite). I think, but cannot prove, that it is O(n), not counting the time it takes to sort the events table initially (i.e. assuming it is already sorted, as I think the question states).

> SELECT * from events;
+-------+-----+
| start | end |
+-------+-----+
|     1 |   9 |
|     5 |   8 |
|     8 |  11 |
|    11 |  13 |
|    17 |  25 |
|    18 |  26 |
|    33 |  42 |
|    59 |  81 |
|    61 |  87 |
|    97 | 132 |
|   105 | 191 |
|   107 | 240 |
|   198 | 213 |
|   202 | 215 |
+-------+-----+
14 rows in set (0.00 sec)


SET @interval_id = 0;
SET @interval_end = 0;

SELECT
  MIN(start) AS start,
  MAX(end) AS end
  FROM
    (SELECT
       @interval_id := IF(start > @interval_end,
                          @interval_id + 1,
                          @interval_id) AS interval_id,
       @interval_end := IF(start < @interval_end,
                           GREATEST(@interval_end, end),
                           end) AS interval_end,
       events.*
     FROM events
     ORDER BY start,end) tmp
  GROUP BY interval_id;

+-------+------+
| start | end  |
+-------+------+
|     1 |   13 |
|    17 |   26 |
|    33 |   42 |
|    59 |   87 |
|    97 |  240 |
+-------+------+
5 rows in set (0.00 sec)
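The same running-maximum trick can be expressed in SQLite 3.25+ using window functions (which did not exist when the question was asked). A sketch, assuming the `events` table above; note that `end` is a reserved word in SQLite and must be quoted:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE events (start INTEGER, "end" INTEGER)')
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 9), (5, 8), (8, 11), (11, 13), (17, 25), (18, 26),
                 (33, 42), (59, 81), (61, 87), (97, 132), (105, 191),
                 (107, 240), (198, 213), (202, 215)])

# A row starts a new group when its start exceeds the maximum "end" of
# all earlier rows; a running sum of that flag numbers the groups, and
# the first row (empty window frame, NULL max) starts group 1.
rows = con.execute("""
    SELECT MIN(start), MAX("end")
    FROM (SELECT start, "end",
                 SUM(new_grp) OVER (ORDER BY start, "end") AS grp
          FROM (SELECT start, "end",
                       COALESCE(start > MAX("end") OVER (
                           ORDER BY start, "end"
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND 1 PRECEDING), 1) AS new_grp
                FROM events))
    GROUP BY grp
    ORDER BY 1
""").fetchall()
print(rows)  # [(1, 13), (17, 26), (33, 42), (59, 87), (97, 240)]
```

Like the MySQL version, this is a single ordered scan once the index or sort is in place.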
Compony answered 25/1, 2012 at 20:54 Comment(1)
Yes, it is sorted. I had no idea you could do this in MySQL. – Homework

In your links you have omitted one: Can I use a SQL Server CTE to merge intersecting dates?, where I present a RECURSIVE CTE solution to the overlapping intervals problem. Recursive CTEs can be handled differently (compared to ordinary self-joins), and often perform amazingly fast.

MySQL does not have recursive CTEs. Postgres has them, Oracle has them, Microsoft has them.

Here Querying for a 'run' of consecutive columns in Postgres is another one, with a fudge-factor.

Here Get total time interval from multiple rows if sequence not broken is yet another one.
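SQLite has since gained WITH RECURSIVE (version 3.8.3, 2014), so the recursive approach is now available there too. A sketch against the question's schema: each interval is grown rightward through every overlapping or adjacent row, then only the starts not already covered by another merged interval are kept. This is correct but not O(n), since the closure can grow quadratically:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE events (start INTEGER, "end" INTEGER)')
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 9), (5, 8), (8, 11), (11, 13), (17, 25), (18, 26)])

rows = con.execute("""
    WITH RECURSIVE merged(s, e) AS (
        SELECT start, "end" FROM events
        UNION
        -- extend an interval through any row that overlaps or touches it
        SELECT m.s, x."end"
        FROM merged m
        JOIN events x ON x.start <= m.e AND x."end" > m.e
    )
    SELECT s, MAX(e)
    FROM merged m
    -- drop starts that lie inside an earlier merged interval
    WHERE NOT EXISTS (SELECT 1 FROM merged m2
                      WHERE m2.s < m.s AND m2.e >= m.s)
    GROUP BY s
    ORDER BY s
""").fetchall()
print(rows)  # [(1, 13), (17, 26)]
```

The UNION (rather than UNION ALL) deduplicates the closure, and the strict `x."end" > m.e` condition guarantees termination.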

Cordilleras answered 10/12, 2011 at 20:32 Comment(4)
I had no idea - thanks. (Unfortunately, SQLite doesn't support them: #7457457) – Homework
Is this more efficient than A.J.'s approach, and if it is, why? – Homework
A.J.'s approach does not use a first KEY component, only min(start) by end. Normally this would be a "select id, min(ddate)" (or max(ddate)) GROUP BY id. @reinierpost: it is not only more efficient, it is also correct ;-) – Cordilleras
The approach I mentioned is not complete at all to begin with. It is just the first step, meant to minimize the rows that form the starting point for a merging solution. The approach is targeted towards databases where the concepts mentioned in your solution are not available, as in the current case where SQLite is used. – Marable

Based on the answer to my question in the comments, I don't think my idea would have worked. Since you mentioned it can be done with joins (and I assume you know how), I had an idea for minimizing the number of rows to be joined by keeping only ranges that belong to distinct points, like the following:

select start, max("end") as "end"
from (
      select min(start) as start, "end"
      from intervals  -- "intervals" stands in for your table; TABLE and END are reserved words
      group by "end"
     ) in_tab
group by in_tab.start

The inner select above makes sure that no end point repeats, keeping the smallest start (the longest range) for each end; the outer select does the opposite. We end up with ranges that start and end at distinct points, with any fully contained/overlapped range removed. This might have worked if the maximum range were not big: if these were dates with at most a year between the lowest and highest date in the whole table, then there would be roughly 365*364 ways to pick any two points, an upper bound on the number of rows surviving the above select. Those rows could then be stored in a temp table and fed to the join method that you already have. But with the numbers you mentioned, the theoretical bound is so huge that this attempt becomes irrelevant: even though the above minimizes the rows used in the calculation, there would still be too many to join.
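A quick check of the reduction on a small table containing fully dominated intervals (table name hypothetical; `end` is reserved in SQLite and needs quoting):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE intervals (start INTEGER, "end" INTEGER)')
# (2, 10) and (1, 5) are fully contained in the span of (1, 10).
con.executemany("INSERT INTO intervals VALUES (?, ?)",
                [(1, 10), (2, 10), (1, 5), (12, 20)])

# Inner select: smallest start per distinct end.
# Outer select: largest end per distinct start.
rows = con.execute("""
    SELECT start, MAX("end") AS "end"
    FROM (SELECT MIN(start) AS start, "end"
          FROM intervals
          GROUP BY "end") in_tab
    GROUP BY in_tab.start
    ORDER BY start
""").fetchall()
print(rows)  # [(1, 10), (12, 20)]
```

Dominated rows disappear, but partially overlapping rows survive, which is why the join step is still needed afterward.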

I do not know of a way to do this in ANSI SQL without joins when the RDBMS provides no other, non-standard functionality. In Oracle, for example, this can easily be achieved with analytic functions. The best option in this case would be to use the above to minimize the number of rows, bring them into your application, compute the merged ranges there, and insert them back into the database.

Marable answered 10/12, 2011 at 20:17 Comment(4)
I've thought of your approach and I think it will be more efficient than the join (the number of ends per start, or starts per end, is very much smaller than the total number of rows). I may need to add indexes. So it's sort of halfway there. – Homework
The problem is that just taking the MIN and MAX once isn't good enough. Intervals (1,4), (2,7), (5,9) are to be merged, even though the MAX(end) for start 2 is 7 and the MIN(start) for end 7 is 2. – Homework
You are right, but I mentioned in my text that the select is only meant to minimize the rows to be worked on. That means that after the above query you still need to do the join for the final merge (in your original text you mentioned you can do it with a join). All I wanted was to minimize the number of rows you use in the joins, so what I did was the first step (I wrote that the results can be stored in a temp table and your method then applied to them). – Marable
Yes, I thought of that idea too, but after browsing the data I determined that it wouldn't make much of a difference in my case. – Homework

For now, the best answer I've found is: use indexing. This brings the complexity down from quadratic to O(n log n).

With a covering index, the queries turned out to be fast enough for my needs; with just an index on either the start or end column, it was slower but still OK. In each case, EXPLAIN QUERY PLAN told me that a single table scan is combined with use of the index, as expected.

Finding an element in the index isn't quite O(1), but turned out to be close enough. And building the index isn't slow, either.

What remains is the proof that a true O(n) algorithm can't be written in SQL.

So another answer is to write it in a different language and then apply it to a SQLite table. There are various ways to make that work:

  • export the table to a CSV file; read the CSV file, apply the algorithm, produce CSV; import the resulting CSV file as a table;
  • use a SQLite driver for that language (e.g. DBD::SQLite for Perl, RSQLite for R)
  • write a SQLite extension function that somehow interfaces with the language of choice
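The driver route is straightforward with Python's built-in sqlite3 module; a sketch of the full round trip (table and column names assumed from the question, with "end" quoted because it is reserved in SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # or the path to the real database file
con.execute('CREATE TABLE events (start INTEGER, "end" INTEGER)')
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 9), (5, 8), (8, 11), (11, 13), (17, 25)])

# One pass over the sorted rows, tracking the maximum end seen so far.
merged = []
for start, end in con.execute('SELECT start, "end" FROM events '
                              'ORDER BY start, "end"'):
    if merged and start <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], end)
    else:
        merged.append([start, end])

# Write the merged intervals back as a new table.
con.execute('CREATE TABLE merged (start INTEGER, "end" INTEGER)')
con.executemany("INSERT INTO merged VALUES (?, ?)", merged)
print(con.execute("SELECT * FROM merged ORDER BY start").fetchall())
# [(1, 13), (17, 25)]
```

With the ORDER BY served by an index on (start, "end"), the whole round trip is one scan plus one bulk insert.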
Homework answered 20/12, 2011 at 9:16 Comment(0)
