flatten list of ranges to single result range set
Asked Answered
I

4

6

I am trying to "flatten" a list of ranges in a defined order (alphabetically by name in the examples provided) to a single merged result. Newer Ranges overwrite values of older ranges. Conceptually it looks like this, with "e" being the newest range:

0   1   2   3   4   5   6   7

|-------------a-------------|
        |---b---|            
    |---c---|                
                |---d---|    
            |---e---|        

|-a-|---c---|---e---|-d-|-a-|  <-- expected result

To prevent further confusion: The expected result here is indeed correct. The values 0 - 7 are just the ranges' values, not a progression in time. I use integers for simplicity here, but the values might not be discrete but continuous.

Note that b is completely overshadowed and not relevant anymore.

the data may be modeled like this in SQL:

create table ranges (
    name varchar(1),
    range_start integer,
    range_end integer
);

insert into ranges (name, range_start, range_end) values ('a', 0, 7);
insert into ranges (name, range_start, range_end) values ('b', 2, 4);
insert into ranges (name, range_start, range_end) values ('c', 1, 3);
insert into ranges (name, range_start, range_end) values ('d', 4, 6);
insert into ranges (name, range_start, range_end) values ('e', 3, 5);
-- assume alphabetical order by name

It would be perfect if there was a way to directly query the result in SQL, e.g. like this:

select *magic* from ranges;
-- result:
+------+-------------+-----------+
| a    |           0 |         1 |
| c    |           1 |         3 |
| e    |           3 |         5 |
| d    |           5 |         6 |
| a    |           6 |         7 |
+------+-------------+-----------+

But I suspect that is not realistically feasible, therefore I need to at least filter out all ranges that are overshadowed by newer ones, as is the case for b in the example above. Otherwise the query would need to transfer more and more irrelevant data as the database grows and new ranges overshadow older ones. For the example above, such a query could return all entries except for b, e.g.:

select *magic* from ranges;
-- result:
+------+-------------+-----------+
| a    |           0 |         7 |
| c    |           1 |         3 |
| d    |           4 |         6 |
| e    |           3 |         5 |
+------+-------------+-----------+

I was unable to construct such a filter in SQL. The only thing I managed to do is query all data and then calculate the result in code, for example in Java using the Google Guava library:

final RangeMap<Integer, String> rangeMap = TreeRangeMap.create();
rangeMap.put(Range.closedOpen(0, 7), "a");
rangeMap.put(Range.closedOpen(2, 4), "b");
rangeMap.put(Range.closedOpen(1, 3), "c");
rangeMap.put(Range.closedOpen(4, 6), "d");
rangeMap.put(Range.closedOpen(3, 5), "e");
System.out.println(rangeMap);
// result: [[0..1)=a, [1..3)=c, [3..5)=e, [5..6)=d, [6..7)=a]

Or by hand in python:

import re
from collections import namedtuple
from typing import Optional, List

Range = namedtuple("Range", ["name", "start", "end"])


def overlap(lhs: Range, rhs: Range) -> Optional[Range]:
    if lhs.end <= rhs.start or rhs.end <= lhs.start:
        return None
    return Range(None, min(lhs.start, rhs.start), max(lhs.end, rhs.end))


def range_from_str(str_repr: str) -> Range:
    name = re.search(r"[a-z]+", str_repr).group(0)
    start = str_repr.index("|") // 4
    end = str_repr.rindex("|") // 4
    return Range(name, start, end)


if __name__ == '__main__':
    ranges: List[Range] = [
        #               0   1   2   3   4   5   6   7
        range_from_str("|-------------a-------------|"),
        range_from_str("        |---b---|            "),
        range_from_str("    |---c---|                "),
        range_from_str("                |---d---|    "),
        range_from_str("            |---e---|        "),
        # result:       |-a-|---c---|---e---|-d-|-a-|
    ]

    result: List[Range] = []
    for range in ranges:
        for i, res in enumerate(result[:]):
            o = overlap(range, res)
            if o:
                result.append(Range(res.name, o.start, range.start))
                result.append(Range(res.name, range.end, o.end))
                result[i] = Range(res.name, 0, 0)
        result.append(range)
    result = sorted(filter(lambda r: r.start < r.end, result), key=lambda r: r.start)
    print(result)
    # result: [Range(name='a', start=0, end=1), Range(name='c', start=1, end=3), Range(name='e', start=3, end=5), Range(name='d', start=5, end=6), Range(name='a', start=6, end=7)]
Intelligence answered 30/9, 2020 at 12:46 Comment(9)
Please tag your question with the database that you are running: mysql, oracle, postgresql...?Liquid
I tagged the question with "oracle" and "postgresql", as answers working in either of those (or even better: in both) are most valuable to me. However, answers using other databases are appreciated tooIntelligence
What is the data structure you expect as answer? Please add that to the question.Whencesoever
The databases I am really using are both PostgreSQL and Oracle. I decided to keep the "oracle" tagIntelligence
The SQL snippet is now syntactically valid for oracle, and I've also added two expected outputs in the form of SQL result ascii tablesIntelligence
@Intelligence . . . I still don't get why "b" is overshadowed. It should be the most recent at time 2.Blenny
I posted pure SQL solution here for the same question here not so long ago (maybe up to 5-6 months ago). But anyway I'll solve it again on your data just for funWatereddown
@GordonLinoff "b" is overshadowed, because the entire range it occupies is also occupied by "newer" ranges, here "c" and "e". I tried to clarify that ordering in the first paragraph a bitIntelligence
@Intelligence . . . "c" is not newer at time "2". "b" is.Blenny
W
3

The following simple query returns all smallest intervals with top name:

with
 all_points(x) as (
   select range_start from ranges
   union 
   select range_end from ranges
 )
,all_ranges(range_start, range_end) as (
   select *
   from (select
           x as range_start, 
           lead(x) over(order by x) as range_end
         from all_points)
   where range_end is not null
)
select *
from all_ranges ar
     cross apply (
     select max(name) as range_name
     from ranges r
     where r.range_end   >= ar.range_end
       and r.range_start <= ar.range_start
     )
order by 1,2;

Results:

RANGE_START  RANGE_END RANGE_NAME
----------- ---------- ----------
          0          1 a
          1          2 c
          2          3 c
          3          4 e
          4          5 e
          5          6 d
          6          7 a

So we need to merge connected intervals with the same names:

Final query without new oracle-specific features

with
 all_points(x) as (
   select range_start from ranges
   union 
   select range_end from ranges
 )
,all_ranges(range_start, range_end) as (
   select *
   from (select
           x as range_start, 
           lead(x) over(order by x) as range_end
         from all_points)
   where range_end is not null
)
select 
   grp,range_name,min(range_start) as range_start,max(range_end) as range_end
from (
   select
      sum(start_grp_flag) over(order by range_start) grp
     ,range_start,range_end,range_name
   from (
      select 
        range_start,range_end,range_name,
        case when range_name = lag(range_name)over(order by range_start) then 0 else 1 end start_grp_flag
      from all_ranges ar
           cross apply (
           select max(name) as range_name
           from ranges r
           where r.range_end   >= ar.range_end
             and r.range_start <= ar.range_start
           )
   )
)
group by grp,range_name
order by 1;

Results:

       GRP RANGE_NAME RANGE_START  RANGE_END
---------- ---------- ----------- ----------
         1 a                    0          1
         2 c                    1          3
         3 e                    3          5
         4 d                    5          6
         5 a                    6          7

Or using actual oracle specific features:

with
 all_ranges(range_start, range_end) as (
   select * from (
      select 
        x as range_start, 
        lead(x) over(order by x) as range_end
      from (
         select distinct x 
         from ranges 
         unpivot (x for r in (range_start,range_end))
      ))
   where range_end is not null
 )
select  *
from all_ranges ar
     cross apply (
     select max(name) as range_name
     from ranges r
     where r.range_end   >= ar.range_end
       and r.range_start <= ar.range_start
     )
match_recognize(
   order by range_start
   measures 
      first(range_start) as r_start,
      last(range_end) as r_end,
      last(range_name) as r_name
   pattern(STRT A*)
   define
     A as prev(range_name)=range_name and prev(range_end) = range_start
);
Watereddown answered 30/9, 2020 at 14:49 Comment(4)
Would this also work with union all instead of union?Intelligence
@Intelligence do you mean union in all_points? Of course, but you will need to add distinct. Why do you need union?Watereddown
I was talking with a DBA regarding implementing this and the question came up whether it could be done using union all for performance reasons. We don't have any hard data yet, but it's good to keep that option in mind for nowIntelligence
@Intelligence Oracle or Postgres? In oracle it would be much better to use the last one query with unpivot instead of union. And there is no reason to use union all, since you need unique range points.Watereddown
O
1

Here is a hierarchical query that would give you the desired output:

WITH ranges(NAME, range_start, range_end) AS 
  (SELECT 'a', 0, 7 FROM dual UNION ALL 
   SELECT 'b', 2, 4 FROM dual UNION ALL 
   SELECT 'c', 1, 3 FROM dual UNION ALL 
   SELECT 'd', 4, 6 FROM dual UNION ALL 
   SELECT 'e', 3, 5 FROM dual UNION ALL 
   SELECT 'f', -3, -2 FROM dual UNION ALL 
   SELECT 'g', 8, 20 FROM dual UNION ALL 
   SELECT 'h', 12, 14 FROM dual)
, rm (NAME, range_start, range_end) AS 
  (SELECT r.*
     FROM (SELECT r.NAME
                       , r.range_start
                       , NVL(r2.range_start, r.range_end) range_end
                    FROM ranges r
                    OUTER apply (SELECT *
                                   FROM ranges
                                  WHERE range_start BETWEEN r.range_start AND r.range_end
                                    AND NAME > r.NAME
                                 ORDER BY range_start, NAME DESC
                                 FETCH FIRST 1 ROWS ONLY) r2
                    ORDER BY r.range_start, r.NAME desc
                    FETCH FIRST 1 ROWS ONLY) r
  UNION ALL
   SELECT r2.NAME
        , r2.range_start
        , r2.range_end
     FROM rm
    CROSS apply (SELECT r.NAME
                      , GREATEST(rm.range_end, r.range_start) range_start
                      , NVL(r2.range_start, r.range_end) range_end
                   FROM ranges r
                   OUTER apply (SELECT *
                                  FROM ranges
                                 WHERE range_start BETWEEN GREATEST(rm.range_end, r.range_start) AND r.range_end
                                   AND NAME > r.NAME
                                ORDER BY range_start, NAME DESC
                                FETCH FIRST 1 ROWS ONLY) r2
                  WHERE r.range_end > rm.range_end
                    AND NOT EXISTS (SELECT 1 FROM ranges r3
                                     WHERE r3.range_end > rm.range_end
                                       AND (GREATEST(rm.range_end, r3.range_start) < GREATEST(rm.range_end, r.range_start)
                                        OR (GREATEST(rm.range_end, r3.range_start) = GREATEST(rm.range_end, r.range_start)
                                        AND r3.NAME > r.NAME))) 
                  FETCH FIRST 1 ROWS ONLY) r2)    
CYCLE NAME, range_start, range_end SET cycle TO 1 DEFAULT 0              
 SELECT * FROM rm

First you get the first entry ordered by range_start desc, name which will give you the most resent entry with the lowest name. Then you search for a range with higher name that intersect with this range. If there is one the range_start of this interval will be the range_end of you final interval.

With this start you search more or less the next entry with the same condition.

Ojeda answered 30/9, 2020 at 14:22 Comment(1)
This is some voodoo magic to me, but it appears to work just fine. Thank you, I will try to develop an understanding for itIntelligence
W
1

There is also another less effective but easier and shorter approach: to generate all points and just aggregate them.

For example this simple query will generate all intermediate points:

select x,max(name)
from ranges,
     xmltable('xs:integer($A) to xs:integer($B)'
       passing range_start as a
              ,range_end as b
       columns x int path '.'
     )
group by x

Results:

         X M
---------- -
         0 a
         1 c
         2 c
         3 e
         4 e
         5 e
         6 d
         7 a

Then we can merge them:

select *
from (
   select x,max(name) name
   from ranges,
        xmltable('xs:integer($A) to xs:integer($B)-1'
          passing range_start as a
                 ,range_end as b
          columns x int path '.'
        )
   group by x
   order by 1
)
match_recognize(
   order by x
   measures 
      first(x) as r_start,
      last(x)+1 as r_end,
      last(name) as r_name
   pattern(STRT A*)
   define
     A as prev(name)=name and prev(x)+1 = x
);

Results:

   R_START      R_END R
---------- ---------- -
         0          1 a
         1          3 c
         3          5 e
         5          6 d
         6          7 a
Watereddown answered 30/9, 2020 at 20:50 Comment(2)
Thank you, this is a nice solution, if the range of values are discrete. I used discrete values in my example for simplicity, but in the actual code I am dealing with continuous values.Intelligence
@Intelligence ok, that is very important thing and you should have been added it into original question. Anyway, in this case you can just use another my solution aboveWatereddown
B
0

I don't understand your results -- as I've explained in the comments. The "b" should be present, because it is most recent at time 2.

That said, the idea is to unpivot the times and figure out the most recent name at each time -- both beginnings and ends. Then, combine these using gaps-and-islands ideas. This is what the query looks like:

with r as (
      select name, range_start as t
      from ranges
      union all
      select null, range_end as t
      from ranges
     ),
     r2 as (
      select r.*,
             (select r2.name
              from ranges r2
              where r2.range_start <= r.t and
                    r2.range_end >= r.t
              order by r2.range_start desc
              fetch first 1 row only
             ) as imputed_name
      from (select distinct t
            from r
           ) r
     )
select imputed_name, t,
       lead(t) over (order by t)
from (select r2.*,
             lag(imputed_name) over ( order by t) as prev_imputed_name
      from r2
     ) r2
where prev_imputed_name is null or prev_imputed_name <> imputed_name;

Here is a db<>fiddle.

Basically the same code should run in Postgres as well.

Blenny answered 30/9, 2020 at 13:24 Comment(3)
thanks for your solution. Unfortunately, you seem to be misunderstanding the question. The x-axis is not a progression in time, it's just the range values. If you want to think of these ranges in terms of "newer" and "olders", then that would be the y-axis with "e" being the most recent entryIntelligence
@Intelligence . . . In that case, your question is unclear. SQL tables represent unordered sets and you have no ordering that specifies what you are really trying to do.Blenny
I specified that in the example I provided the entries are ordered alphabetically by nameIntelligence

© 2022 - 2024 — McMap. All rights reserved.