How to divide a set of overlapping ranges into non-overlapping ranges?
Asked Answered
A

6

18

Let's say you have a set of ranges:

  • 0 - 100: 'a'
  • 0 - 75: 'b'
  • 95 - 150: 'c'
  • 120 - 130: 'd'

Obviously, these ranges overlap at certain points. How would you dissect these ranges to produce a list of non-overlapping ranges, while retaining information associated with their original range (in this case, the letter after the range)?

For example, the results of the above after running the algorithm would be:

  • 0 - 75: 'a', 'b'
  • 76 - 94: 'a'
  • 95 - 100: 'a', 'c'
  • 101 - 119: 'c'
  • 120 - 130: 'c', 'd'
  • 131 - 150: 'c'
Ahron answered 10/3, 2009 at 3:21 Comment(1)
This is a duplicate of #150077 (retaining the information is trivial)Erastes
D
16

I had the same question when writing a program to mix (partly overlapping) audio samples.

What I did was add an "start event" and "stop event" (for each item) to a list, sort the list by time point, and then process it in order. You could do the same, except using an integer point instead of a time, and instead of mixing sounds you'd be adding symbols to the set corresponding to a range. Whether you'd generate empty ranges or just omit them would be optional.

Edit Perhaps some code...

# input = list of (start, stop, symbol) tuples
points = [] # list of (offset, plus/minus, symbol) tuples
for start,stop,symbol in input:
    points.append((start,'+',symbol))
    points.append((stop,'-',symbol))
points.sort()

ranges = [] # output list of (start, stop, symbol_set) tuples
current_set = set()
last_start = None
for offset,pm,symbol in points:
    if pm == '+':
         if last_start is not None:
             #TODO avoid outputting empty or trivial ranges
             ranges.append((last_start,offset-1,current_set))
         current_set.add(symbol)
         last_start = offset
    elif pm == '-':
         # Getting a minus without a last_start is unpossible here, so not handled
         ranges.append((last_start,offset-1,current_set))
         current_set.remove(symbol)
         last_start = offset

# Finish off
if last_start is not None:
    ranges.append((last_start,offset-1,current_set))

Totally untested, obviously.

Diffidence answered 10/3, 2009 at 3:32 Comment(3)
That was absolutely perfect, thank you very much! Had to fix one thing, however: on the ranges.append line, current_set needs to be current_set.copy(), otherwise you end up just adding a reference to current_set for each one (therefore ending up with an empty set for each range at the end). Thanks!Ahron
Something to keep in mind is that this will work well when the range tags are unique, but won't properly handle duplicate tags in overlapping ranges (i.e. 0-10 = A, 5-10 = A, 10-20 = B). Otherwise, I'm very happy to see that somebody else had the same idea as me, and I'm not totally insane. Thanks!Onceover
In adapting this to another purpose, I replaced current_set.remove with current_set.discard to avoid an edge case where the symbol wasn't already in the set. Would love to hear if that's either wrong (for some reason) or a good change.Somber
A
8

A similar answer to Edmunds, tested, including support for intervals like (1,1):

class MultiSet(object):
    def __init__(self, intervals):
        self.intervals = intervals
        self.events = None

    def split_ranges(self):
        self.events = []
        for start, stop, symbol in self.intervals:
            self.events.append((start, True, stop, symbol))
            self.events.append((stop, False, start, symbol))

        def event_key(event):
            key_endpoint, key_is_start, key_other, _ = event
            key_order = 0 if key_is_start else 1
            return key_endpoint, key_order, key_other

        self.events.sort(key=event_key)

        current_set = set()
        ranges = []
        current_start = -1

        for endpoint, is_start, other, symbol in self.events:
            if is_start:
                if current_start != -1 and endpoint != current_start and \
                       endpoint - 1 >= current_start and current_set:
                    ranges.append((current_start, endpoint - 1, current_set.copy()))
                current_start = endpoint
                current_set.add(symbol)
            else:
                if current_start != -1 and endpoint >= current_start and current_set:
                    ranges.append((current_start, endpoint, current_set.copy()))
                current_set.remove(symbol)
                current_start = endpoint + 1

        return ranges


if __name__ == '__main__':
    intervals = [
        (0, 100, 'a'), (0, 75, 'b'), (75, 80, 'd'), (95, 150, 'c'), 
        (120, 130, 'd'), (160, 175, 'e'), (165, 180, 'a')
    ]
    multiset = MultiSet(intervals)
    pprint.pprint(multiset.split_ranges())


[(0, 74, {'b', 'a'}),
 (75, 75, {'d', 'b', 'a'}),
 (76, 80, {'d', 'a'}),
 (81, 94, {'a'}),
 (95, 100, {'c', 'a'}),
 (101, 119, {'c'}),
 (120, 130, {'d', 'c'}),
 (131, 150, {'c'}),
 (160, 164, {'e'}),
 (165, 175, {'e', 'a'}),
 (176, 180, {'a'})]
Alesiaalessandra answered 18/10, 2018 at 2:50 Comment(0)
C
1

I'd say create a list of the endpoints and sort it, also index the list of ranges by starting and ending points. Then iterate through the list of sorted endpoints, and for each one, check the ranges to see which ones are starting/stopping at that point.

This is probably better represented in code... if your ranges are represented by tuples:

ranges = [(0,100,'a'),(0,75,'b'),(95,150,'c'),(120,130,'d')]
endpoints = sorted(list(set([r[0] for r in ranges] + [r[1] for r in ranges])))
start = {}
end = {}
for e in endpoints:
    start[e] = set()
    end[e] = set()
for r in ranges:
    start[r[0]].add(r[2])
    end[r[1]].add(r[2])
current_ranges = set()
for e1, e2 in zip(endpoints[:-1], endpoints[1:]):
    current_ranges.difference_update(end[e1])
    current_ranges.update(start[e1])
    print '%d - %d: %s' % (e1, e2, ','.join(current_ranges))

Although looking at this in retrospect, I'd be surprised if there wasn't a more efficient (or at least cleaner-looking) way to do it.

Coss answered 10/3, 2009 at 3:25 Comment(0)
R
1

What you describe is an example of set theory. For a general algorithm for computing unions, intersections, and differences of sets see:

www.gvu.gatech.edu/~jarek/graphics/papers/04PolygonBooleansMargalit.pdf

While the paper is targeted at graphics it is applicable to general set theory as well. Not exactly light reading material.

Re answered 10/3, 2009 at 4:7 Comment(0)
C
0

Pseudocode:

unusedRanges = [ (each of your ranges) ]
rangesInUse = []
usedRanges = []
beginningBoundary = nil

boundaries = [ list of all your ranges' start and end values, sorted ]
resultRanges = []

for (boundary in boundaries) {
    rangesStarting = []
    rangesEnding = []

    // determine which ranges begin at this boundary
    for (range in unusedRanges) {
        if (range.begin == boundary) {
            rangesStarting.add(range)
        }
    }

    // if there are any new ones, start a new range
    if (rangesStarting isn't empty) {
        if (beginningBoundary isn't nil) {
            // add the range we just passed
            resultRanges.add(beginningBoundary, boundary - 1, [collected values from rangesInUse])
        }

        // note that we are starting a new range
        beginningBoundary = boundary

        for (range in rangesStarting) {
            rangesInUse.add(range)
            unusedRanges.remove(range)
        }
    }

    // determine which ranges end at this boundary
    for (range in rangesInUse) {
        if (range.end == boundary) {
            rangesEnding.add(range)
        }
    }

    // if any boundaries are ending, stop the range
    if (rangesEnding isn't empty) {
        // add the range up to this boundary
        resultRanges.add(beginningBoundary, boundary, [collected values from rangesInUse]

        for (range in rangesEnding) {
            usedRanges.add(range)
            rangesInUse.remove(range)
        }

        if (rangesInUse isn't empty) {
            // some ranges didn't end; note that we are starting a new range
            beginningBoundary = boundary + 1
        }
        else {
            beginningBoundary = nil
        }
    }
}

Unit test:

At the end, resultRanges should have the results you're looking for, unusedRanges and rangesInUse should be empty, beginningBoundary should be nil, and usedRanges should contain what unusedRanges used to contain (but sorted by range.end).

Chockablock answered 10/3, 2009 at 4:1 Comment(0)
P
0

Here's a solution using the pandas library. This function splits intervals that may overlap into non-overlapping sub-intervals; and keeps track of which intervals contain each sub-interval. This answer was based on algorithms found here and here.

import pandas as pd

def split_overlapping_intervals(df, lb, ub, intvl_id=None):
    """Split bounded intervals that may overlap into non-overlapping sub-intervals. 
    Keep track of which intervals contain each sub-interval."""
    # ensure intvl_id meets our high standards
    if intvl_id is None:
        intvl_id = 'intvl_id'
        df[intvl_id] = range(len(df.index))
    elif df.duplicated(subset=[intvl_id]).any():
            raise ValueError('intvl_id contains duplicate values')
    
    # reshape intervals to get sub-intervals
    df2 = pd.concat([df.drop(columns=[ub])
                       .assign(is_upper_bound=False),
                     df.drop(columns=[lb])
                       .rename(columns={ub: lb})
                       .assign(is_upper_bound=True)])
    df2 = df2.sort_values([lb, 'is_upper_bound'])
    df2[ub] = df2[lb].shift(-1) # the next bound is the new upper bound
    
    # get list of original degenerate intervals (intervals that contains a single value)
    og_degen_intvls = df.loc[df[lb]==df[ub], intvl_id].to_list()
    
    # for each sub-interval x, get list of interval(s) which contain x
    ci_list = [] # list of containing intervals
    def get_containing_intervals(row):
        # get containing interval
        if row['is_upper_bound']:
            ci_list.remove(row[intvl_id])
        else:
            ci_list.append(row[intvl_id])
        # return empty list if the current interval is degenerate, 
        # but ci_list doesn't contain any original degenerate intervals
        if (row[lb]==row[ub]) & (~any(x in og_degen_intvls for x in ci_list)):
            return []
        else:
            return ci_list.copy()
    df2[intvl_id] = df2.apply(lambda row: get_containing_intervals(row), axis=1)
    
    # drop where the sub-interval is not contained by any intervals (where intvl_id==[])
    df2 = df2[df2[intvl_id].astype(bool)]
    
    return df2.drop(columns=['is_upper_bound'])


if __name__=="__main__":
    lb = 'lower_bound'
    ub = 'upper_bound'
    df = pd.DataFrame(data={'rangelabel': ['A', 'B', 'C', 'D'],
                            lb: [0, 0, 95, 120],
                            ub: [100, 75, 150, 130]})

    newdf = split_overlapping_intervals(df, lb, ub, 'rangelabel')
    print(df)
    print(newdf)

Output:

  rangelabel  lower_bound  upper_bound
0          A            0          100
1          B            0           75
2          C           95          150
3          D          120          130
  rangelabel  lower_bound  upper_bound
0     [A, B]            0         75.0
1        [A]           75         95.0
2     [A, C]           95        100.0
3        [C]          100        120.0
4     [C, D]          120        130.0
5        [C]          130        150.0
Poach answered 1/11, 2023 at 20:16 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.