how to efficiently merge int ranges in a stream?
Asked Answered
U

5

11

We are given a continuous stream of integer ranges like [1,3], [5,10], [2,6], ... when each new range comes, we need to check with existing ranges and see if it overlaps with any existing range, and if any overlap is found, all the overlapping ranges are removed and the merged range is inserted instead. We need an efficient algorithm for this. Note that the range comes one at a time which forms a stream...

Was asked this question during an interview. Ideas?

The intent is to merge overlapping ranges into one. for example, if we have 3 ranges coming in the following order: [1,3], [2,6], [5,10]. Then we first merge the first two into [1,6], then merge with the 3rd and it becomes [1,10].

Urbanist answered 20/1, 2011 at 14:16 Comment(4)
Your question is unclear. Do you want the output to be a single stream like [1,2,3,5,6,7,8,9,10,2,3,4,5,6]?Bribery
@Jim Mischel: I was assuming that each tuple provided a start and end of a range, therefore, after the first tuple the resulting range would include [1,2,3]. After the second tuple the resulting range would be [1,2,3,5,6,7,8,9,10] (no 4).Greaves
to my eyes a possible logical output should be a list of (ordered ?) ranges merging overlapping rangesSwordfish
The intent is to merge overlapping ranges into one. for example, if we have 3 ranges coming in the following order: [1,3], [2,6], [5,10]. Then we first merge the first two into [1,6], then merge with the 3rd and it becomes [1,10].Urbanist
I
11

The standard algorithm for this problem is an interval tree.

Inclination answered 20/1, 2011 at 21:27 Comment(2)
Thanks. I just took a look at interval tree, and while it only requires O(logn) time when looking up an overlapping entry for a new range, inserting a new range could overlap with a lot of entries in the tree, causing all of them being deleted. The cost to delete these overlapping entries and rotating the tree to a balanced state would be non-trival. Or am I missing anything?Urbanist
Yes, inserting a new range might result in the deletion of a lot of other ranges, but you can amortize that cost against the cost to insert those other ranges in the first place. I haven't thought about keeping the tree balanced, but I imagine there is a way to do it ala AVL or red-black trees.Inclination
A
-1

When a new tuple comes in (newstart,newend), perform a binary search fon newstart-1 in the list of existing closing elements and likewise for newend+1 in the list of existing opening elements.

Merge with any matching ranges.

If no range matches, insert between the two closest ranges.

Update: Scratch that, I was solving the wrong problem: merging touching ranges. But the overlapping range solution won't be too dissimilar.

  1. Binary search for the greatest existing starting element where start(n) <= newstart
  2. Binary search for the smallest existing ending element where end(n) >= newstart
  3. If the two return the same index, your tuple start is mergeable with the nth entry, replace newstart with start(n)
  4. Binary search for the greatest existing starting element where start(m) <= newend
  5. Binary search for the smallest existing ending element where end(m) >= newend
  6. If the two return the same index, your tuple end is mergeable with the mth entry, replace newend with end(m)
  7. Replace all entries between the nth and mth index with the (newstart,newend) tuple.
Aribold answered 20/1, 2011 at 15:21 Comment(3)
That's similar to my answer during the interview. The problem with this is it gradually creates duplicate entries at step 7 which as time goes will cause a waste of spaces.Urbanist
@Urbanist Duplicate entries? Where?Aribold
binary search is log(n), but what data structure for merging intervals. Array removing a interval takes O(n). Linked List cannot work for binary search.Magistery
J
-1

You can represent the range as a single sequence of alternating "on" and "off" points. Whenever a new range arrives, its start and end points are searched for and the appropriate actions of merging or inserting taken.

In order to get good performance for the required operations—search, insert, and delete—one would perhaps use something like a B-tree here.

Julianajuliane answered 20/1, 2011 at 15:42 Comment(0)
R
-1

Interval tree server the purpose well, but here is another approach that I have used to add to the range.

The assumption is that the list to which the new tuple(a rane) is being added is either empty or has no overlapping ranges. Secondly, input is of the form a, b where a <= b You can convert the list of tuples to a single list of numbers and then add the new tuple to it.

Let rangeList be the current list of ranges. eg. [1, 4, 6, 10, 12, 14] would mean a range list of [(1, 4), (6, 10), (12, 14)]

  1. If the list is empty, simply insert the elements of the tuple in the list and you are done
  2. Find the position of the a, b element in the list using binary search. (If the element does not exist, return the position of the least element greater than the searched for number)
  3. Let the positions returned be pos_a, pos_b for the tuple elements a, b respectively.
  4. If pos_a is even, list_A = [a]
  5. If pos_b is even, list_B = [b]
  6. The new list = rangeList[0:pos_a] + list_A + list_B + rangeList[b:end] and you are done.

By converting the list of tuples to a single list, we eliminate the need for comparing the new elements with 2 different lists. Thus finding its position easily and by checking for odd or even, tells us whether it lies between an existing range

def subroutine(rangeList, currTuple)
"""
rangeList: Existing list of non overlapping tuples
currTuple: The range tuple to be added
"""
    if not rangeList:
        rangeList.extend(currTuple)
    else:
        a, b = binSearch(currTuple, rangeList)
        list_changed = rangeList[0:a]
        if a%2 == 0:
            list_changed.append(currTuple[0])
        if b%2 == 0:
            list_changed.append(currTuple[1])
        list_changed.extend(rangeList[b:])
    return list_changed
Rhiannonrhianon answered 2/8, 2012 at 14:38 Comment(0)
C
-1

As the intervals are streaming they must be persisted in some sort of collection that is efficient for

  • inserting new intervals
  • finding overlapping intervals
  • merging overlapping intervals as new intervals are added

The merge operation needs to accommodate the following cases

  • no overlap insert (7, 9) into [(1, 5), (10, 15)] gives [(1,5), (7, 9), (10,15)]

  • some overlap insert (4, 6) into [(1, 5), (10, 15)] gives [(1,6), (10,15)]

  • bridge overlap insert (4, 11) into [(1, 5), (10, 15), (20, 30)] gives [(1, 15), (20, 30)]

  • supreme overlap insert (1, 20) into [(2, 5), (7, 9), (13, 15)] gives [(1, 20)]

  • full overlap insert (5, 6) into [(1, 7), (10, 19)] gives [(1, 7), (10, 19)]

  • and many in-between cases and variants of what is stated above

At the end of an insert operation this collection must contain only non-overlapping intervals. It is obvious that keeping these non-overlapping intervals sorted in this collection is necessary to efficiently find and merge overlapping intervals. Any unsorted solution would require searching every interval anytime anything is inserted, and then repeating that process every time a merge occurs. Keeping in mind elements must remain sorted, any type of hashing or any unordered data structure will also not help us. The best we can do here is O(nlgn).

Using a sorted array (or its arraylist or vector variants) one approach could be to binary search for the newly inserted interval's start, then search left and right and merge where necessary. This would result in O(nlgn) for searching for where to place the element, however inserting in the middle of an array and removing elements from the array requires re-indexing every element in the array anytime an insert or a merge occurs. This is too expensive of an operation, it doesn't scale well, and also defeats the purpose of keeping the elements sorted degrading to the performance of the naive unsorted solution.

One way to solve the insert/remove problem is to use a sorted linked-list where inserts and deletes are cheap, however this would make binary search very inefficient if not impossible and would also defeat the purpose of keeping the elements sorted, again degrading to performance of the naive unsorted solution.

It is worth researching interval trees defined in wikipedia, however this data structure does not merge intervals, it only can query for them so this does not support our use case either.

The best way is to use a binary tree to store the intervals, where the interval has a defined comparator method that returns that the elements are equal if there is any overlap. This makes it very efficient to finding overlapping intervals in O(lgn) time.

An example comparator class looks like this

public class Interval<T extends Comparable<T>> implements Comparable<Interval<T>>
{
    private final T start_;
    private final T end_;

    public Interval(final T start, final T end)
    {
        this.start_ = start;
        this.end_ = end;
    }

    @Override 
    public int compareTo(Interval other) {
        if (other.start_.compareTo(this.end_) == -1) { 
            return -1;
        } else if (other.start_.compareTo(this.end_) == 1) { 
            return 1; 
        } else { 
            return 0;
        }
    }
}

A treeset (or set or any of its variants) can either be extended or a composited to fit the insert operation. Note below I am using a treemap variant since java does not allow returning an element in a set.

public class IntervalCollection
{
    private Map<Interval, Interval> intervalCollection_ = new TreeMap<>();

    public void insert(Interval interval)
    {
        while (intervalCollection_.containsKey(interval)) {

            Interval other = intervalCollection_.get(interval);
            intervalCollection_.remove(interval);

            interval = new Interval(Math.min(interval.start_, other.start_),
                Math.max(interval.end_, other.end_));
        }

        intervalCollection_.put(interval, interval);
    }
}

This collection uses the power of the language at hand to do most of the hard work for you and provides efficient insertion and merging for streaming intervals.

Claire answered 27/7, 2018 at 1:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.