As the intervals are streaming they must be persisted in some sort of collection that is efficient for
- inserting new intervals
- finding overlapping intervals
- merging overlapping intervals as new intervals are added
The merge operation needs to accommodate the following cases
no overlap insert (7, 9)
into [(1, 5), (10, 15)]
gives [(1,5), (7, 9), (10,15)]
some overlap insert (4, 6)
into [(1, 5), (10, 15)]
gives [(1,6), (10,15)]
bridge overlap insert (4, 11)
into [(1, 5), (10, 15), (20, 30)]
gives [(1, 15), (20, 30)]
supreme overlap insert (1, 20)
into [(2, 5), (7, 9), (13, 15)]
gives [(1, 20)]
full overlap insert (5, 6)
into [(1, 7), (10, 19)]
gives [(1, 7), (10, 19)]
and many in-between cases and variants of what is stated above
At the end of an insert operation this collection must contain only non-overlapping intervals. It is obvious that keeping these non-overlapping intervals sorted in this collection is necessary to efficiently find and merge overlapping intervals. Any unsorted solution would require searching every interval anytime anything is inserted, and then repeating that process every time a merge occurs. Keeping in mind elements must remain sorted, any type of hashing or any unordered data structure will also not help us. The best we can do here is O(nlgn)
.
Using a sorted array (or its arraylist or vector variants) one approach could be to binary search for the newly inserted interval's start, then search left and right and merge where necessary. This would result in O(nlgn)
for searching for where to place the element, however inserting in the middle of an array and removing elements from the array requires re-indexing every element in the array anytime an insert or a merge occurs. This is too expensive of an operation, it doesn't scale well, and also defeats the purpose of keeping the elements sorted degrading to the performance of the naive unsorted solution.
One way to solve the insert/remove problem is to use a sorted linked-list where inserts and deletes are cheap, however this would make binary search very inefficient if not impossible and would also defeat the purpose of keeping the elements sorted, again degrading to performance of the naive unsorted solution.
It is worth researching interval trees defined in wikipedia, however this data structure does not merge intervals, it only can query for them so this does not support our use case either.
The best way is to use a binary tree to store the intervals, where the interval has a defined comparator method that returns that the elements are equal if there is any overlap. This makes it very efficient to finding overlapping intervals in O(lgn)
time.
An example comparator class looks like this
public class Interval<T extends Comparable<T>> implements Comparable<Interval<T>>
{
private final T start_;
private final T end_;
public Interval(final T start, final T end)
{
this.start_ = start;
this.end_ = end;
}
@Override
public int compareTo(Interval other) {
if (other.start_.compareTo(this.end_) == -1) {
return -1;
} else if (other.start_.compareTo(this.end_) == 1) {
return 1;
} else {
return 0;
}
}
}
A treeset (or set or any of its variants) can either be extended or a composited to fit the insert operation. Note below I am using a treemap
variant since java does not allow returning an element in a set.
public class IntervalCollection
{
private Map<Interval, Interval> intervalCollection_ = new TreeMap<>();
public void insert(Interval interval)
{
while (intervalCollection_.containsKey(interval)) {
Interval other = intervalCollection_.get(interval);
intervalCollection_.remove(interval);
interval = new Interval(Math.min(interval.start_, other.start_),
Math.max(interval.end_, other.end_));
}
intervalCollection_.put(interval, interval);
}
}
This collection uses the power of the language at hand to do most of the hard work for you and provides efficient insertion and merging for streaming intervals.