Here's my solution.
It's based on the algorithm here (https://softwareengineering.stackexchange.com/questions/363091/split-overlapping-ranges-into-all-unique-ranges?newreg=93383e379afe4dd3a595480528ee1541), but uses data.table, shift, and vectorized ifelse statements for efficiency. It also differs from the alrgorithm in that my code allows this operation to be performed separately for multiple datasets identified by an id_column. My approach also ignores keeping track of rows (ie "attribute") since it's not necessary to define this when the intervals can be easily merged back to the original data using foverlaps
anyway. foverlaps also serves the purpose of excluding gaps
Please tell me whether you see any inefficiencies
remove_overlaps <- function(x, start_column, end_column, id_column=NULL){
xd <- melt(x[,c(start_column,end_column,id_column),with=FALSE],id=id_column)
xd[variable==start_column,end:=FALSE]
xd[variable==end_column,end:=TRUE]
setorderv(xd,c(id_column, "value","end"))
xd[,end_next:=shift(end,type="lead"),by=id_column]
xd[,value_next:=shift(value,type="lead"),by=id_column]
#excluding end_next when missing should cause this to ignore the last row in each group
#because this element will be NA as defined by shift
temp <- xd[,.SD[!is.na(end_next),list(
start=ifelse(!end,value,value+1),
end=ifelse(!end_next,value_next-1,value_next)
)],by=id_column]
temp <- temp[end>=start]
setnames(temp , c("start","end"),c(start_column,end_column))
setkeyv(temp,c(id_column,start_column,end_column))
out <- foverlaps(x,temp)
setorderv(out, c(id_column,start_column,
paste0("i.",start_column),
paste0("i.",end_column)
))
out
}
remove_overlaps(x, start_column="start1",end_column="end1",id_column="id1")
Also, for what it's worth I don't think the suggestion linked on that page is correct on how to exclude gaps.
This answer doesn't take account of gaps (gaps should not appear in
output), so I refined it: * If e=false, add a to S. If e=true, take
away a from S. * Define n'=n if e=false or n'=n+1 if e=true * Define
m'=m-1 if f=false or m'=m if f=true * If n' <= m' and (e and not f) =
false, output (n',m',S), otherwise output nothing. – silentman.it Aug
23 '18 at 12:19
Here is a second version of this code algorithm implemented in R: remove_overlaps doesn't explicitly use silentman.it's suggestion to exclude gaps, whereas remove_overlaps1 uses that suggestion. Note that both functions do exclude gaps via the subsequent call to foverlaps, which only returns intervals if they partially match to those in x (the original data).
library(data.table)
remove_overlaps1 <- function(x, start_column, end_column, id_column=NULL){
xd <- melt(x[,c(start_column,end_column,id_column),with=FALSE],id=id_column)
xd[variable==start_column,end:=FALSE]
xd[variable==end_column,end:=TRUE]
setorderv(xd,c(id_column, "value","end"))
xd[,end_next:=shift(end,type="lead"),by=id_column]
xd[,value_next:=shift(value,type="lead"),by=id_column]
###subset to rows where (e & !f) = FALSE, as per comment suggestion on linked answer
temp <- xd[,.SD[!is.na(end_next)&!(end & !end_next),list(
start=ifelse(!end,value,value+1),
end=ifelse(!end_next,value_next-1,value_next)
)],by=id_column]
temp <- temp[end>=start]
setnames(temp , c("start","end"),c(start_column,end_column))
setkeyv(temp,c(id_column,start_column,end_column))
out <- foverlaps(x,temp) #this should exclude gaps since foverlaps by default subsets to
setorderv(out, c(id_column,start_column,
paste0("i.",start_column),
paste0("i.",end_column)
))
out
}
Example data:
library(data.table)
x <-
structure(
list(
native_id = c(
"1",
"1",
"1",
"1",
"1"
),
n_start_date = c(14761, 14775,
14789, 14803, 14817),
n_end_date = c(14776, 14790, 14804, 14818,
14832),
obs = c(
31.668140525481,
34.8623263656539,
35.0841466093899,
37.2281249364127,
36.3726151694052
)
),
row.names = c(NA,-5L),
class = "data.frame",
.Names = c("native_id",
"n_start_date", "n_end_date", "obs")
)
setDT(x)
> x
native_id n_start_date n_end_date obs
1: 1 14761 14776 31.66814
2: 1 14775 14790 34.86233
3: 1 14789 14804 35.08415
4: 1 14803 14818 37.22812
5: 1 14817 14832 36.37262
Results:
> remove_overlaps(x, start_column="n_start_date",end_column="n_end_date",id_column="native_id")
native_id n_start_date n_end_date i.n_start_date i.n_end_date obs
1: 1 14761 14774 14761 14776 31.66814
2: 1 14775 14776 14761 14776 31.66814
3: 1 14775 14776 14775 14790 34.86233
4: 1 14777 14788 14775 14790 34.86233
5: 1 14789 14790 14775 14790 34.86233
6: 1 14789 14790 14789 14804 35.08415
7: 1 14791 14802 14789 14804 35.08415
8: 1 14803 14804 14789 14804 35.08415
9: 1 14803 14804 14803 14818 37.22812
10: 1 14805 14816 14803 14818 37.22812
11: 1 14817 14818 14803 14818 37.22812
12: 1 14817 14818 14817 14832 36.37262
13: 1 14819 14832 14817 14832 36.37262
Seemingly incorrect, excludes too many intervals:
> remove_overlaps1(x, start_column="n_start_date",end_column="n_end_date",id_column="native_id")
native_id n_start_date n_end_date i.n_start_date i.n_end_date obs
1: 1 14761 14774 14761 14776 31.66814
2: 1 14775 14776 14761 14776 31.66814
3: 1 14775 14776 14775 14790 34.86233
4: 1 14789 14790 14775 14790 34.86233
5: 1 14789 14790 14789 14804 35.08415
6: 1 14803 14804 14789 14804 35.08415
7: 1 14803 14804 14803 14818 37.22812
8: 1 14817 14818 14803 14818 37.22812
9: 1 14817 14818 14817 14832 36.37262
10: 1 14819 14832 14817 14832 36.37262