Find all date ranges for overlapping start and end dates in R [duplicate]
Asked Answered
L

4

3

I have a data frame that looks like this:

w<-read.table(header=TRUE,text="
start.date   end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")

I'm trying to get an output that will combine overlapping start and end dates into one date range. So for the example set, I'd like to get:

w<-read.table(header=TRUE,text="
start.date   end.date
2006-06-26 2006-08-16
2007-06-09 2007-07-31
2007-08-04 2007-09-04
2007-09-05 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-08-20")

The question is similar to Date roll-up in R, but I don't need to do any sort of group by on mine, so the answer there is confusing.

Also, the code that was suggested in response to my question below does not work for certain parts of my data frame such as:

x<-read.table(header=TRUE,text="start.date   end.date
2006-01-19 2006-01-20
2006-01-25 2006-01-29
2006-02-24 2006-02-25
2006-03-15 2006-03-22
2006-04-29 2006-04-30
2006-05-24 2006-05-25
2006-06-26 2006-08-16
2006-07-05 2006-07-10
2006-07-12 2006-07-21
2006-08-13 2006-08-15
2006-08-18 2006-08-19
2006-08-28 2006-09-02")

I am confused why it does not?

Leaved answered 17/11, 2016 at 5:13 Comment(2)
https://mcmap.net/q/1329259/-date-roll-up-in-r library(dplyr); w %>% mutate(gr = cumsum(start.date-lag(end.date, default=1)>=0 )) %>% group_by(gr) %>% summarise(start.date = min(start.date), end.date = max(end.date))Megasporangium
This isn't a duplicate of that questions @RonakShah - That question is looking at continuation of dates, mine is looking at overlapping dates.Leaved
V
1

Try this:

w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
w$group <- 1:nrow(w) # common intervals should belong to same group
merge.indices <- lapply(2:nrow(w), function(x) {
                    indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
                    if (length(indices) > 0) indices <- c(indices, x) 
                    indices})
# assign the intervals the right groups
for (i in 1:length(merge.indices)) {
  if (length(merge.indices[[i]]) > 0) {
    w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
  }
}

do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))

It conceptually merges overlapping intervals into the same group as shown below: enter image description here

with output:

   start.date   end.date
1  2006-01-19 2006-01-20
2  2006-01-25 2006-01-29
3  2006-02-24 2006-02-25
4  2006-03-15 2006-03-22
5  2006-04-29 2006-04-30
6  2006-05-24 2006-05-25
7  2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02
Varro answered 17/11, 2016 at 8:9 Comment(5)
When I input new data into the one code that both of you proposed, it seems to not work anymore @Megasporangium I'm not sure how to insert sample data into comments, so I will update my question with new data.Leaved
can you please explain why it does not work, i think it's still giving the desired output, if I understood our requirement correctly.Varro
in the last sample data I provided, rows 8 - 10 should have been folded into row 7 "2006-06-26 2006-08-16", but the code instead now creates a new date range of "2006-06-26 2006-07-10."Leaved
These three start/end date ranges: "2006-07-05 2006-07-10", "2006-07-12 2006-07-21", "2006-08-13 2006-08-15" should be a part of the "2006-06-26 2006-08-16" date range because they overlap with "2006-06-26 2006-08-16". They shouldn't be included in the resulting data frame, but they still appear. Let me know if I can be clearer.Leaved
got you, updated the post with a more generic solution, it should work both with earlier and current dataset.Varro
S
4

The IRanges package on Bioconductor includes the function reduce which can be utilized to combine overlapping start and end dates into one date range.

IRanges works on integer ranges so you have to convert the data from class Date to integerand back. This can be wrapped up in a function:

collapse_date_ranges <- function(w, min.gapwidth = 1L) {
  library(data.table)
  library(magrittr)
  IRanges::IRanges(start = as.integer(as.Date(w$start.date)), 
                   end = as.integer(as.Date(w$end.date))) %>% 
    IRanges::reduce(min.gapwidth = min.gapwidth) %>% 
    as.data.table() %>% 
    .[, lapply(.SD, lubridate::as_date),
      .SDcols = c("start", "end")]
}

collapse_date_ranges(w, 0L)
#        start        end
#1: 2006-06-26 2006-08-16
#2: 2007-06-09 2007-07-31
#3: 2007-08-04 2007-09-04
#4: 2007-09-05 2007-10-12
#5: 2007-10-19 2007-11-16
#6: 2007-11-17 2007-12-15
#7: 2008-06-18 2008-08-20

collapse_date_ranges(x, 0L)
#        start        end
#1: 2006-01-19 2006-01-20
#2: 2006-01-25 2006-01-29
#3: 2006-02-24 2006-02-25
#4: 2006-03-15 2006-03-22
#5: 2006-04-29 2006-04-30
#6: 2006-05-24 2006-05-25
#7: 2006-06-26 2006-08-16
#8: 2006-08-18 2006-08-19
#9: 2006-08-28 2006-09-02

Explanation

  • In order to avoid name clashes, I prefer the double colon operators :: to access single functions from the IRanges package over using library(IRanges) which loads the whole package.
  • The start and end dates are converted to integer (as.Date is just to ensure the proper class) and create an IRanges object.
  • reduce does all the hard work. The parameter min.gapwidth is required here as reduce collapses adjacent ranges by default (see below).
  • Finally, the result is converted back from integer to date. (You may use dplyr instead of data.table as well.)
  • The solution works for both sample data sets w and x. x includes a special case where one date range embeds other date ranges to full extent.

Appendix: Collapsing adjacent date ranges

The sample result given by the OP shows that adjacent data ranges should not be collapsed, e.g., the range 2007-10-19 to 2007-11-16 is separate from the range 2007-11-17 to 2007-12-15 although the second range starts only one day after the first one has ended.

Just in case, adjacent date ranges are to be collapsed this can be achieved by using the default value of the min.gapwidth parameter:

collapse_date_ranges(w)
#        start        end
#1: 2006-06-26 2006-08-16
#2: 2007-06-09 2007-07-31
#3: 2007-08-04 2007-10-12
#4: 2007-10-19 2007-12-15
#5: 2008-06-18 2008-08-20
Stirrup answered 21/11, 2016 at 10:53 Comment(0)
V
1

Try this:

w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
w$group <- 1:nrow(w) # common intervals should belong to same group
merge.indices <- lapply(2:nrow(w), function(x) {
                    indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
                    if (length(indices) > 0) indices <- c(indices, x) 
                    indices})
# assign the intervals the right groups
for (i in 1:length(merge.indices)) {
  if (length(merge.indices[[i]]) > 0) {
    w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
  }
}

do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))

It conceptually merges overlapping intervals into the same group as shown below: enter image description here

with output:

   start.date   end.date
1  2006-01-19 2006-01-20
2  2006-01-25 2006-01-29
3  2006-02-24 2006-02-25
4  2006-03-15 2006-03-22
5  2006-04-29 2006-04-30
6  2006-05-24 2006-05-25
7  2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02
Varro answered 17/11, 2016 at 8:9 Comment(5)
When I input new data into the one code that both of you proposed, it seems to not work anymore @Megasporangium I'm not sure how to insert sample data into comments, so I will update my question with new data.Leaved
can you please explain why it does not work, i think it's still giving the desired output, if I understood our requirement correctly.Varro
in the last sample data I provided, rows 8 - 10 should have been folded into row 7 "2006-06-26 2006-08-16", but the code instead now creates a new date range of "2006-06-26 2006-07-10."Leaved
These three start/end date ranges: "2006-07-05 2006-07-10", "2006-07-12 2006-07-21", "2006-08-13 2006-08-15" should be a part of the "2006-06-26 2006-08-16" date range because they overlap with "2006-06-26 2006-08-16". They shouldn't be included in the resulting data frame, but they still appear. Let me know if I can be clearer.Leaved
got you, updated the post with a more generic solution, it should work both with earlier and current dataset.Varro
N
0

Solution.

w<-read.table(header=TRUE, stringsAsFactor=F, text="
start.date   end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")

w <- data.frame(lapply(w, as.Date))

library(lubridate)

idx.rle <- rle(as.numeric(sapply(1:(nrow(w)-1), function(i) int_overlaps(interval(w[i,1],w[i,2]), interval(w[i+1,1],w[i+1,2])))))




i.starts <- nrow(w)-rev(cumsum(rev(idx.rle$length)))
i.ends <-  1+cumsum(idx.rle$length)

 do.call(rbind,
    lapply(1:length(idx.rle$lengths),
           function(i) {
               i.start <- i.starts[i]
               i.end <-  i.ends[i]
               if(idx.rle$values[i]==1) {
                   d <- data.frame(start.date=w[i.start,1],
                                   end.date=max(w[i.start:i.end,2]) );
                   names(d) <- names(w);
                   d
               } else {
                   if(idx.rle$lengths[i]>1&i>1&i<length(idx.rle$lengths)) {
                       data.frame(w[(i.start+1):(i.end-1),] )
                   } else {
                       if (idx.rle$lengths[i]>=1&i==1) {
                           data.frame(w[(i.start):(i.end-1),])
                       } else {
                           if(idx.rle$lengths[i]>=1&i==length(idx.rle$lengths)) data.frame(w[(i.start+1):(i.end),] ) 
                       }
                   }
               }
           }))
Noonan answered 19/11, 2016 at 14:32 Comment(2)
I receive this error: Error in match.names(clabs, names(xi)) : names do not match previous namesLeaved
You use not exact "start.date"&"end.date" as names in "w". I've fixed it.Noonan
B
0

For anyone referring back to this older question, here's a newer option using a package dedicated to working with intervals:

library(tidyverse)
library(ivs)

w <- read.table(header = TRUE, text = "
start.date   end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2007-07-07
2007-06-24 2007-07-22
2007-07-03 2007-07-31
2007-08-04 2007-09-01
2007-08-07 2007-09-04
2007-09-05 2007-10-03
2007-09-14 2007-10-12
2007-10-19 2007-11-16
2007-11-17 2007-12-15
2008-06-18 2008-07-16
2008-06-28 2008-07-26
2008-07-11 2008-08-08
2008-07-23 2008-08-20")

w |> 
  mutate(iv = iv(start.date, end.date)) |> 
  summarise(iv = iv_groups(iv), .groups = "drop")
#>                         iv
#> 1 [2006-06-26, 2006-08-16)
#> 2 [2007-06-09, 2007-07-31)
#> 3 [2007-08-04, 2007-09-04)
#> 4 [2007-09-05, 2007-10-12)
#> 5 [2007-10-19, 2007-11-16)
#> 6 [2007-11-17, 2007-12-15)
#> 7 [2008-06-18, 2008-08-20)

Created on 2022-05-27 by the reprex package (v2.0.1)

Baras answered 27/5, 2022 at 11:58 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.