Find overlapping dates for each ID and create a new row for the overlap
Asked Answered
S

3

9

I would like to find the overlapping dates for each ID and create a new row with the overlapping dates and also combine the characters (char) for the lines. It is possible that my data will have >2 overlaps and need >2 combinations of characters. eg. ERM

Data:

ID    date1         date2       char
15  2003-04-05  2003-05-06      E
15  2003-04-20  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2007-02-22      I   
17  2005-04-15  2014-05-19      C
17  2007-05-15  2008-02-05      I
17  2008-02-05  2012-02-14      M
17  2010-06-07  2011-02-14      V
17  2010-09-22  2014-05-19      P
17  2012-02-28  2013-03-04      R

Output I would like:

ID  date1       date2           char
15  2003-04-05  2003-04-20      E
15  2003-04-20  2003-05-06      ER
15  2003-05-06  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2005-04-15      I
17  2005-04-15  2007-02-22      IC
17  2005-04-15  2007-05-15      C   
17  2007-05-15  2008-02-05      CI
17  2008-02-05  2012-02-14      CM
17  2010-06-07  2011-02-14      CV
17  2010-09-22  2014-05-19      CP
17  2012-02-28  2013-03-04      CR
17  2014-05-19  2014-05-19      P
17  2010-06-07  2012-02-14      MV
17  2010-09-22  2011-02-14      VP
17  2012-02-28  2013-03-04      RP

What I have tried: I have tried using subtracting date 2 from the current row from the row below using:

df$diff <- c(NA,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"])

Then to determine the overlaps between the rows:

df$overlap[which(df$diff<1)] <-1
df$overlap.up <- c(df$overlap[2:(nrow(df))], "NA")
df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1

I then selected those that had an overlap.final==1 and put them into another dataframe and found the overlaps for each ID.

However, I have realized that this is way too simplistic and flawed, because it only selects overlaps that occur sequentially (using the difference in dates in the first step). What I need to do is to take the series of dates for each ID and loop through each combination to determine if there is an overlap and then, if so, record that start and end date and create a new character “char” signalling what was combined during those two dates. I think I need a loop to do this.

I tried to create a loop to find the overlap intervals between date1 and date 2

df <- df[which(!duplicated(df$ ID)),]

for (i in 1:nrow(df)) {
  tmp <- length(which(df $ID[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2&  df$date2[i]> df$date1))) >0
  df$int[i]<- tmp

}

However this does not work.

After identifying the overlapping intervals, I need to create new rows for each new start and end date and a new character that represents the overlap.

Another version of the loop I have tried to identify overlaps:

for (i in 1:nrow(df)) {
  if (df$ID[i]==IDs$ID){
  tmp <- length(df, df$ ID[i]==IDs$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 &  df$date2[i]> df$date1)) >0
  df$int[i]<- tmp
  }
}
Sparid answered 11/9, 2017 at 8:39 Comment(6)
What have you tried and where do you get stuck?Algiers
The IRanges package on bioconductor has some nice functions for this.Holly
#24480531Rodriguez
Are your dates are of some nice time/date data format or are they strings? I mean: the basic idea is to look whether date1[i1]>date1[i2] && date1[i1]<date2[i2] and the same for date2[i1]. Then the both rows are combined. The date1 and date2 are set to min of both date1 and date2, respectively. Your question to >2 overlapping intervals: ERI is such a case. Do you want four new rows (ER, EI, RI, and ERI) and one new row (ERI)? The latter might yield some conflicts. I would solve it with some recursive function call ... .Finn
Hi @RomanLuštrik, I have edited my post to reflect what I've tried so far.Sparid
Hi @daniel.neumann, I have tried something similar to what you are suggesting to find the overlapping dates, without success, yet. Yes the dates are in "%Y-%m-%d" format. Yes, I would like to create new rows that show the dates of the overlap, plus a combination of characters (in my data they are actually words) to reflect what the overlap was.Sparid
H
12

First, we create a data.table of all possible intervals for each ID.

All possible intervals means that we take all start and end dates of an ID and combine them in a sorted vector tmp. The unique values indicate all possible intersections (or breaks) of all given intervals of the ID on the time axis. For later joining, the breaks are re-arranged in one interval per row with a start and an end column:

library(data.table)
options(datatable.print.class = TRUE)
breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
  }, by = ID]
breaks
       ID      start        end
    <int>     <IDat>     <IDat>
 1:    15 2003-04-05 2003-04-20
 2:    15 2003-04-20 2003-05-06
 3:    15 2003-05-06 2003-06-20
 4:    16 2001-01-02 2002-03-04
 5:    17 2003-03-05 2005-04-15
 6:    17 2005-04-15 2007-02-22
 7:    17 2007-02-22 2007-05-15
 8:    17 2007-05-15 2008-02-05
 9:    17 2008-02-05 2010-06-07
10:    17 2010-06-07 2010-09-22
11:    17 2010-09-22 2011-02-14
12:    17 2011-02-14 2012-02-14
13:    17 2012-02-14 2012-02-28
14:    17 2012-02-28 2013-03-04
15:    17 2013-03-04 2014-05-19

Then, a non-equi join is performed whereby the values are aggregated simultaneously on the join conditions (by = .EACHI is called grouping by each i, see this answer for a more detailed explanation):

DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),  
   by = .EACHI, allow.cartesian = TRUE]
       ID      date1      date2     V1
    <int>     <IDat>     <IDat> <char>
 1:    15 2003-04-05 2003-04-20      E
 2:    15 2003-04-20 2003-05-06     ER
 3:    15 2003-05-06 2003-06-20      R
 4:    16 2001-01-02 2002-03-04      M
 5:    17 2003-03-05 2005-04-15      I
 6:    17 2005-04-15 2007-02-22     IC
 7:    17 2007-02-22 2007-05-15      C
 8:    17 2007-05-15 2008-02-05     CI
 9:    17 2008-02-05 2010-06-07     CM
10:    17 2010-06-07 2010-09-22    CMV
11:    17 2010-09-22 2011-02-14   CMVP
12:    17 2011-02-14 2012-02-14    CMP
13:    17 2012-02-14 2012-02-28     CP
14:    17 2012-02-28 2013-03-04    CPR
15:    17 2013-03-04 2014-05-19     CP

The result differs from the expected result posted by the OP but plotting the data convinces that the above result shows all possible overlaps:

library(ggplot2)
ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) + 
  geom_segment() + facet_wrap("ID", ncol = 1L) 

enter image description here

Data

library(data.table)
DT <- fread(
  "ID    date1         date2       char
15  2003-04-05  2003-05-06      E
15  2003-04-20  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2007-02-22      I   
17  2005-04-15  2014-05-19      C
17  2007-05-15  2008-02-05      I
17  2008-02-05  2012-02-14      M
17  2010-06-07  2011-02-14      V
17  2010-09-22  2014-05-19      P
17  2012-02-28  2013-03-04      R"
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
Holly answered 12/9, 2017 at 9:54 Comment(4)
Thanks for this answer, I was having trouble applying foverlaps in the way I wanted but this worked. I'm unfamiliar with data.table grammar, can I ask what by = .EACHI does exactly in your non-equi join? And if I were only interested in overlapping pairs / dyads, is there any way I could return just two overlapping char in two separate columns? Thanks!Rampage
@Rampage Concerning your first question I have updated my answer with a link to a more detailed explanation. Concerning your second question I believe this can be solved as well. However, according to SO guidelines I kindly suggest that you pose a new question. Please, keep me posted. Thank you.Holly
Thanks for the quick reply and extra detail to your answer, that's very useful. I've posed this as a new question here if you think you can suggest a solution. Thanks!Rampage
Wow! I really enjoyed reading this answer. Beautiful and concise code! Thank you!Lakesha
F
1

Introduction

The for-loop you added to your question and the included comparison were a good start. The should be some additional brackets ( and ) in the date comparison. This for-loop-approach automatically considers new rows in the data frame. Therefore, you can get three-, four- and more-character strings in the char column.

Create input data

df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
                        'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
                        'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
                        'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
                   stringsAsFactors=FALSE)

Solution

Iterate all rows (that were existing in the original data.frame) and compare them to all current lines.

nrow_init = nrow(df)
for (i in 1:(nrow(df)-1)) {
  print(i)
  ## get rows of df that have overlapping dates
  ##   (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
  ##   (!grepl(df$char[i],df$char)) :: prevent double letters
  ## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
  loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
  ## Uncomment this line, if you want to compare only two rows each and not more
  # loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))

  ## proceed only of at least one duplicate row was found
  if (sum(loc) > 0) {
    # build new rows
    #  pmax and pmin do element-wise min and max calculation; df$date1[i] and df$date2[i] are automatically extended to the length of df$date1[loc] and df$date2[loc], respectively
    df_append = as.data.frame(list('ID'=df$ID[loc],
                                   'date1'=pmax(df$date1[i],df$date1[loc]),
                                   'date2'=pmin(df$date2[i],df$date2[loc]),
                                   'char'=paste0(df$char[i],df$char[loc])))
    ## append new rows
    df = rbind(df, df_append)
  }
}

## create a new column and sort the characters in it
##  idea for sort: https://mcmap.net/q/480684/-how-to-sort-letters-in-a-string
df$sort_char = df$char
for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
## remove duplicates
df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
## remove additional column
df$sort_char = NULL

Out put

ID      date1      date2 char
15 2003-04-05 2003-05-06    E
15 2003-04-20 2003-06-20    R
16 2001-01-02 2002-03-04    M
17 2003-03-05 2007-02-22    I
17 2005-04-15 2014-05-19    C
17 2007-05-15 2008-02-05    I
17 2008-02-05 2012-02-14    M
17 2010-06-07 2011-02-14    V
17 2010-09-22 2014-05-19    P
17 2012-02-28 2013-03-04    R
15 2003-04-20 2003-05-06   ER
17 2005-04-15 2007-02-22   IC
17 2007-05-15 2008-02-05   CI
17 2008-02-05 2012-02-14   CM
17 2010-06-07 2011-02-14   CV
17 2010-09-22 2014-05-19   CP
17 2012-02-28 2013-03-04   CR
17 2010-06-07 2011-02-14   MV
17 2010-09-22 2012-02-14   MP
17 2010-06-07 2011-02-14  MCV
17 2010-09-22 2012-02-14  MCP
17 2010-09-22 2011-02-14   VP
17 2010-09-22 2011-02-14  VCP
17 2010-09-22 2011-02-14  VMP
17 2010-09-22 2011-02-14 VMCP
17 2012-02-28 2013-03-04   PR
17 2012-02-28 2013-03-04  PCR
Finn answered 12/9, 2017 at 8:40 Comment(1)
... there are missing some combination; I will soon post an improvementFinn
G
1

You can also use dplyr/tidyr along with the ivs package, which is a package dedicated to working with interval vectors like you have here. This allows you to combine your start/end dates into a single interval column and use a variety of iv_*() functions on it, here we use iv_identify_splits().

Understanding iv_identify_splits() can be a little tricky at first, so I'd encourage you to take a look at the graphical representation of that operation here

library(dplyr)
library(tidyr)
library(ivs)

df <- tribble(
  ~ID,       ~date1,       ~date2, ~char,
  15L, "2003-04-05", "2003-05-06",   "E",
  15L, "2003-04-20", "2003-06-20",   "R",
  16L, "2001-01-02", "2002-03-04",   "M",
  17L, "2003-03-05", "2007-02-22",   "I",
  17L, "2005-04-15", "2014-05-19",   "C",
  17L, "2007-05-15", "2008-02-05",   "I",
  17L, "2008-02-05", "2012-02-14",   "M",
  17L, "2010-06-07", "2011-02-14",   "V",
  17L, "2010-09-22", "2014-05-19",   "P",
  17L, "2012-02-28", "2013-03-04",   "R"
) %>%
  mutate(
    date1 = as.Date(date1),
    date2 = as.Date(date2)
  )

# Combine the start/stop endpoints into a single interval vector
df <- df %>%
  mutate(interval = iv(date1, date2), .keep = "unused")

# Note that these are half-open intervals and you may need to adjust the end!
df
#> # A tibble: 10 × 3
#>       ID char                  interval
#>    <int> <chr>               <iv<date>>
#>  1    15 E     [2003-04-05, 2003-05-06)
#>  2    15 R     [2003-04-20, 2003-06-20)
#>  3    16 M     [2001-01-02, 2002-03-04)
#>  4    17 I     [2003-03-05, 2007-02-22)
#>  5    17 C     [2005-04-15, 2014-05-19)
#>  6    17 I     [2007-05-15, 2008-02-05)
#>  7    17 M     [2008-02-05, 2012-02-14)
#>  8    17 V     [2010-06-07, 2011-02-14)
#>  9    17 P     [2010-09-22, 2014-05-19)
#> 10    17 R     [2012-02-28, 2013-03-04)

# For each ID, compute the "splits" for each interval.
# This splits on all the endpoints and returns a list column
df <- df %>%
  group_by(ID) %>%
  mutate(splits = iv_identify_splits(interval))

print(df, n = 3)
#> # A tibble: 10 × 4
#> # Groups:   ID [3]
#>      ID char                  interval           splits
#>   <int> <chr>               <iv<date>> <list<iv<date>>>
#> 1    15 E     [2003-04-05, 2003-05-06)              [2]
#> 2    15 R     [2003-04-20, 2003-06-20)              [2]
#> 3    16 M     [2001-01-02, 2002-03-04)              [1]
#> # … with 7 more rows

# Note how the total range of the splits vector matches the
# range of the corresponding interval
df$interval[[1]]
#> <iv<date>[1]>
#> [1] [2003-04-05, 2003-05-06)
df$splits[[1]]
#> <iv<date>[2]>
#> [1] [2003-04-05, 2003-04-20) [2003-04-20, 2003-05-06)

# From there we can unchop() the splits column so we can group on it
df <- df %>%
  unchop(splits)

# Note how rows 2 and 3 have the same `splits` value, so `E` and `R` will
# go together
df
#> # A tibble: 30 × 4
#> # Groups:   ID [3]
#>       ID char                  interval                   splits
#>    <int> <chr>               <iv<date>>               <iv<date>>
#>  1    15 E     [2003-04-05, 2003-05-06) [2003-04-05, 2003-04-20)
#>  2    15 E     [2003-04-05, 2003-05-06) [2003-04-20, 2003-05-06)
#>  3    15 R     [2003-04-20, 2003-06-20) [2003-04-20, 2003-05-06)
#>  4    15 R     [2003-04-20, 2003-06-20) [2003-05-06, 2003-06-20)
#>  5    16 M     [2001-01-02, 2002-03-04) [2001-01-02, 2002-03-04)
#>  6    17 I     [2003-03-05, 2007-02-22) [2003-03-05, 2005-04-15)
#>  7    17 I     [2003-03-05, 2007-02-22) [2005-04-15, 2007-02-22)
#>  8    17 C     [2005-04-15, 2014-05-19) [2005-04-15, 2007-02-22)
#>  9    17 C     [2005-04-15, 2014-05-19) [2007-02-22, 2007-05-15)
#> 10    17 C     [2005-04-15, 2014-05-19) [2007-05-15, 2008-02-05)
#> # … with 20 more rows

# Group by (ID, splits) and paste the `char` column elements together
df %>%
  group_by(ID, splits) %>%
  summarise(char = paste0(char, collapse = ","), .groups = "drop")
#> # A tibble: 15 × 3
#>       ID                   splits char   
#>    <int>               <iv<date>> <chr>  
#>  1    15 [2003-04-05, 2003-04-20) E      
#>  2    15 [2003-04-20, 2003-05-06) E,R    
#>  3    15 [2003-05-06, 2003-06-20) R      
#>  4    16 [2001-01-02, 2002-03-04) M      
#>  5    17 [2003-03-05, 2005-04-15) I      
#>  6    17 [2005-04-15, 2007-02-22) I,C    
#>  7    17 [2007-02-22, 2007-05-15) C      
#>  8    17 [2007-05-15, 2008-02-05) C,I    
#>  9    17 [2008-02-05, 2010-06-07) C,M    
#> 10    17 [2010-06-07, 2010-09-22) C,M,V  
#> 11    17 [2010-09-22, 2011-02-14) C,M,V,P
#> 12    17 [2011-02-14, 2012-02-14) C,M,P  
#> 13    17 [2012-02-14, 2012-02-28) C,P    
#> 14    17 [2012-02-28, 2013-03-04) C,P,R  
#> 15    17 [2013-03-04, 2014-05-19) C,P

Created on 2022-04-05 by the reprex package (v2.0.1)

Gaius answered 5/4, 2022 at 14:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.