Similar to this question, I'd like to find the duration of overlap between pairs of timestamps using data.table
.
Here's my current code:
library(data.table)
DT <- fread(
"stage,ID,date1,date2
1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
sep = ","
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
breaks <- DT[, {
tmp <- unique(sort(c(date1, date2)))
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]
result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),
by = .EACHI, allow.cartesian = T] %>%
mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))
Which returns:
stage date1 date2 V1 lengthinseconds
1 1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A 900
2 1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C 300
3 1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C 2400
4 2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B 1200
5 2 2018-04-17 00:50:00 2018-04-17 01:10:00 B 1200
But I'd like to return only overlaps between user dyads (i.e. no more than two overlapping users). There are several hacky ways I can think of achieve this, such as:
library(dplyr)
library(tidyr)
result %>%
filter(nchar(V1)==3) %>%
tidyr::separate(V1, c("ID1", "ID2"))
Which returns:
stage date1 date2 ID1 ID2 lengthinseconds
1 1 2018-04-17 00:00:00 2018-04-17 00:15:00 B A 900
2 1 2018-04-17 00:20:00 2018-04-17 01:00:00 A C 2400
3 2 2018-04-17 00:30:00 2018-04-17 00:50:00 D B 1200
But this seems inelegant, especially when dealing with longer ID
strings and potentially hundreds of ID
s per overlap.
Ideally, I'd like to know if there's a way to modify the original data.table
code to return this directly.
ID1
andID2
alphabetically within each row further downstream but seems easier to sort before thepaste
command. – Infracostal