I was wondering whether the island and gaps problems can be solved in R efficiently, similar to SQL. I have the following data available, if we examine one ID:
ID StartDate StartTime EndDate EndTime
1 19-05-2014 19:00 19-05-2014 20:00
1 19-05-2014 19:30 19-05-2014 23:30
1 19-05-2014 16:00 19-05-2014 18:00
1 20-05-2014 20:00 20-05-2014 20:30
Notice that the first two rows overlap, what I would like to do, is merge the overlapping rows, resulting:
ID StartDate StartTime EndDate EndTime
1 19-05-2014 19:00 19-05-2014 23:30
1 19-05-2014 16:00 19-05-2014 18:00
1 20-05-2014 20:00 20-05-2014 20:30
Is there a way to do this in R?
I am well aware this is done in SQL, but since my data is already in R, I prefer to do this in R. Second, I have some questions regarding the performance of finding gaps and islands, I know that SQL is very fast in doing that, but I wonder whether R is faster due to all the data being in memory.
I would like to use data.table
to do this, but I don't know how.
UPDATE - Response to Arun
I have created the following test case that contains every possible interval orientation.
dat <- structure(
list(ID = c(1L, 1L, 1L, 1L, 1L, 1L),
stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 11:30:00"),
as.POSIXct("2014-01-15 12:00:00")),
class = c("POSIXct", "POSIXt"), tzone = ""),
etime = structure(c(as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 11:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 12:30:00"),
as.POSIXct("2014-01-15 13:00:00")),
class = c("POSIXct", "POSIXt"), tzone = "")
),
.Names = c("ID", "stime", "etime"),
sorted = c("ID", "stime", "etime"),
class = c("data.table", "data.frame"),
row.names = c(NA,-6L)
)
I would expect that the interval from 8:30 - 10:00 will be "glued" onto 10:00 - 11:00, but that was not the case. The result was:
idx ID stime etime
1: 4 1 2014-01-15 08:00:00 2014-01-15 10:00:00
2: 3 1 2014-01-15 10:00:00 2014-01-15 11:00:00
3: 6 1 2014-01-15 11:30:00 2014-01-15 13:00:00
The following data set provides a more thorough testing:
# The numbers represent seconds from 1970-01-01 01:00:01
dat <- structure(
list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 11:30:00"),
as.POSIXct("2014-01-15 12:00:00"),
as.POSIXct("2014-01-15 07:30:00"),
as.POSIXct("2014-01-15 08:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 10:00:00")
),
class = c("POSIXct", "POSIXt"), tzone = ""),
etime = structure(c(as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 11:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 12:30:00"),
as.POSIXct("2014-01-15 13:00:00"),
as.POSIXct("2014-01-15 08:30:00"),
as.POSIXct("2014-01-15 09:00:00"),
as.POSIXct("2014-01-15 09:30:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 10:00:00"),
as.POSIXct("2014-01-15 10:30:00"),
as.POSIXct("2014-01-15 11:00:00")
),
class = c("POSIXct", "POSIXt"), tzone = "")
),
.Names = c("ID", "stime", "etime"),
sorted = c("ID", "stime", "etime"),
class = c("data.table", "data.frame"),
row.names = c(NA,-6L)
)
So our result is:
idx ID stime etime
1: 4 1 2014-01-15 08:00:00 2014-01-15 10:00:00
2: 3 1 2014-01-15 10:00:00 2014-01-15 11:00:00
3: 6 1 2014-01-15 11:30:00 2014-01-15 13:00:00
4: 12 2 2014-01-15 07:30:00 2014-01-15 09:30:00
5: 13 2 2014-01-15 09:00:00 2014-01-15 11:00:00
Now for respondent with ID=2, we see that the intervals are overlapping, but not reported as one interval. The correct solution would be:
idx ID stime etime
1: ? 1 2014-01-15 08:00:00 2014-01-15 11:00:00
3: ? 1 2014-01-15 11:30:00 2014-01-15 13:00:00
4: ?? 2 2014-01-15 07:30:00 2014-01-15 11:00:00
Update - Benchmarks and testing and large datasets
I have the following dataset with about 1000 users, each having 500 durations, giving 0.5 million rows. You can download the dataset at my Google Drive, including the solution in Google Drive.
SQL Server 2014 on a laptop of 8GB RAM, 64-bit, i5-4210U CPU @ 1.70Ghz - 2.39Ghz takes about 5 seconds to do this using the solution provided by Itzik Ben-Gan in SQL. The 5 seconds are excluding the process of creating a function. In addition, no indices are created for any table whatsoever.
PS: I use library(lubridate);