How to solve gaps and island problems in R and performance vs SQL?
Asked Answered
F

1

12

I was wondering whether the island and gaps problems can be solved in R efficiently, similar to SQL. I have the following data available, if we examine one ID:

ID StartDate  StartTime EndDate      EndTime 
1  19-05-2014 19:00     19-05-2014   20:00
1  19-05-2014 19:30     19-05-2014   23:30
1  19-05-2014 16:00     19-05-2014   18:00
1  20-05-2014 20:00     20-05-2014   20:30

Notice that the first two rows overlap, what I would like to do, is merge the overlapping rows, resulting:

ID StartDate  StartTime EndDate      EndTime 
1  19-05-2014 19:00     19-05-2014   23:30
1  19-05-2014 16:00     19-05-2014   18:00
1  20-05-2014 20:00     20-05-2014   20:30

Is there a way to do this in R?

I am well aware this is done in SQL, but since my data is already in R, I prefer to do this in R. Second, I have some questions regarding the performance of finding gaps and islands, I know that SQL is very fast in doing that, but I wonder whether R is faster due to all the data being in memory.

I would like to use data.table to do this, but I don't know how.

UPDATE - Response to Arun

I have created the following test case that contains every possible interval orientation.

dat <- structure(
  list(ID = c(1L, 1L, 1L, 1L, 1L, 1L), 
       stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
                           as.POSIXct("2014-01-15 10:00:00"),
                           as.POSIXct("2014-01-15 08:30:00"),
                           as.POSIXct("2014-01-15 09:00:00"),
                           as.POSIXct("2014-01-15 11:30:00"),
                           as.POSIXct("2014-01-15 12:00:00")),
                         class = c("POSIXct", "POSIXt"), tzone = ""),
       etime = structure(c(as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 11:00:00"),
                           as.POSIXct("2014-01-15 10:00:00"), 
                           as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 12:30:00"),
                           as.POSIXct("2014-01-15 13:00:00")), 
                         class = c("POSIXct", "POSIXt"), tzone = "")
  ),
  .Names = c("ID", "stime", "etime"),
  sorted = c("ID", "stime", "etime"),
  class = c("data.table", "data.frame"),
  row.names = c(NA,-6L)
)

I would expect that the interval from 8:30 - 10:00 will be "glued" onto 10:00 - 11:00, but that was not the case. The result was:

   idx ID               stime               etime
1:   4  1 2014-01-15 08:00:00 2014-01-15 10:00:00
2:   3  1 2014-01-15 10:00:00 2014-01-15 11:00:00
3:   6  1 2014-01-15 11:30:00 2014-01-15 13:00:00

The following data set provides a more thorough testing:

# The numbers represent seconds from 1970-01-01 01:00:01
dat <- structure(
  list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
       stime = structure(c(as.POSIXct("2014-01-15 08:00:00"),
                           as.POSIXct("2014-01-15 10:00:00"),
                           as.POSIXct("2014-01-15 08:30:00"),
                           as.POSIXct("2014-01-15 09:00:00"),
                           as.POSIXct("2014-01-15 11:30:00"),
                           as.POSIXct("2014-01-15 12:00:00"),
                           as.POSIXct("2014-01-15 07:30:00"),
                           as.POSIXct("2014-01-15 08:00:00"),
                           as.POSIXct("2014-01-15 08:30:00"),
                           as.POSIXct("2014-01-15 09:00:00"),
                           as.POSIXct("2014-01-15 09:00:00"),
                           as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 10:00:00")
                           ),
                         class = c("POSIXct", "POSIXt"), tzone = ""),
       etime = structure(c(as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 11:00:00"),
                           as.POSIXct("2014-01-15 10:00:00"), 
                           as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 12:30:00"),
                           as.POSIXct("2014-01-15 13:00:00"),
                           as.POSIXct("2014-01-15 08:30:00"),
                           as.POSIXct("2014-01-15 09:00:00"),
                           as.POSIXct("2014-01-15 09:30:00"),
                           as.POSIXct("2014-01-15 10:00:00"),
                           as.POSIXct("2014-01-15 10:00:00"),
                           as.POSIXct("2014-01-15 10:30:00"),
                           as.POSIXct("2014-01-15 11:00:00")
                           ), 
                         class = c("POSIXct", "POSIXt"), tzone = "")
  ),
  .Names = c("ID", "stime", "etime"),
  sorted = c("ID", "stime", "etime"),
  class = c("data.table", "data.frame"),
  row.names = c(NA,-6L)
)

So our result is:

   idx ID               stime               etime
1:   4  1 2014-01-15 08:00:00 2014-01-15 10:00:00
2:   3  1 2014-01-15 10:00:00 2014-01-15 11:00:00
3:   6  1 2014-01-15 11:30:00 2014-01-15 13:00:00
4:  12  2 2014-01-15 07:30:00 2014-01-15 09:30:00
5:  13  2 2014-01-15 09:00:00 2014-01-15 11:00:00

Now for respondent with ID=2, we see that the intervals are overlapping, but not reported as one interval. The correct solution would be:

   idx ID               stime               etime
1:   ?  1 2014-01-15 08:00:00 2014-01-15 11:00:00
3:   ?  1 2014-01-15 11:30:00 2014-01-15 13:00:00
4:  ??  2 2014-01-15 07:30:00 2014-01-15 11:00:00

Update - Benchmarks and testing and large datasets

I have the following dataset with about 1000 users, each having 500 durations, giving 0.5 million rows. You can download the dataset at my Google Drive, including the solution in Google Drive.

SQL Server 2014 on a laptop of 8GB RAM, 64-bit, i5-4210U CPU @ 1.70Ghz - 2.39Ghz takes about 5 seconds to do this using the solution provided by Itzik Ben-Gan in SQL. The 5 seconds are excluding the process of creating a function. In addition, no indices are created for any table whatsoever.

PS: I use library(lubridate);

Filomenafiloplume answered 3/6, 2015 at 20:19 Comment(3)
what's the SQL solution?Galligan
Dear Eddi, here are some examples of that problem: sqlmag.com/blog/…. Itzik Ben-Gan was able to do that in an impressive 2 seconds.Filomenafiloplume
@alexis_laz, yes it would work here, but POSIXct can also have milliseconds in them, and that'd fail (as IRanges::reduce would implicitly convert it to integer ranges)..Thadeus
G
10

Here's a very simple idea. Order by start time, then find cumulative max of end time. Once you've done that, the overlap groups are simply those where the next start time is still less than or equal to current cumulative max end time (all done by ID):

setorder(dat, ID, stime) # ordering by ID is unnecessary, it's just prettier

dat[, etime.max := as.POSIXct(cummax(as.numeric(etime)), origin = '1970-01-01'), by = ID]

# find the grouping of intervals (1:.N hack is to avoid warnings when .N=1)
dat[, grp := cumsum(c(FALSE, stime[2:.N] > etime.max[1:(.N-1)]))[1:.N], by = ID]

dat[, .(stime = min(stime), etime = max(etime)), by = .(ID, grp)][, grp := NULL][]
#   ID               stime               etime
#1:  1 2014-01-15 08:00:00 2014-01-15 11:00:00
#2:  1 2014-01-15 11:30:00 2014-01-15 13:00:00
#3:  2 2014-01-15 07:30:00 2014-01-15 11:00:00

Since this doesn't need to find all possible overlaps, it's very fast. On a simulated data set that roughly matches OP's description it's instantaneous for me (< 0.2s).

Galligan answered 3/6, 2015 at 21:14 Comment(7)
This might sound stupid of me, but could you elaborate on what you mean with cumulative max of end time? (I am not very good with data.table yet, but I am eager to learn how the 3rd line of your code works).Filomenafiloplume
I mean maximum end time up to current point. Check out ?cummax for an example. As for the 3rd line, check out what each piece separately computes (for a single ID), and then it'll make sense. It might also help for visualization to convert etime.max to POSIXct (actually, I just edited to do that, makes it easier to read).Galligan
Thanks eddi, I understand your code now, but I am wondering what ".(stuff)" is called. I thought it should be in lists, but I can't find anything regarding the dot parentheses notation. Or maybe I have overlooked it. It seems like a very elegant solution that might work. I will check it with the full 500 000 rows that I use for testing and report back :).Filomenafiloplume
The dot is equivalent to writing list and imo is much more readable. I want to say it was introduced recently, since I started doing it recently, but I don't remember for sure. Cool, I look forward to the testing :)Galligan
@Galligan data.table level master :)Endrin
This is mucccccch nicer, avoiding overlaps altogether!Thadeus
Thanks Eddi, it was way faster than SQL. I got 1 second for the testing data.Filomenafiloplume

© 2022 - 2024 — McMap. All rights reserved.