Subset observations that differ by at least 30 minutes

I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column, and a few other columns (in the example, I kept just one irrelevant column x to demonstrate that there are other columns present that need to be retained). A dput of example data is included below.

head(DT)
#              datetime          x id
#1: 2016-04-28 16:20:18 0.02461368  1
#2: 2016-04-28 16:41:34 0.88953932  1
#3: 2016-04-28 16:46:07 0.31818101  1
#4: 2016-04-28 17:00:56 0.14711365  1
#5: 2016-04-28 17:09:11 0.54406602  1
#6: 2016-04-28 17:39:09 0.69280341  1

Q: For each id, I need to subset only those observations that differ by more than 30 minutes. What could be an efficient data.table approach to do this (if possible, without extensive looping)?

The logic can also be described as follows (as in my comment below):

Per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
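
To make that rule concrete, here is a naive per-id sketch of the selection logic; keep_after_gap is just an illustrative helper name (not something from my real code), and the loop is far too slow for ~30 million rows, but it defines the expected result:

# Naive reference implementation: keep the first row, then keep the next row
# that is more than 30 minutes after the last kept row, and so on.
keep_after_gap <- function(times, gap_secs = 30 * 60) {
  keep <- logical(length(times))
  keep[1] <- TRUE
  last_kept <- times[1]
  for (i in seq_along(times)[-1]) {
    if (as.numeric(difftime(times[i], last_kept, units = "secs")) > gap_secs) {
      keep[i] <- TRUE
      last_kept <- times[i]
    }
  }
  keep
}

# applied per group (slow, but defines the expected result):
# DT[, keep_check := keep_after_gap(datetime), by = id]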

In the dput below, I added a column keep to indicate which rows should be kept in this example because they differ by more than 30 minutes from the previous observation that is kept per id. The difficulty is that it seems to be necessary to calculate the time differences iteratively (or at least, I cannot think of a more efficient approach at the moment).

library(data.table)
DT <- data.table::data.table(
  datetime = as.POSIXct(
    c(
      "2016-04-28 16:20:18.81561", "2016-04-28 16:41:34.81561",
      "2016-04-28 16:46:07.81561", "2016-04-28 17:00:56.81561",
      "2016-04-28 17:09:11.81561", "2016-04-28 17:39:09.81561",
      "2016-04-28 17:50:01.81561", "2016-04-28 17:51:46.81561",
      "2016-04-28 17:57:58.81561", "2016-04-28 17:58:23.81561",
      "2016-04-28 16:13:19.81561", "2016-04-28 16:13:44.81561",
      "2016-04-28 16:36:44.81561", "2016-04-28 16:55:31.81561",
      "2016-04-28 17:00:33.81561", "2016-04-28 17:11:51.81561",
      "2016-04-28 17:14:14.81561", "2016-04-28 17:26:17.81561",
      "2016-04-28 17:51:02.81561", "2016-04-28 17:56:36.81561"
    )
  ) |>
    structure(tzone = NULL),
  x = c(
    0.0246136845089495, 0.889539316063747, 0.318181007634848, 0.147113647311926,
    0.544066024711356, 0.6928034061566, 0.994269776623696, 0.477795971091837,
    0.231625785352662, 0.963024232536554, 0.216407935833558, 0.708530468167737,
    0.758459537522867, 0.640506813768297, 0.902299045119435, 0.28915973729454,
    0.795467417687178, 0.690705278422683, 0.59414202044718, 0.655705799115822
  ),
  id = rep(1:2, each = 10L),
  keep = c(
    TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE,
    FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
    TRUE
  )
)

setkey(DT, id, datetime)
# minutes elapsed since the previous observation within each id (NA for the first row)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA, type = "lag"), units = "mins"),
   by = id]
DT[is.na(difftime), difftime := 0]
# cumulative minutes since the first observation per id (this column shows up in the output tables below)
DT[, difftime := cumsum(as.numeric(difftime)), by = id]

Explanation of the keep column:

  • Rows 2:3 differ by less than 30 minutes from row 1 -> delete
  • Row 4 differs by more than 30 minutes from row 1 -> keep
  • Row 5 differs by less than 30 minutes from row 4 -> delete
  • Row 6 differs by more than 30 minutes from row 4 -> keep
  • ...

Desired output:

desiredDT <- DT[(keep)]

Thanks for the three expert answers I received. I tested them on 1 and 10 million rows of data. Here's an excerpt of the benchmarks.

a) 1 million rows

library(microbenchmark)
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
               times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq      max neval
#   frank(DT_Frank)  1.286647  1.277104  1.185216  1.267769  1.140614 1.036749     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000 1.000000     3
#   eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901     3
#   eddi2(DT_Eddi2)  9.966078  9.915651  9.210168  9.866330  8.877769 8.070281     3

b) 10 million rows

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2), 
                times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq       max neval
#   frank(DT_Frank)  1.019561  1.025427  1.026681  1.031061  1.030028  1.029037     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     3
#   eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143     3
#   eddi2(DT_Eddi2)  9.796800  9.693823  9.526193  9.594931  9.398969  9.211019     3

Apparently, @Frank's data.table approach and @Roland's Rcpp-based solution are similar in performance, with Rcpp having a slight advantage, while @eddi's approaches are still fast but not as performant as the others.

However, when I checked for equality of the solutions, I found that @Roland's approach has a slightly different result than the others:

a) 1 million rows

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (982228, 982224) differ"
#[2] "Component “id”: Numeric: lengths (982228, 982224) differ"      
#[3] "Component “x”: Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

b) 10 million rows

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (9981898, 9981891) differ"
#[2] "Component “id”: Numeric: lengths (9981898, 9981891) differ"      
#[3] "Component “x”: Numeric: lengths (9981898, 9981891) differ"       
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

My current assumption is that this difference might be related to whether the time difference is tested with > 30 minutes or >= 30 minutes, though I'm not sure about that yet.
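
To pin down where the results diverge, one option is a set difference on the result tables. This is only a sketch: res_frank and res_roland are placeholder names for the two outputs, and fsetdiff() requires data.table 1.9.8 or later.

# locate rows kept by one approach but not by the other
res_frank  <- frank(DT_Frank)
res_roland <- roland(DT_Roland)
fsetdiff(res_frank, res_roland)   # rows kept only by @Frank's approach
fsetdiff(res_roland, res_frank)   # rows kept only by @Roland's approach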

Final thought: I decided to go with @Frank's solution for two reasons: (1) it performs very well, almost on par with the Rcpp solution, and (2) it doesn't require another package that I'm not yet very familiar with (I'm using data.table anyway).

Barfly answered 28/4, 2016 at 14:48 Comment(7)
These are the kinds of tasks for which I think a good C/C++ solution is valuable. There is no obvious R-vectorized way, and writing the conditions you described should be pretty simple in C or C++. If you know a bit about how to write C/C++ functions callable from R, I'd suggest that route. – Demanding
Should be easy with an overlapping join; just prepare the from and to dates for each id. – Ramberg
@Ramberg I was hoping that such a solution exists, but I don't know the from and to dates a priori. Or do you have an idea how to compute them? – Barfly
"differ by more than 30 minutes" – differ from what, precisely? – Ramberg
I agree with @nicola. This would be simple with Rcpp. It's not worth hurting your brain to come up with an R solution. – Grozny
@Ramberg Per id, the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins, and so on. – Barfly
@Grozny and nicola, thanks, I will definitely consider that. In case you have an idea for an Rcpp-based solution, it would be great to see it. – Barfly
Answer by Faulkner (score 11)

Here's what I would do:

setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data

s = 0L
w = DT[, .I[1L], by=id]$V1      # row numbers of the first (always kept) row per id

while (length(w)){
   s = s + 1L
   DT[w, tag := s]              # mark the current batch of kept rows

   # for each kept row, find the first row of the same id at or after datetime + 30 minutes
   m = DT[w, .(id, datetime = datetime+30*60)]
   w = DT[m, which = TRUE, roll=-Inf]   # rolling join returns the matching row numbers in DT
   w = w[!is.na(w)]             # ids with no further match drop out of the loop
}

which gives

               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4

The idea behind it is described by the OP in a comment:

per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
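
If only the subset itself is needed, the tagged rows can then be extracted; this is a small sketch added here, not part of the original answer:

result <- DT[!is.na(tag)]   # the kept rows are exactly those that received a tag
result[, tag := NULL]       # drop the helper column if it is no longer needed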

Faulkner answered 28/4, 2016 at 16:29 Comment(1)
Thanks Frank, that looks very promising. I will try it as soon as I'm home. (+1) – Barfly
Answer by Grozny (score 9)

Using Rcpp:

library(Rcpp)
library(inline)
cppFunction(
  'LogicalVector selecttimes(const NumericVector x) {
   const int n = x.length();
   LogicalVector res(n);
   res(0) = true;                     // the first row of each group is always kept
   double testval = x(0);             // time of the last kept row (POSIXct = seconds since epoch)
   for (int i=1; i<n; i++) {
    if (x(i) - testval > 30 * 60) {   // more than 30 minutes after the last kept row
      testval = x(i);
      res(i) = true;
    }
   }
   return res;
  }')

DT[, keep1 := selecttimes(datetime), by = id]

DT[, all(keep == keep1)]
#[1] TRUE

Some additional testing should be done; it needs input validation, and the time difference could be made a parameter.
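
For instance, a parameterized variant might look roughly like the following sketch (untested on the real data; selecttimes_gap is just an illustrative name):

cppFunction(
  'LogicalVector selecttimes_gap(const NumericVector x, const double gap) {
    const int n = x.length();
    LogicalVector res(n);
    if (n == 0) return res;          // guard against empty groups
    res(0) = true;
    double testval = x(0);
    for (int i = 1; i < n; i++) {
      if (x(i) - testval > gap) {    // gap is in seconds, e.g. 30 * 60
        testval = x(i);
        res(i) = true;
      }
    }
    return res;
  }')

DT[, keep2 := selecttimes_gap(datetime, 30 * 60), by = id]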

Grozny answered 28/4, 2016 at 16:40 Comment(2)
Thanks Roland, this also looks like a very good approach! Now that I have two solutions, I will test both on the real data tomorrow (haven't got it at home). +1 – Barfly
Minor thing: you can define testval = x(i) + 30*60 and test x(i) > testval for fewer computations. (Not sure if there's some edge case I'm missing.) – Faulkner
Answer by Pontonier (score 7)

# create an index column
DT[, idx := 1:.N, by = id]

# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
                    , idx, roll = -Inf]]
#               datetime          x id  keep         difftime idx  fut.idx
# 1: 2016-04-28 09:20:18 0.02461368  1  TRUE   0.0000000 mins   1        4
# 2: 2016-04-28 09:41:34 0.88953932  1 FALSE  21.2666667 mins   2        6
# 3: 2016-04-28 09:46:07 0.31818101  1 FALSE  25.8166667 mins   3        6
# 4: 2016-04-28 10:00:56 0.14711365  1  TRUE  40.6333333 mins   4        6
# 5: 2016-04-28 10:09:11 0.54406602  1 FALSE  48.8833333 mins   5        7
# 6: 2016-04-28 10:39:09 0.69280341  1  TRUE  78.8500000 mins   6       NA
# 7: 2016-04-28 10:50:01 0.99426978  1 FALSE  89.7166667 mins   7       NA
# 8: 2016-04-28 10:51:46 0.47779597  1 FALSE  91.4666667 mins   8       NA
# 9: 2016-04-28 10:57:58 0.23162579  1 FALSE  97.6666667 mins   9       NA
#10: 2016-04-28 10:58:23 0.96302423  1 FALSE  98.0833333 mins  10       NA
#11: 2016-04-28 09:13:19 0.21640794  2  TRUE   0.0000000 mins   1        4
#12: 2016-04-28 09:13:44 0.70853047  2 FALSE   0.4166667 mins   2        4
#13: 2016-04-28 09:36:44 0.75845954  2 FALSE  23.4166667 mins   3        6
#14: 2016-04-28 09:55:31 0.64050681  2  TRUE  42.2000000 mins   4        8
#15: 2016-04-28 10:00:33 0.90229905  2 FALSE  47.2333333 mins   5        9
#16: 2016-04-28 10:11:51 0.28915974  2 FALSE  58.5333333 mins   6        9
#17: 2016-04-28 10:14:14 0.79546742  2 FALSE  60.9166667 mins   7        9
#18: 2016-04-28 10:26:17 0.69070528  2  TRUE  72.9666667 mins   8       10
#19: 2016-04-28 10:51:02 0.59414202  2 FALSE  97.7166667 mins   9       NA
#20: 2016-04-28 10:56:36 0.65570580  2  TRUE 103.2833333 mins  10       NA


# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]

# iterate over the matching indices (jumping straight to the correct one)
DT[, {
       next.idx = 1

       while(!is.na(next.idx)) {
         set(DT, .I[next.idx], 'keep.new', TRUE)
         next.idx = fut.idx[next.idx]
       }
     }, by = id]

DT[, identical(keep, keep.new)]
#[1] TRUE

Alternatively, for the last step you can do the following (it will iterate over the entire thing, but I don't know what the speed impact would be):

DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = T))]
      , by = id]$V1
   , keep.3 := TRUE]

DT[, identical(keep, keep.3)]
#[1] TRUE
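
Either way, the final subset can then be extracted and the helper columns dropped; this is a sketch added for completeness (adjust the column list to whichever helpers were actually created):

result <- DT[(keep.new)]
result[, c("idx", "fut.idx", "keep.new") := NULL]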
Pontonier answered 28/4, 2016 at 18:35 Comment(4)
On 1.9.7, DT[, idx := rowid(id)] I guess. Yeah, DT[, {...set(DT,...)...}] looks strange to me. – Faulkner
@Faulkner I don't think so - I couldn't figure out a way to do the jumps without a simple loop. – Pontonier
Yeah, I just noticed my guess was wrong... seems like there should be another way, though. – Faulkner
Thanks for your answer, eddi! Another promising approach that I will include in my test tomorrow. +1 – Barfly
