Collapse and merge overlapping time intervals

I am developing a tidyverse-based data workflow, and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals, and it can be reproduced like this:

library(tidyverse)
library(lubridate)

my_time_intervals <- tribble(
    ~id, ~group, ~start_time, ~end_time,
    1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
    2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
    3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
    4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
    5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
    6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
    7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
    8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

Here's a tibble view of the same data frame:

> my_time_intervals
# A tibble: 8 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     3     1 2018-05-07 13:02:04 2018-05-23 08:13:06
4     4     2 2018-02-28 17:43:29 2018-04-20 03:48:40
5     5     2 2018-04-20 01:19:52 2018-08-12 12:56:37
6     6     2 2018-04-18 20:47:22 2018-04-19 16:07:29
7     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
8     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

A few notes about my_time_intervals:

  1. The data is divided into three groups via the group variable.

  2. The id variable is just a unique ID for each row in the data frame.

  3. The start and end of each time interval are stored in start_time and end_time as lubridate date-times.

  4. Some time intervals overlap, some don't, and they are not always in order. For example, row 1 overlaps with row 3, but neither of them overlaps with row 2.

  5. More than two intervals may overlap with each other, and some intervals fall completely within others. See rows 4 through 6 in group == 2.

Within each group, I want to collapse any overlapping time intervals into contiguous intervals. For this data, my desired result would look like this:

# A tibble: 5 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     4     2 2018-02-28 17:43:29 2018-08-12 12:56:37
4     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
5     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Notice that time intervals that overlap between different groups are not merged. Also, I don't care about what happens to the id column at this point.

I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.

How can I achieve this?

Kilter answered 8/11, 2018 at 17:46 Comment(3)
my_time_intervals %>% group_by(group) %>% arrange(start_time) %>% mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) > cummax(as.numeric(end_time)))[-n()])) %>% group_by(group, indx) %>% summarise(start_time = first(start_time), end_time = last(end_time)) %>% select(-indx)Diplostemonous
Thank you @Masoud for the suggestion. I'm not sure what the code means, but I tried it and the results don't match my desired output in the question (I'll append the incorrect output with your code to the question so you can see it). Can you explain what your code does? Thank you!Kilter
You missed arrange. It works perfectly.Diplostemonous
my_time_intervals %>% 
  group_by(group) %>% arrange(start_time, .by_group = TRUE) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), 
            end_time = max(end_time)) %>%
  select(-indx)


# # A tibble: 5 x 3
# # Groups:   group [3]
# group start_time          end_time           
# <int> <dttm>              <dttm>             
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am making another dataset that has more overlapping times within each group, so the solution gets a more thorough workout and is hopefully easier to grasp:

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
  5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
  6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
  7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I am adding group to arrange() so that rows from the same group are shown together; but, as you know, because we have group_by(group) we do not actually need that.

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))


  # # A tibble: 13 x 5
  # # Groups:   group [3]
  # id group start_time          end_time             indx
  # <int> <int> <dttm>              <dttm>              <dbl>
  # 1     1      1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
  # 2     7      1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
  # 3     8      1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
  # 4     2      1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
  # 5     3      1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
  # 6     5      1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
  # 7     4      1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
  # 8     6      1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
  # 9     9      2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
  # 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
  # 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
  # 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
  # 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, in group one we have three distinct periods with overlapping datapoints, plus one datapoint that overlaps with nothing else in that group. The indx column divides those datapoints into four groups (i.e. 0, 1, 2 and 3). Later in the solution, when we group_by(group, indx), we gather each set of overlapping rows together and take the earliest start time and the latest end time to produce the desired output.

To make the solution less prone to errors (in case a datapoint starts sooner but also ends later than the others sharing its group and index, like the datapoints in rows 6 and 7 of the output above, ids 5 and 4), I dropped first() and last() in favor of min() and max().
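
For instance, here is a tiny illustration using the two end_times from rows 6 and 7 above (the vector name ends is just for this example, not part of the solution): within that pair, last() returns the wrong collapsed end while max() returns the right one.

ends <- ymd_hms(c("2018-07-19 18:09:01",   # end_time of id 5, sorted first within indx == 2
                  "2018-07-16 18:09:01"))  # end_time of id 4, sorted second
dplyr::last(ends)  # 2018-07-16 18:09:01 -- not the latest end
max(ends)          # 2018-07-19 18:09:01 -- the latest end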

So...

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time)) 


# # A tibble: 7 x 4
# # Groups:   group [?]
# group  indx start_time          end_time           
# <int> <dbl> <dttm>              <dttm>             
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each set of overlapping intervals to get the period (start and end) of each of them.

Beyond this point, you need to read about cumsum and cummax, and look at the output of these two functions for this specific problem, to understand why the comparison I made ends up giving us a unique identifier for each set of overlapping intervals.
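
As a starting point, here is a minimal sketch of those intermediate values for group 1 of the expanded dataset (the helper columns next_start, running_max and new_interval are only for illustration and are not part of the solution):

my_time_intervals %>%
  filter(group == 1) %>%
  arrange(start_time) %>%
  mutate(
    next_start   = lead(start_time),                           # start of the following interval
    running_max  = as_datetime(cummax(as.numeric(end_time))),  # latest end_time seen so far
    new_interval = as.numeric(next_start) > cummax(as.numeric(end_time))  # TRUE where a gap begins
  ) %>%
  select(id, start_time, end_time, next_start, running_max, new_interval)

Running cumsum() over new_interval, shifted down one row by the c(0, ...)[-n()] trick, is exactly what turns each run of overlapping rows into its own integer index.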

Diplostemonous answered 8/11, 2018 at 18:41 Comment(7)
Thank you @Masoud, the code worked for me this time (I'll remove the edit to my original question). It's hard for me to decipher what the mutate() line is doing, can you explain? Thanks!Kilter
Specifically, I don't understand what cumsum(as.numeric(lead(start_time)) > cummax(as.numeric(end_time)) )[-n()] is doing... Can someone explain? Thanks!Kilter
@Kilter Sorry, a bit busy today to illustrate that right now. But what you can do is mutate each part of that whole comparison and see their output. For example, mutate(cumsum(as.numeric(lead(start_time)))) and look at the output.Diplostemonous
Thanks @Masoud, I tried to break down the mutate() call and it definitely helped me understand it better. That said, I still don't get how indx becomes numerical and increments for each contiguous time interval. If you get a chance to explain the process a bit more I'd be super grateful!!Kilter
That's very helpful. Sorry super last question: What's the role of the 0 and [-n()] in the c()?Kilter
lead shifts the values up by one and puts an NA at the end; read ?lead() and look at lead(my_time_intervals$start_time). I get rid of that trailing entry with [-n()]; n() in the tidyverse gives the number of rows of the current group, so [-n()] drops the last element. I need a vector the same length as the data for mutate(), so I add 0 to the beginning. Why 0? Because the first row always belongs to the first interval, and cumsum then starts counting from 0. Look at the outputs within the pipe, because outside of it you cannot see the effect of grouping and you cannot use n() (for the latter you can manually define the last row though). Cheers.Diplostemonous
@Kilter read the above comment. Meanwhile, you need to read about all the functions used in this solution separately and then follow their use step by step in the solution. In R, type ?name_of_the_package::name_of_the_function() for each function. This will help you better understand each function on its own; investigating how they are used in this particular solution is then the next step.Diplostemonous

Another tidyverse method:

library(tidyverse)
library(lubridate)

my_time_intervals %>%
  arrange(group, start_time) %>%
  group_by(group) %>%
  # If an interval overlaps the next one, take the next interval's end_time as its new end
  mutate(new_end_time = if_else(end_time >= lead(start_time), lead(end_time), end_time),
         # g marks rows whose end_time changed, plus the last row of each group (lead() is NA there)
         g = new_end_time != end_time | is.na(new_end_time),
         end_time = if_else(end_time != new_end_time & !is.na(new_end_time), new_end_time, end_time)) %>%
  filter(g) %>%
  select(-new_end_time, -g)
Prussiate answered 8/11, 2018 at 18:59 Comment(3)
Thanks @avid_useR, one question: What does g = new_end_time != end_time | is.na(new_end_time) mean? I don't understand the = followed by != then |...Kilter
@Kilter new_end_time != end_time | is.na(new_end_time) is a logical expression which returns TRUE if either new_end_time is not equal to (!=) end_time, or (|) new_end_time is NA. The result is assigned to the variable g. The idea is that for an end_time that overlaps with the next start_time, end_time is replaced with the next end_time. g allows me to remove the unneeded "next rows" with filter after they have been merged into the current overlapping row.Prussiate
Thank you for the explanation, that makes sense! However, in group == 2 in my output when running your code, I see an interval from 2018-02-28 to 2018-04-19 when it should be from 2018-02-28 to 2018-08-12. This is because there were three overlapping intervals in the original data instead of two. And in my real, full dataset, there might be many more than three overlapping intervals. Can your solution address this? Thanks!Kilter

We could sort by start_time, then nest and use reduce on the subtables to merge rows when relevant (using Masoud's data):

library(tidyverse)
df %>% 
  arrange(start_time) %>%
  select(-id) %>%
  nest(start_time, end_time, .key = "startend") %>%
  mutate(startend = map(startend, ~ reduce(
    seq(nrow(.))[-1],
    # .x is the table built so far, .y the current row index, ..3 the whole subtable:
    # if the current row starts before (or at) the last accumulated end_time,
    # extend that end_time when needed; otherwise append the row as a new interval.
    ~ if(..3[.y,1] <= .x[nrow(.x),2]) 
        if(..3[.y,2] > .x[nrow(.x),2]) `[<-`(.x, nrow(.x), 2, value = ..3[.y,2])
        else .x
      else bind_rows(.x,..3[.y,]),
    .init = .[1,],
    .))) %>%
  arrange(group) %>%
  unnest()

# # A tibble: 7 x 3
# group          start_time            end_time
# <int>              <dttm>              <dttm>
# 1     1 2018-04-12 13:15:03 2018-05-23 10:13:06
# 2     1 2018-07-04 04:53:20 2018-07-14 20:09:01
# 3     1 2018-07-15 03:53:20 2018-07-19 20:09:01
# 4     1 2018-07-20 04:53:20 2018-07-22 20:09:01
# 5     2 2018-02-28 18:43:29 2018-08-12 14:56:37
# 6     2 2018-10-02 16:08:03 2018-11-08 01:01:23
# 7     3 2018-03-11 23:30:51 2018-10-20 23:01:42
Shaunteshave answered 14/11, 2018 at 9:47 Comment(6)
Cheers mate. Compare your output to mine. They are not quite the same (I guess your approach assumes that if an event starts sooner it should end sooner too, not sure tho).Diplostemonous
I don't see the difference, can you tell me which row and col ?Shaunteshave
e.g. all the start times (not the dates) in the first 4 or 5 rows.Diplostemonous
Right, I'm not on a computer so I'll test later, but the values I have are not in your original data, that's weird. I'll check it later, thanks for commenting.Shaunteshave
Did you get a chance to investigate this? (Asking out of curiosity)Diplostemonous
I just did, and what happens is that ymd_hms has tz="UTC" by default, but using tribbles changes the timezone to my local timezone, "CEST", and then the printing method of tibble doesn't display the timezone so you can't tell. Thus the data is "correct" but the display is wrong. Not sure if it qualifies as a bug but it's certainly counter intuitive, I'll file a github issue.Shaunteshave
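
If you want to check this on your own data, here is a minimal sketch using lubridate's tz() and with_tz(); the UTC conversion is only needed if the columns really did pick up a local timezone:

# tibble printing hides the timezone, but it is stored on the column
tz(my_time_intervals$start_time)

# Put both columns back on UTC without changing the underlying instants
my_time_intervals %>%
  mutate(start_time = with_tz(start_time, "UTC"),
         end_time   = with_tz(end_time, "UTC"))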

I think this problem can be solved very elegantly with a combination of dplyr and the ivs package, which is a package for working with interval vectors exactly like this.

The key here is iv_groups(), which merges all overlapping intervals and returns the set of intervals that remain after all overlaps have been merged.

library(tidyverse)
library(lubridate)
library(ivs)

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

# Combine the start/end boundaries into a single interval vector
my_time_intervals <- my_time_intervals %>%
  mutate(time = iv(start_time, end_time), .keep = "unused")

# Note that these are half-open intervals, but that won't affect anything here
my_time_intervals
#> # A tibble: 8 × 3
#>      id group                                       time
#>   <int> <int>                                 <iv<dttm>>
#> 1     1     1 [2018-04-12 11:15:03, 2018-05-14 02:32:10)
#> 2     2     1 [2018-07-04 02:53:20, 2018-07-14 18:09:01)
#> 3     3     1 [2018-05-07 13:02:04, 2018-05-23 08:13:06)
#> 4     4     2 [2018-02-28 17:43:29, 2018-04-20 03:48:40)
#> 5     5     2 [2018-04-20 01:19:52, 2018-08-12 12:56:37)
#> 6     6     2 [2018-04-18 20:47:22, 2018-04-19 16:07:29)
#> 7     7     2 [2018-10-02 14:08:03, 2018-11-08 00:01:23)
#> 8     8     3 [2018-03-11 22:30:51, 2018-10-20 21:01:42)

# For each `group` compute the interval "groups". These represent the collapsed
# date-time intervals that you are looking for.
my_time_intervals %>%
  group_by(group) %>%
  summarise(time = iv_groups(time), .groups = "drop")
#> # A tibble: 5 × 2
#>   group                                       time
#>   <int>                                 <iv<dttm>>
#> 1     1 [2018-04-12 11:15:03, 2018-05-23 08:13:06)
#> 2     1 [2018-07-04 02:53:20, 2018-07-14 18:09:01)
#> 3     2 [2018-02-28 17:43:29, 2018-08-12 12:56:37)
#> 4     2 [2018-10-02 14:08:03, 2018-11-08 00:01:23)
#> 5     3 [2018-03-11 22:30:51, 2018-10-20 21:01:42)

Created on 2022-04-05 by the reprex package (v2.0.1)

Bochum answered 5/4, 2022 at 14:55 Comment(1)
This looks elegant and a fantastic solution, thank you @Davis Vaughan! I'll add this to my tool kit.Kilter
