I am developing a tidyverse-based data workflow, and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals; it can be reproduced like this:
library(tidyverse)
library(lubridate)
my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)
Here's a tibble view of the same data frame:
> my_time_intervals
# A tibble: 8 x 4
     id group start_time          end_time
  <int> <int> <dttm>              <dttm>
1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     3     1 2018-05-07 13:02:04 2018-05-23 08:13:06
4     4     2 2018-02-28 17:43:29 2018-04-20 03:48:40
5     5     2 2018-04-20 01:19:52 2018-08-12 12:56:37
6     6     2 2018-04-18 20:47:22 2018-04-19 16:07:29
7     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
8     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42
A few notes about my_time_intervals:
- The data is divided into three groups via the group variable.
- The id variable is just a unique ID for each row in the data frame.
- The start and end of time intervals are stored in start_time and end_time in lubridate form.
- Some time intervals overlap, some don't, and they are not always in order. For example, row 1 overlaps with row 3, but neither of them overlaps with row 2.
- More than two intervals may overlap with each other, and some intervals fall completely within others. See rows 4 through 6 in group == 2.
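To make the overlap claims above concrete, here is a minimal sketch using lubridate's interval() and int_overlaps(), with rows 1 to 3 re-entered by hand. The names row1, row2, and row3 are only for illustration and are not part of my actual workflow.

```r
library(lubridate)

# Rows 1 to 3 of my_time_intervals (all in group 1), rebuilt as lubridate intervals.
row1 <- interval(ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"))
row2 <- interval(ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"))
row3 <- interval(ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"))

int_overlaps(row1, row3)  # TRUE: row 3 starts before row 1 ends
int_overlaps(row1, row2)  # FALSE
int_overlaps(row2, row3)  # FALSE
```

This only compares intervals pairwise; it does not by itself merge them.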
What I want is to collapse, within each group, any overlapping time intervals into contiguous intervals. In this case, my desired result would look like:
# A tibble: 5 x 4
     id group start_time          end_time
  <int> <int> <dttm>              <dttm>
1     1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     4     2 2018-02-28 17:43:29 2018-08-12 12:56:37
4     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
5     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42
Notice that time intervals that overlap between different groups are not merged. Also, I don't care about what happens to the id column at this point.
I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.
How can I achieve this?
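my_time_intervals %>%
  group_by(group) %>%
  arrange(start_time) %>%
  # Start a new indx whenever the next interval begins after every earlier one in the group has ended
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  # max() rather than last(): a nested interval can sort last yet end earlier
  summarise(start_time = first(start_time), end_time = max(end_time)) %>%
  select(-indx)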
– Diplostemonous
arrange. It works perfectly. – Diplostemonous
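For anyone reading the pipeline in the comment above, here is a minimal sketch that unpacks its index step on group 2 only (ids 4 to 7, re-entered by hand). The helper columns running_end and gap_after and the objects group2, steps, and merged are hypothetical names added for illustration. This sketch takes max(end_time) when summarising, an assumption on my part so that an interval nested inside an earlier one cannot shorten the merged span.

```r
library(dplyr)
library(lubridate)

# Group 2 of the question's data (ids 4 to 7), deliberately not in start order.
group2 <- tibble(
  id = 4:7,
  start_time = ymd_hms(c("2018-02-28 17:43:29", "2018-04-20 01:19:52",
                         "2018-04-18 20:47:22", "2018-10-02 14:08:03")),
  end_time   = ymd_hms(c("2018-04-20 03:48:40", "2018-08-12 12:56:37",
                         "2018-04-19 16:07:29", "2018-11-08 00:01:23"))
)

steps <- group2 %>%
  arrange(start_time) %>%
  mutate(
    # Latest end time seen so far, as seconds since the epoch
    running_end = cummax(as.numeric(end_time)),
    # TRUE when the next interval starts after everything so far has ended
    gap_after = as.numeric(lead(start_time)) > running_end,
    # Shift by one row so each interval carries the index of its own cluster
    indx = c(0, cumsum(gap_after)[-n()])
  )

steps$indx  # 0 0 0 1: ids 4, 6, 5 form one cluster, id 7 its own

# Collapse each cluster to its earliest start and latest end.
merged <- steps %>%
  group_by(indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time))
```

Here merged has two rows, matching the two group 2 rows of the desired result in the question.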