do() superseded! Alternative is to use across(), nest_by(), and summarise, how?
Asked Answered
B

5

6

I'm doing something quite simple. Given a dataframe of start dates and end dates for specific periods I want to expand/create a full sequence for each period binned by week (with the factor for each row), then output this in a single large dataframe.

For instance:

library(tidyverse)
library(lubridate)

# Dataset
  start_dates = ymd_hms(c("2019-05-08 00:00:00",
                          "2020-01-17 00:00:00",
                          "2020-03-03 00:00:00",
                          "2020-05-28 00:00:00",
                          "2020-12-10 00:00:00",
                          "2021-05-07 00:00:00",
                          "2022-01-04 00:00:00"), tz = "UTC")
  
  end_dates = ymd_hms(c( "2019-10-24 00:00:00",
                         "2020-03-03 00:00:00", 
                         "2020-05-28 00:00:00",
                         "2020-12-10 00:00:00",
                         "2021-05-07 00:00:00",
                         "2022-01-04 00:00:00",
                         "2022-01-19 00:00:00"), tz = "UTC") 
  
  df1 = data.frame(studying = paste0("period",seq(1:7),sep = ""),start_dates,end_dates)

It was suggested to me to use do(), which currently works fine but I hate it when things are superseded. I also have a way of doing it using map2. But reading the file (https://dplyr.tidyverse.org/reference/do.html) suggests you can use nest_by(), across() and summarise() to do the same job as do(), how would I go about getting same result? I've tried a lot of things but I just can't seem to get it.

# do() way to do it
df1 %>% 
  group_by(studying) %>% 
  do(data.frame(week=seq(.$start_dates,.$end_dates,by="1 week")))
# transmute() way to do it
 df1 %>% 
  transmute(weeks = map2(start_dates,end_dates, seq, by = "1 week"), studying) 
 %>% unnest(cols = c(weeks))
Bainite answered 19/1, 2022 at 15:58 Comment(3)
across is going to iterate across columns, but you want to operate on both date columns simultaneously, so I don't think it is appropriate. nest_by(x, y) is effectively the same as group_by(x, y) %>% nest(), so it also doesn't really do what you need, since you'll still need to iterate over each group to come up with the sequence. summarize can be used in place of transmute. In the end, your transmute code (and my deleted answer, posted before reading the rest of your question) is the way to go.Dangerous
What about the transmute version isn't what you want? I think the nest / across / summarise suggestion is just that, a suggestion, based on the assumption that do was often used for more complicated summarizations or operations to be done on entire data framesEs
Nothing specific, It's just I was interested in knowing alternatives. As they specifically mention nest_by() I thought it could be useful to know.Bainite
C
3

As the documentation of ?do suggests, we can now use summarise and replace the . with across():

library(tidyverse)
library(lubridate)

df1 %>% 
  group_by(studying) %>% 
  summarise(week = seq(across()$start_dates,
                       across()$end_dates,
                       by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups:   studying [7]
#>    studying week               
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

Created on 2022-01-19 by the reprex package (v0.3.0)

Cabal answered 19/1, 2022 at 18:47 Comment(2)
Thanks everyone for the input. Some really interesting approaches and all great. I chose this as top answer as I guess it might be more in the "across" spirit. What is the data.frame() for out of interest. Does it not produce the same without?Bainite
@Dasr: True, we don't need the data.frame call. Nice spot. I just took your code and replaced do with summarise and the dot . with across without realizing that we don't need to wrap this in data.frame anymore :)Cabal
A
3

You can also use tidyr::complete:

df1 %>% 
  group_by(studying) %>% 
  complete(start_dates = seq(from = start_dates, to = end_dates, by = "1 week")) %>% 
  select(-end_dates, weeks = start_dates)

# A tibble: 134 x 2
# Groups:   studying [7]
   studying weeks              
   <chr>    <dttm>             
 1 period1  2019-05-08 00:00:00
 2 period1  2019-05-15 00:00:00
 3 period1  2019-05-22 00:00:00
 4 period1  2019-05-29 00:00:00
 5 period1  2019-06-05 00:00:00
 6 period1  2019-06-12 00:00:00
 7 period1  2019-06-19 00:00:00
 8 period1  2019-06-26 00:00:00
 9 period1  2019-07-03 00:00:00
10 period1  2019-07-10 00:00:00
# ... with 124 more rows
Abstinence answered 19/1, 2022 at 17:11 Comment(0)
C
3

As the documentation of ?do suggests, we can now use summarise and replace the . with across():

library(tidyverse)
library(lubridate)

df1 %>% 
  group_by(studying) %>% 
  summarise(week = seq(across()$start_dates,
                       across()$end_dates,
                       by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups:   studying [7]
#>    studying week               
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

Created on 2022-01-19 by the reprex package (v0.3.0)

Cabal answered 19/1, 2022 at 18:47 Comment(2)
Thanks everyone for the input. Some really interesting approaches and all great. I chose this as top answer as I guess it might be more in the "across" spirit. What is the data.frame() for out of interest. Does it not produce the same without?Bainite
@Dasr: True, we don't need the data.frame call. Nice spot. I just took your code and replaced do with summarise and the dot . with across without realizing that we don't need to wrap this in data.frame anymore :)Cabal
S
3

Although marked Experimental the help file for group_modify does say that

‘group_modify()’ is an evolution of ‘do()’

and, in fact, the code for the example in the question using group_modify is nearly the same as with do.

# with group_modify
df2 <- df1 %>% 
  group_by(studying) %>% 
  group_modify(~ data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))

# with do
df0 <- df1 %>% 
  group_by(studying) %>% 
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))

identical(df2, df0)
## [1] TRUE
Selfpronouncing answered 19/1, 2022 at 22:42 Comment(0)
O
2

Not sure if this exactly what you are looking for, but here is my attempt with rowwise and unnest

df1 %>% 
  rowwise() %>% 
  mutate(week = list(seq(start_dates, end_dates, by = "1 week"))) %>% 
  select(studying, week) %>% 
  unnest(cols = c(week))
Oophorectomy answered 19/1, 2022 at 16:23 Comment(1)
Cool approach, I'll leave it up a little while and see what other lines people are taking :)Bainite
Z
0

Another approach:

library(tidyverse)

df1 %>%
    group_by(studying) %>%
    summarise(df = tibble(weeks = seq(start_dates, end_dates, by = 'week'))) %>%
    unnest(df)
#> `summarise()` has grouped output by 'studying'. You can override using the `.groups` argument.
#> # A tibble: 134 × 2
#> # Groups:   studying [7]
#>    studying weeks              
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

Created on 2022-01-20 by the reprex package (v2.0.1)

Zarger answered 20/1, 2022 at 6:15 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.