How to speed up iteration over a data frame with over 5 million observations in R?

I am trying to generate values for 7 variables across millions of observations, and it takes forever when I write a for loop to do it. Below is an example of what I am trying to achieve. In this case it runs fast since there are only a few thousand observations:

# Load the tidyverse (dplyr, purrr, etc.)

library(tidyverse)
set.seed(50)

df <- tibble(SlNo = 1:2000,
             Scenario = rep(c(1, 2, 3, 4), 500),
             A = round(rnorm(2000, 11, 6)),
             B = round(rnorm(2000, 15, 4))) %>%
      arrange(Scenario)

# Split the data frame to prepend a starting row to each scenario

df <- df %>% split(f = .$Scenario) %>%
  map_dfr(~ bind_rows(tibble(Scenario = 0), .x))

# The newly added rows get specific starting values for C and E

df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
                    E = if_else(Scenario != 0, 0, 6))

for (i in 2:nrow(df)) {
  df$C[i] <- if_else(df$Scenario[i] != 0,
                     (1 - 0.5) * df$C[i - 1] + 3 + 2 + df$B[i] + df$E[i - 1],
                     df$C[i])
  df$E[i] <- if_else(df$Scenario[i] != 0, df$C[i] + df$B[i] - 50, df$E[i])
}

df

# A tibble: 2,004 x 6
   Scenario  SlNo     A     B     C      E
      <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
 1        0    NA    NA    NA   4     6   
 2        1     1    14    19  32     1   
 3        1     5     1    13  35    -2   
 4        1     9    17    20  40.5  10.5 
 5        1    13     8     7  42.8  -0.25
 6        1    17    10    16  42.1   8.12
 7        1    21     9    12  46.2   8.19
 8        1    25    14    18  54.3  22.3 
 9        1    29    14    15  69.4  34.4 
10        1    33     4    17  91.1  58.1 
# ... with 1,994 more rows
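
To make the recurrence concrete, here is the first non-seed row of scenario 1 worked out by hand (B = 19, with the seed values C = 4 and E = 6 from the Scenario == 0 row), matching row 2 of the tibble above:

C2 <- (1 - 0.5) * 4 + 3 + 2 + 19 + 6  # = 32, as printed above
E2 <- C2 + 19 - 50                    # = 1, as printed above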

I'd like to produce similar results quickly while working with larger data frames. I appreciate any help on this. Thank you in advance!!

Kneedeep asked 12/2, 2019 at 19:19
Could you write out what your aim is in words? I think that would make it faster to come up with a solution than trying to figure it out from your code. Also, have you looked at using data.table and replacing the for loop with some faster method? – Roundsman
The data frame I am working on contains millions of rows and 7 variables, out of which I need to iteratively calculate the values for three variables. In this example there are 4 scenarios, and the code calculates values of C and E for each scenario. The first observation of C and E for each scenario is assigned a specific value. Each subsequent observation of these variables depends on the preceding values of the same variables, i.e. C[i] depends on C[i-1] and E[i-1]. The for loop written to achieve this is very slow when applied to a data frame with over 5M observations. – Kneedeep
This looks like it's just algebra, though complicated. If you write it out, I bet you can come up with something that just uses cumsum and lag. – Merited
It would have been easier if you had written your formula out in words too. – Integrity
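
The E[i-1] term feeds C[i], and E[i] is itself C[i] + B[i] - 50, so within a scenario the recurrence collapses to C[i] = 1.5 * C[i-1] + B[i] + B[i-1] - 45: the 1.5 coefficient means it is not a plain cumsum, and the closed form involves powers of 1.5 that overflow for long series. What can be fixed cheaply is the loop itself: most of its cost is the repeated df$C[i] <- ... indexing, which can copy the column on every iteration, not the arithmetic. Below is a minimal sketch under the question's recurrence and seed values (roll is a hypothetical helper name) that loops over plain vectors instead:

roll <- function(B, scen, C0 = 4, E0 = 6) {
  n <- length(B)
  C <- numeric(n)
  E <- numeric(n)
  for (i in seq_len(n)) {
    if (scen[i] == 0) {
      # seed row at the start of each scenario
      C[i] <- C0
      E[i] <- E0
    } else {
      C[i] <- 0.5 * C[i - 1] + 5 + B[i] + E[i - 1]
      E[i] <- C[i] + B[i] - 50
    }
  }
  list(C = C, E = E)
}

res <- roll(df$B, df$Scenario)
df$C <- res$C
df$E <- res$E

A single pass over numeric vectors like this should be dramatically faster than the original loop even at 5M rows, because it avoids the per-iteration data-frame copying.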

In the tidyverse you can use purrr::accumulate like this:

library(tidyverse)
set.seed(50)

df <- data.frame(SlNo = 1:2000,
                 Scenario = rep(c(1, 2, 3, 4),500),
                 A = round(rnorm(2000, 11, 6)),
                 B = round(rnorm(2000, 15, 4))) %>%
  arrange(Scenario)

df %>%
  nest(data = B) %>%                 # one single-row tibble of B per row
  group_by(Scenario) %>%
  mutate(new = accumulate(data,
                          .init = tibble(C = 4, E = 6),  # seed values
                          ~ tibble(C = (1 - 0.5) * .x$C + 5 + .y$B + .x$E,
                                   E = 0.5 * .x$C + 5 + .x$E + 2 * .y$B - 50)
                          )[-1]      # drop the seed element
         ) %>%
  ungroup() %>%
  unnest_wider(data) %>%
  unnest_wider(new)

#> # A tibble: 2,000 x 6
#>     SlNo Scenario     A     B     C     E
#>    <int>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1        1    14    19  32    1   
#>  2     5        1     1    13  35   -2   
#>  3     9        1    17    20  40.5 10.5 
#>  4    13        1     8     7  42.8 -0.25
#>  5    17        1    10    16  42.1  8.12
#>  6    21        1     9    12  46.2  8.19
#>  7    25        1    14    18  54.3 22.3 
#>  8    29        1    14    15  69.4 34.4 
#>  9    33        1     4    17  91.1 58.1 
#> 10    37        1    13    15 124.  88.7 
#> # ... with 1,990 more rows

Created on 2021-07-05 by the reprex package (v2.0.0)
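
This is tidy, but at millions of rows building one single-row tibble per observation becomes the bottleneck. A lighter sketch of the same idea (fast_scenario is a hypothetical helper name; same recurrence and seed values as above) accumulates a length-2 numeric state per scenario instead:

fast_scenario <- function(B, C0 = 4, E0 = 6) {
  # state s = c(C, E); each step consumes one value of B
  state <- accumulate(B, function(s, b) {
    C <- 0.5 * s[1] + 5 + b + s[2]
    c(C, C + b - 50)
  }, .init = c(C0, E0))[-1]
  m <- do.call(rbind, state)
  tibble(C = m[, 1], E = m[, 2])
}

df %>%
  group_by(Scenario) %>%
  group_modify(~ bind_cols(.x, fast_scenario(.x$B))) %>%
  ungroup()

The output columns are the same; the per-row allocations are far smaller, which should matter at 5M observations.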

Integrity answered 5/7, 2021 at 11:5

If you don't want to transition to data.table or dtplyr, where it can be tricky to work out how to adapt cumsum and lag to the output you need, you can instead run your loop in parallel. Here is an example:

#install.packages("foreach")
#install.packages("doParallel")

# Loading libraries

library(foreach)
library(doParallel)
library(tidyverse)
set.seed(50)

df <- tibble(SlNo = 1:2000,
             Scenario = rep(c(1, 2, 3, 4), 500),
             A = round(rnorm(2000, 11, 6)),
             B = round(rnorm(2000, 15, 4))) %>%
      arrange(Scenario)

# Split the data frame to prepend a starting row to each scenario

df <- df %>% split(f = .$Scenario) %>%
  map_dfr(~ bind_rows(tibble(Scenario = 0), .x))

# The newly added rows get specific starting values for C and E

df <- df %>% mutate(C = if_else(Scenario != 0, 0, 4),
                    E = if_else(Scenario != 0, 0, 6))


# Set up the cluster (leave one core free); worker packages are
# declared in foreach() below, not in makeCluster()
n.cores <- parallel::detectCores() - 1
my.cluster <- parallel::makeCluster(n.cores, type = "PSOCK")
doParallel::registerDoParallel(cl = my.cluster)

# The recurrence is strictly sequential within a scenario, so row-level
# parallelism cannot work (and workers cannot modify df in place).
# Instead, hand each worker one scenario chunk (every chunk starts with
# its Scenario == 0 seed row) and run the loop inside the chunk.
chunks <- split(df, cumsum(df$Scenario == 0))

df <- foreach(
        d = chunks,
        .combine = 'rbind',
        .packages = "dplyr"
) %dopar% {
        for (i in 2:nrow(d)) {
                d$C[i] <- (1 - 0.5) * d$C[i - 1] + 3 + 2 + d$B[i] + d$E[i - 1]
                d$E[i] <- d$C[i] + d$B[i] - 50
        }
        d
}
df
# stop the cluster
parallel::stopCluster(cl = my.cluster)

This should speed up your code significantly. Note, however, that the gains from parallel execution only become evident on larger datasets; on a small dataset the worker setup overhead can actually make it slower.
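
To confirm the parallel version pays off on your data, time both versions before committing to one. A small sketch using base R's system.time (run_sequential and run_parallel are assumed wrapper names for the two approaches):

t_seq <- system.time(res_seq <- run_sequential(df))
t_par <- system.time(res_par <- run_parallel(df))
rbind(sequential = t_seq, parallel = t_par)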

Garton answered 5/7, 2021 at 3:34
