Fast way to calculate values in cells based on values in previous rows

I am looking for a fast way to do row-wise calculations where the value of a cell depends on values in previous rows of different columns, preferring vectorization over looping through individual rows (this is a follow-up to an earlier question).

Say I have the following dataset dt and a constant (loaded libraries are data.table, dplyr and purrr):

dt <- structure(list(var1 = c(-92186.7470607738, -19163.5035325072, 
-18178.8396858014, -9844.67882723287, -16494.7802822178, -17088.0576319257
), var2 = c(-3.12, NA, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, 
NA, NA)), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L))

constant <- 608383


print(dt)
         var1  var2 var3
1: -92186.747 -3.12    1
2: -19163.504    NA   NA
3: -18178.840    NA   NA
4:  -9844.679    NA   NA
5: -16494.780    NA   NA
6: -17088.058    NA   NA

The fast, vectorized equivalent of

for(i in 2:nrow(dt)){
  prev <- dt[(i-1),]
  dt[i, var2 := prev$var2 - var1/constant]
}

would be

dt %>%
  mutate(var2 = accumulate(var1[-1], .init = var2[1], ~ .x - .y /constant))
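Because this single-variable recurrence only subtracts var1/constant at every step, it can also be unrolled into a cumulative sum, which is vectorized in the strict sense. The sketch below is a base R illustration of that observation; the names var2_first, var2_loop and var2_cumsum are hypothetical, and the trick applies only to this one-variable case, not to recurrences where several columns feed into each other.

```r
# Data and constant copied from the question.
var1 <- c(-92186.7470607738, -19163.5035325072, -18178.8396858014,
          -9844.67882723287, -16494.7802822178, -17088.0576319257)
constant <- 608383
var2_first <- -3.12  # starting value of var2 (row 1)

# Reference: the row-by-row recurrence var2[i] = var2[i-1] - var1[i] / constant.
var2_loop <- numeric(length(var1))
var2_loop[1] <- var2_first
for (i in 2:length(var1)) {
  var2_loop[i] <- var2_loop[i - 1] - var1[i] / constant
}

# Unrolled form: var2[i] = var2[1] - sum(var1[2..i]) / constant.
var2_cumsum <- var2_first - c(0, cumsum(var1[-1])) / constant

# Compare the two results up to floating-point tolerance.
all.equal(var2_loop, var2_cumsum)
```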

But what if I want to include more columns in the calculation? In this example that is var3, but the real dataset has more than 10 columns, and I would like the solution to take that into account. Example for loop (desired output):

for(i in 2:nrow(dt)){
  prev <- dt[(i-1),]
  dt[i, var2 := prev$var2 + prev$var3 - var1/constant]
  dt[i, var3 := prev$var1 + 0.1 * var2/constant]
}
print(dt)

         var1          var2      var3
1: -92186.747 -3.120000e+00      1.00
2: -19163.504 -2.088501e+00 -92186.75
3: -18178.840 -9.218881e+04 -19163.52
4:  -9844.679 -1.113523e+05 -18178.86
5: -16494.780 -1.295311e+05  -9844.70
6: -17088.058 -1.393758e+05 -16494.80
Eliathas answered 13/7, 2021 at 13:38 Comment(1)
Although I have included a base R approach that can be extended to any number of variables, you may still want to refer to this answer. - Pushup

My friend's strategy is excellent, but since base R's Reduce() cannot take two input vectors, I used this trick:

  • Rebuild the current value of var1 in the data.frame() that is returned at each step inside Reduce().
  • Where the current value of var1 is needed, use .y.
  • Where the previous value is needed, use .x$var1 instead.
  • Where the freshly computed value of a variable is needed in the same step (here var2 inside var3), repeat its formula inline.
  • The rest should be self-explanatory.
  • accumulate = TRUE is needed because you want all intermediate values.
  • Since the output is a list of data frames, it is row-bound with do.call(rbind, ...).

In base R you can do

do.call(rbind, Reduce(function(.x, .y) {data.frame(var1 = .y,
                            var2 = .x$var2 + .x$var3 -.y/constant,
                            var3 = .x$var1 + 0.1 * (.x$var2 + .x$var3 -.y/constant)/constant)}, 
       dt$var1[-1], 
       init =  data.frame(var1 = dt$var1[1], var2 = -3.12, var3 = 1),
       accumulate = TRUE))

        var1          var2      var3
1 -92186.747 -3.120000e+00      1.00
2 -19163.504 -2.088501e+00 -92186.75
3 -18178.840 -9.218881e+04 -19163.52
4  -9844.679 -1.113523e+05 -18178.86
5 -16494.780 -1.295311e+05  -9844.70
6 -17088.058 -1.393758e+05 -16494.80

which can be emulated in tidyverse/purrr as follows

library(purrr)
accumulate(dt$var1[-1], .init = data.frame(var1 = dt$var1[1], var2 = -3.12, var3 = 1), 
           ~ data.frame(var1 = .y,
                        var2 = .x$var2 + .x$var3 -(.y/constant),
                        var3 = .x$var1 + 0.1 * (.x$var2 + .x$var3 -(.y/constant))/constant)) %>% map_df(~.x)

        var1          var2      var3
1 -92186.747 -3.120000e+00      1.00
2 -19163.504 -2.088501e+00 -92186.75
3 -18178.840 -9.218881e+04 -19163.52
4  -9844.679 -1.113523e+05 -18178.86
5 -16494.780 -1.295311e+05  -9844.70
6 -17088.058 -1.393758e+05 -16494.80
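A variation on the base R Reduce() approach above, sketched under the assumption that building a data.frame() at every step is the costly part: the state is carried as a plain named numeric vector and only converted to a data frame once at the end. The names step, states and res are hypothetical.

```r
# Data and constant copied from the question.
var1 <- c(-92186.7470607738, -19163.5035325072, -18178.8396858014,
          -9844.67882723287, -16494.7802822178, -17088.0576319257)
constant <- 608383

# One step of the recurrence: prev is the previous row's state, cur is the
# current value of var1. The new var2 is computed once and reused for var3.
step <- function(prev, cur) {
  new_var2 <- prev[["var2"]] + prev[["var3"]] - cur / constant
  c(var1 = cur,
    var2 = new_var2,
    var3 = prev[["var1"]] + 0.1 * new_var2 / constant)
}

# accumulate = TRUE keeps every intermediate state (one per row).
states <- Reduce(step, var1[-1],
                 init = c(var1 = var1[1], var2 = -3.12, var3 = 1),
                 accumulate = TRUE)

# Bind the named vectors into a matrix, then convert once.
res <- as.data.frame(do.call(rbind, states))
res
```

Adding another column means adding one more named element to init and to the vector returned by step.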
Pushup answered 13/7, 2021 at 14:23 Comment(4)
Awesome base R solution! - Coset
Amazing! Thanks a lot (again) for your help :) - Eliathas
Can a method like this be followed with two defined input vectors? For example, a fully filled column var4? - Eliathas
@maarvd, yes, why not? See this answer proposed on another question. - Pushup

Another option using Rcpp:

library(Rcpp)
cppFunction('List func(NumericVector var1, double c, double v2, double v3) {
    int n = var1.size();
    NumericVector var2(n);
    NumericVector var3(n);
    
    var2[0] = v2;
    var3[0] = v3;
    for (int i = 1; i < n; i++) {
        var2[i] = var2[i-1] + var3[i-1] - var1[i]/c;
        var3[i] = var1[i-1] + 0.1 * var2[i]/c;
    }
    
    List ret;
    ret["var2"] = var2;
    ret["var3"] = var3;
    return ret;
}')
dt[, c("var2", "var3") := func(var1, constant, var2[1L], var3[1L])]
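For readers without a C++ toolchain, the same loop can be sketched in plain R over preallocated numeric vectors. This is only an illustration of the recurrence used in the Rcpp function, not a benchmarked replacement; it is likely slower than the compiled version on large inputs.

```r
# Data and constant copied from the question.
var1 <- c(-92186.7470607738, -19163.5035325072, -18178.8396858014,
          -9844.67882723287, -16494.7802822178, -17088.0576319257)
constant <- 608383

n <- length(var1)
var2 <- numeric(n)  # preallocate instead of growing inside the loop
var3 <- numeric(n)
var2[1] <- -3.12
var3[1] <- 1

# seq_len(n)[-1] is 2..n, and is empty (no iterations) when n is 1.
for (i in seq_len(n)[-1]) {
  var2[i] <- var2[i - 1] + var3[i - 1] - var1[i] / constant
  var3[i] <- var1[i - 1] + 0.1 * var2[i] / constant
}

data.frame(var1, var2, var3)
```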
Triserial answered 13/7, 2021 at 22:53 Comment(1)
Thank you for the interesting solution! I was not aware of this package; integrating R with C++ seems very useful when fast code is needed. - Eliathas

Here is another solution in base R you could use:

do.call(rbind, Reduce(function(x, y) {
  data.frame(var1 = dt$var1[y], 
             var2 = x[["var2"]] + x[["var3"]] - (dt$var1[y] / constant),
             var3 = dt$var1[y - 1] + 0.1 * ((x[["var2"]] + x[["var3"]] - (dt$var1[y] / constant)) / constant))
}, init = data.frame(var1 = dt$var1[1], var2 = -3.12, var3 = 1), 2:nrow(dt), accumulate = TRUE))

        var1          var2      var3
1 -92186.747 -3.120000e+00      1.00
2 -19163.504 -2.088501e+00 -92186.75
3 -18178.840 -9.218881e+04 -19163.52
4  -9844.679 -1.113523e+05 -18178.86
5 -16494.780 -1.295311e+05  -9844.70
6 -17088.058 -1.393758e+05 -16494.80

I think you can also use the following purrr solution. Here are some notes on how it works:

  • In this question we need to fill two vectors of length 6, whose first values are already specified through the .init argument. Unlike in the previous question, we are populating two variables, so .init is a tibble and each step returns a tibble.
  • That leaves 5 values to populate in each. Because we supplied .init, the first and second input vectors must be of equal length; without .init, the second vector should be one element shorter than the first.
  • Since we need both the current and the previous value of var1, I pass it twice, omitting the first and the last value respectively. For example, the prev$var1 needed for var3 is taken from the second input, var1[-n()].
  • ..1 is always the accumulated/previous value. Since it holds both var2 and var3 here, we subset it with $ to specify which one we are referring to.
  • ..2 is the next value from the first vector (.x in general, var1[-1] here) and ..3 is the next value from the second vector (.y in general, var1[-n()] here).

If these notes are not sufficient, I would be glad to explain more.

library(purrr)
library(dplyr)  # bind_cols(), mutate(), select(), n()
library(tidyr)  # unnest()

dt[,1] %>%
  bind_cols(dt %>%
              mutate(output = accumulate2(var1[-1], var1[-n()], .init = tibble(var2 = -3.12, var3 = 1), 
                                          ~ tibble(var2 = (..1$var2 + ..1$var3 - (..2/constant)),
                                                   var3 = ..3 + 0.1 * ((..1$var2 + ..1$var3 - (..2/constant)) /constant)))) %>%
              select(output) %>%
              unnest(output))

         var1          var2      var3
1: -92186.747 -3.120000e+00      1.00
2: -19163.504 -2.088501e+00 -92186.75
3: -18178.840 -9.218881e+04 -19163.52
4:  -9844.679 -1.113523e+05 -18178.86
5: -16494.780 -1.295311e+05  -9844.70
6: -17088.058 -1.393758e+05 -16494.80
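The argument positions described in the notes above can be checked on a toy input. This is a minimal sketch with made-up vectors x and y (not from the question), showing that ..1 is the running value, ..2 comes from the first input and ..3 from the second.

```r
library(purrr)

x <- c(1, 2, 3)      # first input, seen by the step function as ..2
y <- c(10, 20, 30)   # second input, seen by the step function as ..3

# With .init supplied, x and y have equal length and the result has
# length(x) + 1 elements, starting with .init itself.
out_formula <- accumulate2(x, y, ~ ..1 + ..2 * ..3, .init = 0)

# The same step written as an ordinary three-argument function.
out_function <- accumulate2(x, y, function(acc, xi, yi) acc + xi * yi, .init = 0)

unlist(out_formula)   # 0 10 50 140
```

Each step adds the product of the paired elements to the running total: 0, then 0 + 1*10, then 10 + 2*20, then 50 + 3*30.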
Coset answered 13/7, 2021 at 14:30 Comment(1)
Thank you for the great answer (again) and the explanation. The rationale is clear, but to fully understand it I will play around with some different example data. - Eliathas