Conditional cumsum with reset

I have a data frame that is already sorted as needed, and now I would like to "slice" it into groups.

Each group should have a maximum cumulative value of 10. When adding the next value would push the cumulative sum above 10, the sum should reset and start over with that value.

library(dplyr)
id <- sample(1:15)
order <- 1:15
value  <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
df  <- data.frame(id, order, value)
df

This is the output I'm looking for (I built it "manually"):

cumsum_10  <- c(4, 9, 7, 10, 8, 9, 2, 7, 10, 6, 8, 6, 9, 10, 4)
group_10 <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7)
df1  <- data.frame(df, cumsum_10, group_10)
df1

So I have two problems:

  1. How to create a cumulative variable that resets every time it passes an upper limit (10 in this case)
  2. How to number each resulting group

For the first part I tried some combinations of group_by and cumsum, with no luck:

df1 <- df %>% group_by(cumsum(c(FALSE, value < 10)))

I would prefer a pipe (%>%) solution instead of a for loop

Thanks

Sigma answered 30/12, 2015 at 14:38 Comment(4)
The second part will be trivial (group_by) if you can figure out the first. I think it's going to be hard to do the first without a for loop, unless someone gets extremely clever. Do you want pipes for efficiency, elegance, ...? If there were a for loop hidden in a helper function, would that be OK? - Kelpie
Check #29054959. - Attire
group_by(bin(value, 10)) %>% mutate(cumsum(value)), using the bin function from the link. - Attire
Possible duplicate of #15467380. - Scrim

I don't think this is easily vectorizable; at least I don't know how.

You can do it by hand via:

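my_cumsum <- function(x){
  # grp[i] is the group of x[i]; x is overwritten with the running sum
  grp = integer(length(x))
  grp[1] = 1
  # seq_along(x)[-1] is empty for a length-1 input, unlike 2:length(x)
  for(i in seq_along(x)[-1]){
    if(x[i-1] + x[i] <= 10){
      # still within the limit: stay in the group and keep accumulating
      grp[i] = grp[i-1]
      x[i] = x[i-1] + x[i]
    } else {
      # limit exceeded: start a new group, x[i] restarts the sum
      grp[i] = grp[i-1] + 1
    }
  }
  data.frame(grp, x)
}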

For your data this gives:

> my_cumsum(df$value)
   grp  x
1    1  4
2    1  9
3    2  7
4    2 10
5    3  8
6    3  9
7    4  2
8    4  7
9    4 10
10   5  6
11   5  8
12   6  6
13   6  9
14   6 10
15   7  4

Also for my "counter-example" this gives:

> my_cumsum(c(10,6,4))
  grp  x
1   1 10
2   2  6
3   2 10

As @Khashaa pointed out, this can be implemented more efficiently via Rcpp. He linked to the answer "How to speed up or vectorize a for loop?", which I find very useful.

Interventionist answered 30/12, 2015 at 15:15 Comment(1)
It is vectorisable using either Reduce or purrr::accumulate. See my answer. - Nosedive

For iterative calculations like this, we can use the purrr package from the tidyverse. Its accumulate function is meant for exactly this kind of situation:

library(dplyr)
library(purrr)

df %>% mutate(cumsum_10 = accumulate(value, ~ifelse(.x + .y <= 10, .x + .y, .y)),
              group_10 = cumsum(value == cumsum_10))

   id order value cumsum_10 group_10
1   8     1     4         4        1
2  13     2     5         9        1
3   7     3     7         7        2
4   1     4     3        10        2
5   4     5     8         8        3
6  10     6     1         9        3
7  12     7     2         2        4
8   2     8     5         7        4
9  15     9     3        10        4
10 11    10     6         6        5
11 14    11     2         8        5
12  3    12     6         6        6
13  5    13     3         9        6
14  9    14     1        10        6
15  6    15     4         4        7

This can also be done with base R's Reduce (note the upper-case R) by setting its argument accumulate = TRUE, which returns all intermediate values instead of only the last one. The \(x, y) lambda shorthand used here requires R 4.1 or later.

library(dplyr)
df %>% 
  mutate(cumsum_10 = Reduce(\(x, y) if (x + y <= 10) x + y else y,
                            x = value,
                            accumulate = TRUE),
         group_10 = cumsum(value == cumsum_10))

#>    id order value cumsum_10 group_10
#> 1   2     1     4         4        1
#> 2  15     2     5         9        1
#> 3   1     3     7         7        2
#> 4   7     4     3        10        2
#> 5   9     5     8         8        3
#> 6  12     6     1         9        3
#> 7   4     7     2         2        4
#> 8   6     8     5         7        4
#> 9  11     9     3        10        4
#> 10  5    10     6         6        5
#> 11  3    11     2         8        5
#> 12 13    12     6         6        6
#> 13 14    13     3         9        6
#> 14  8    14     1        10        6
#> 15 10    15     4         4        7
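
As a rough sketch, the Reduce call above could be wrapped in a reusable helper with a configurable limit. The names cumsum_reset and limit are made up for this illustration and do not come from any package. One assumption to keep in mind: cumsum(value == cumsum_10) marks group starts only when every value is strictly positive, because a row that continues a running sum of 0 would also satisfy the equality.

```r
# Hypothetical helper (not from any package): running sum that restarts
# whenever adding the next value would push it above `limit`.
cumsum_reset <- function(x, limit = 10) {
  Reduce(function(acc, val) if (acc + val <= limit) acc + val else val,
         x, accumulate = TRUE)
}

value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)

cumsum_10 <- cumsum_reset(value)
# A row starts a new group exactly when its running sum equals its own value.
# Assumption: all values are strictly positive.
group_10 <- cumsum(value == cumsum_10)
```

Inside a pipe this would be used as mutate(cumsum_10 = cumsum_reset(value), group_10 = cumsum(value == cumsum_10)).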
Nosedive answered 26/3, 2021 at 9:2 Comment(0)

You could define your own function and then use it inside dplyr's mutate statement as follows:

df %>% group_by() %>%
  mutate(
    cumsum_10 = cumsum_with_reset(value, 10),
    group_10 = cumsum_with_reset_group(value, 10)
  ) %>% 
  ungroup()

The cumsum_with_reset() function takes a column and a threshold; the running sum restarts whenever it would exceed the threshold. cumsum_with_reset_group() is similar but returns the index of the group each row belongs to. Definitions are as follows:

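# group rows based on cumsum with reset
cumsum_with_reset_group <- function(x, threshold) {
  running <- 0                  # running sum; named to avoid masking base::cumsum
  group <- 1
  result <- numeric(length(x))  # preallocate instead of growing with c()

  for (i in seq_along(x)) {     # seq_along() also handles empty input
    running <- running + x[i]

    if (running > threshold) {
      # threshold exceeded: open a new group and restart the sum at x[i]
      group <- group + 1
      running <- x[i]
    }

    result[i] <- group
  }

  return (result)
}

# cumsum with reset
cumsum_with_reset <- function(x, threshold) {
  running <- 0                  # running sum; named to avoid masking base::cumsum
  result <- numeric(length(x))  # preallocate instead of growing with c()

  for (i in seq_along(x)) {     # seq_along() also handles empty input
    running <- running + x[i]

    if (running > threshold) {
      # threshold exceeded: restart the sum at x[i]
      running <- x[i]
    }

    result[i] <- running
  }

  return (result)
}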
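
Since the two functions run the same loop, one possible variation is to compute both columns in a single pass. This is only a sketch; the name cumsum_with_reset_both and the returned column names are hypothetical choices for this example.

```r
# Hypothetical single-pass variant: returns the running sum and the group
# index together, so the loop over x runs only once.
cumsum_with_reset_both <- function(x, threshold) {
  running <- 0
  group <- 1L
  sums <- numeric(length(x))
  groups <- integer(length(x))

  for (i in seq_along(x)) {
    running <- running + x[i]
    if (running > threshold) {
      # threshold exceeded: open a new group and restart the sum at x[i]
      group <- group + 1L
      running <- x[i]
    }
    sums[i] <- running
    groups[i] <- group
  }

  data.frame(cumsum_10 = sums, group_10 = groups)
}

value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
res <- cumsum_with_reset_both(value, 10)
```

The result can be attached to the original data with cbind(df, cumsum_with_reset_both(df$value, 10)).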
Hut answered 9/2, 2017 at 9:57 Comment(0)

We can take advantage of the function cumsumbinning, from the package MESS, that performs this task:

library(dplyr)
library(MESS)
df %>%
  group_by(group_10 = cumsumbinning(value, 10)) %>%
  mutate(cumsum_10 = cumsum(value))

Output

# A tibble: 15 x 5
# Groups:   group_10 [7]
      id order value group_10 cumsum_10
   <int> <int> <dbl>    <int>     <dbl>
 1     6     1     4        1         4
 2    10     2     5        1         9
 3     1     3     7        2         7
 4     5     4     3        2        10
 5     3     5     8        3         8
 6     9     6     1        3         9
 7    14     7     2        4         2
 8    11     8     5        4         7
 9    15     9     3        4        10
10     8    10     6        5         6
11    12    11     2        5         8
12     2    12     6        6         6
13     4    13     3        6         9
14     7    14     1        6        10
15    13    15     4        7         4
Fulmer answered 21/8, 2018 at 7:14 Comment(0)

The function below uses recursion to construct a vector with the lengths of each group. It is faster than a loop for small data vectors (length less than about a hundred values), but slower for longer ones. It takes three arguments:

1) vec: A vector of values that we want to group.

2) i: The index of the starting position in vec.

3) glv: A vector of group lengths. This is the return value, but we need to initialize it and pass it along through each recursion.

# Group a vector based on consecutive values with a cumulative sum <= 10
gf = function(vec, i, glv) {

  ## Break out of the recursion when we get to the last group
  if (sum(vec[i:length(vec)]) <= 10) {
    glv = c(glv, length(i:length(vec)))
    return(glv)
  }

  ## Keep recursion going if there are at least two groups left
  # Calculate length of current group
  gl = sum(cumsum(vec[i:length(vec)]) <= 10)

  # Append to previous group lengths
  glv.append = c(glv, gl)

  # Call function recursively 
  gf(vec, i + gl, glv.append)
}

Run the function to return a vector of group lengths:

group_vec = gf(df$value, 1, numeric(0))
group_vec
[1] 2 2 2 3 2 3 1

To add a column of group numbers to df, repeat each group's index by its length with rep:

df$group10 = rep(1:length(group_vec), group_vec)

In its current form the function will only work on vectors that don't have any values greater than 10, and the grouping by sums <= 10 is hard-coded. The function can of course be generalized to deal with these limitations.
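
One way such a generalization might look is sketched below. The name gf_general and its limit argument are hypothetical, and the sketch assumes non-negative values so that the cumulative sum never decreases. A value above the limit is placed in a group of its own.

```r
# Hypothetical generalization of gf(): the limit is a parameter, and the
# starting index and accumulator have defaults so callers pass only the data.
# Assumption: values are non-negative, so cumsum() is non-decreasing.
gf_general <- function(vec, limit, i = 1, glv = integer(0)) {
  # Stop once every element has been assigned to a group
  if (i > length(vec)) return(glv)

  rest <- vec[i:length(vec)]

  # Number of leading elements whose cumulative sum stays within the limit;
  # max(1L, ...) puts a single value above the limit into its own group
  gl <- max(1L, sum(cumsum(rest) <= limit))

  gf_general(vec, limit, i + gl, c(glv, gl))
}

value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
group_lengths <- gf_general(value, 10)
oversized <- gf_general(c(12, 3, 4), 10)
```

Because every call advances by at least one element, the recursion terminates, but its depth still grows with the number of groups, so very long vectors may hit R's nesting limit.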

The function can be sped up somewhat by computing cumulative sums that look ahead only a fixed number of values rather than over the whole remaining vector. For example, if the values are always positive integers, you only need to look ten values ahead, since ten values of at least 1 already sum to 10 or more. This too can be generalized to any target value. Even with this modification, the function is still slower than a loop for vectors with more than about a hundred values.

I haven't worked with recursive functions in R before and would be interested in any comments and suggestions on whether recursion makes sense for this type of problem and whether it can be improved, especially execution speed.

Epperson answered 31/12, 2015 at 4:7 Comment(0)
