vectorize cumsum by factor in R

I am trying to create a column in a very large data frame (~ 2.2 million rows) that calculates the cumulative sum of 1's for each factor level, and resets when a new factor level is reached. Below is some basic data that resembles my own.

itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)

I would like the output variable, cum.goodp, to look like this:

cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)

I know there is a lot out there on the canonical split-apply-combine approach, which is conceptually intuitive, but I tried using the following:

k <- transform(df, cum.goodp = goodp*ave(goodp, c(0L, cumsum(diff(goodp != 0)), FUN = seq_along, by = itemcode)))

When I try to run this code it's very, very slow. I understand that transform is part of the reason (the by argument doesn't help either). There are over 70K different values for the itemcode variable, so the operation should probably be vectorized. Is there a way to vectorize this using cumsum? If not, any help whatsoever would be truly appreciated. Thanks so much.

Zyrian asked 9/3, 2016 at 17:27

Comments (8):

Can you show the expected output, please? – Rely
@akrun it is an R question. – Zyrian
Perhaps you are looking for transform(df, cum.goodp = ave(goodp, itemcode, FUN = cumsum)), but it's really not clear to me. – Rely
What about dt[, cum_goodp := cumsum(goodp), by = "itemcode"], where dt <- data.table(df)? Your transform(...) call returned an error for me, so I'm not sure what the desired output looks like. – Ligialignaloes
@docendodiscimus I have just edited to include the desired output. The code I listed worked before; I will look into it now. – Zyrian
@Zyrian in that case you could use the code from my comment above. It produces the expected output. – Rely
@docendodiscimus Thanks. There was a slight problem with the code/output I put in above: I didn't allow it to account for resetting to zero, and the code you provided doesn't allow for it either. Do you know how to reset it? I edited the output to reflect this. – Zyrian
@jvalenti, then you can use transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum)) – Rely

With the modified example input/output you could use the following base R approach (among others):

transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
#   itemcode goodp cum.goodp cum.goodpX
#1        a1     0         0          0
#2        a1     1         1          1
#3        a1     1         2          2
#4        a1     0         0          0
#5        a1     1         1          1
#6        a2     1         1          1
#7        a2     1         2          2
#8        a3     0         0          0
#9        a4     0         0          0
#10       a4     1         1          1
#11       a5     1         1          1
#12       a6     1         1          1
#13       a6     1         2          2
#14       a6     0         0          0
#15       a6     1         1          1

Note: I added column cum.goodp to the input df and created a new column cum.goodpX so you can easily compare the two.
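
The reset works because cumsum(goodp == 0) starts a new group at every zero, so the cumulative sum restarts within each (itemcode, zero-run) combination. On the sample data that helper grouping vector is:

cumsum(goodp == 0)
#  [1] 1 1 1 2 2 2 2 3 4 4 4 4 4 5 5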

But of course you can use other, package-based approaches as well, either what @MartinMorgan suggested or, for example, dplyr or data.table, to name just two options. Those may be a lot faster than base R approaches for large data sets.

Here's how it would be done in dplyr:

library(dplyr)
df %>% 
   group_by(itemcode, grp = cumsum(goodp == 0)) %>% 
   mutate(cum.goodpX = cumsum(goodp))
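
(Note that the result keeps the helper grp column and stays grouped; you may want to follow up with ungroup() and drop grp.)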

A data.table option was already provided in the comments to your question.
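
For completeness, a data.table version that also handles the reset could be written along these lines (a sketch extending that comment's code with the same cumsum(goodp == 0) grouping; not benchmarked here):

library(data.table)
dt <- as.data.table(df)
dt[, cum.goodpX := cumsum(goodp), by = .(itemcode, grp = cumsum(goodp == 0))]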

Rely answered 9/3, 2016 at 18:31. Comments (0)

A base R approach is to calculate cumsum over the whole vector and capture the geometry of the sub-lists using run-length encoding. Figure out the start of each group, and create new group indices:

start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)

Summarize these as a run-length encoding, and calculate the cumulative sum over the whole vector:

r <- rle(f)
x <- cumsum(goodp)

Then use the geometry to get the offset that each embedded sum needs to be corrected by:

offset <- c(0, x[cumsum(r$lengths)])

and calculate the updated value:

x - rep(offset[-length(offset)], r$lengths)
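
On the sample data from the question the intermediate values work out like this (results shown as comments):

start     # T F F T F T F T T F T T F T F
f         # 1 1 1 2 2 3 3 4 5 5 6 7 7 8 8
r$lengths # 3 2 2 1 2 1 2 2
x         # 0 1 2 2 3 4 5 5 5 6 7 8 9 9 10
rep(offset[-length(offset)], r$lengths) # 0 0 0 2 2 3 3 5 5 5 6 7 7 9 9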

Here's a function that wraps these steps:

cumsumByGroup <- function(x, f) {
    ## TRUE at the start of each run: a new factor level in f or a zero in x
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    ## run lengths of the consecutive group indices
    r <- rle(cumsum(start))
    ## cumulative sum over the whole vector
    x <- cumsum(x)
    ## cumulative total at the end of each previous group
    offset <- c(0, x[cumsum(r$lengths)])
    ## subtract each group's starting offset
    x - rep(offset[-length(offset)], r$lengths)
}

Here's the result applied to the sample data:

> cumsumByGroup(goodp, itemcode)
 [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1

and its performance:

> n <- 1 + rpois(1000000, 1)
> goodp <- sample(c(0, 1), sum(n), TRUE)
> itemcode <- rep(seq_along(n), n)
> system.time(cumsumByGroup(goodp, itemcode))
   user  system elapsed 
   0.55    0.00    0.55 

The dplyr solution takes about 70s.
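
For reference, the dplyr timing can be reproduced along these lines (a sketch assuming the grouped-mutate approach from the other answer; exact numbers will vary by machine):

library(dplyr)
big <- data.frame(itemcode, goodp)  # the simulated benchmark vectors from above
system.time(
    big %>%
        group_by(itemcode, grp = cumsum(goodp == 0)) %>%
        mutate(cum.goodpX = cumsum(goodp))
)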

@alexis_laz's solution (see the comment below) is both elegant and 2 times faster than mine:

cumsumByGroup1 <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    cs <- cumsum(x)
    ## subtract the running maximum of the cumulative total at group starts
    cs - cummax((cs - x) * start)
}
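
A quick sanity check that the two functions agree on the question's sample vectors (re-created here, since goodp and itemcode were overwritten by the benchmark above):

itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
identical(cumsumByGroup(goodp, itemcode), cumsumByGroup1(goodp, itemcode))
# [1] TRUE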
Dissidence answered 9/3, 2016 at 17:50. Comments (1):

Unless there is a caveat with all 0's and 1's, a similar approach could be: cs = cumsum(x); cs - cummax((cs - x) * start) – Glaucescent