I am trying to create a column in a very large data frame (~2.2 million rows) that holds a running count of consecutive 1's within each factor level, resetting to 0 whenever a 0 occurs or a new factor level is reached. Below is some basic data that resembles my own.
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)
I would like the output variable, cum.goodp, to look like this:
cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)
I know there is a lot of material out there on the canonical split-apply-combine approach, which is conceptually intuitive. I tried the following:
k <- transform(df, cum.goodp = goodp*ave(goodp, c(0L, cumsum(diff(goodp != 0)), FUN = seq_along, by = itemcode)))
When I try to run this code it's very very slow. I get that transform is part of the reason why (the 'by' doesn't help either). There are over 70K different values for the itemcode variable, so it should probably be vectorized. Is there a way to vectorize this, using cumsum? If not, any help whatsoever would be truly appreciated. Thanks so much.
transform(df, cum.goodp = ave(goodp, itemcode, FUN = cumsum))
but it's really not clear to me.. – Rely
dt[, cum_goodp := cumsum(goodp), by = "itemcode"]
where dt <- data.table(df)? Your transform(...) call returned an error for me so I'm not sure what the desired output looks like. – Ligialignaloes
transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
– Rely
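For what it's worth, here is a minimal base R sketch of a fully vectorized version that uses only cumsum and indexing, with no per-group function calls. The helper name cumsum_reset is made up for illustration and is not from the thread. It assumes goodp contains only 0/1 with no NAs, and that rows sharing an itemcode are contiguous, as in the sample data. The last line runs the ave() call from the final comment so the two can be compared on the sample.

```r
# Sample data from the question
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)

# Hypothetical helper (not from the thread): cumulative sum of x that restarts
# whenever x is 0 or the group label differs from the previous row.
# Assumes x has no NAs and rows of the same group are contiguous.
cumsum_reset <- function(x, group) {
  n <- length(x)
  if (n == 0L) return(x)
  group <- as.character(group)
  # TRUE at the first row of each run: first row, a group change, or a zero
  start <- c(TRUE, group[-1L] != group[-n]) | x == 0
  cs <- cumsum(x)
  # total accumulated before each run began, repeated across the whole run
  offset <- (cs - x)[start][cumsum(start)]
  cs - offset
}

df$cum.goodp <- cumsum_reset(df$goodp, df$itemcode)

# The grouped version suggested in the comments, for comparison
via_ave <- ave(df$goodp, df$itemcode, cumsum(df$goodp == 0), FUN = cumsum)
```

On the sample data both df$cum.goodp and via_ave match the desired cum.goodp vector from the question. I have not timed either on 2.2 million rows, so treat the speed benefit as something to check with system.time() on your own data.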