Cumulative sum until maximum reached, then repeat from zero in the next row

Asked 17/3, 2013 at 22:15 Answered 18/3, 2013 at 2:18

I feel like this is a fairly easy question, but for the life of me I can't seem to find the answer. I have a fairly standard dataframe, and what I am trying to do is sum the a column of values until they reach some value (either that exact value or greater than it), at which point it drops a 1 into a new column (labelled keep) and restarts the summing at 0.

I have a column of minutes, the differences between the minutes, a keep column, and a cumulative sum column (the example I am using is much cleaner than the actual full dataset)

 minutes     difference     keep     difference_sum
 1052991158       0          0            0
 1052991338      180         0            180
 1052991518      180         0            360
 1052991698      180         0            540
 1052991878      180         0            720
 1052992058      180         0            900
 1052992238      180         0            1080
 1052992418      180         0            1260
 1052992598      180         0            1440
 1052992778      180         0            1620
 1052992958      180         0            1800

The difference sum column was calculated with the code

caribou.sub$difference_sum<-cumsum(difference)

What I would like to do is run the above code with the condition that, when the summed value reaches either 1470 or any number greater than that it puts a 1 in the keep column and then restarts summing afterwards, and continues running throughout the dataset.

Thanks in advance, and if you need any more information let me know.

Ayden

Communication answered 17/3, 2013 at 22:15 Comment(5)

Does difference_sum reset to 0 when 1470 is reached? A slightly longer example set including when difference_sum crosses the threshold would help. – Lipetsk 17/3, 2013 at 22:19

No, thats what I'm trying to do, the difference sum column is currently calculated with the caribou.sub$difference_sum<-cumsum(difference) code. It just keeps going and going all the way through the dataset. – Communication 17/3, 2013 at 22:20

OK, but after you've crossed the threshold, how do you go about calculating the next threshold? Do you use the surplus over 1470, or start at 0 from the next row? – Lipetsk 17/3, 2013 at 22:22

Oooh, gotcha. The next row would reset at 0 as each point (at least, thats what I would want it to do, there would be surplus but would have to be thrown out). Henrik got it pretty much spot on below. Thanks for helping clear things up. – Communication 17/3, 2013 at 22:32

But I think I will expand it like you asked, in case anyone else stumbles across here. – Communication 17/3, 2013 at 22:33

I think this is best done with a for loop, can't think of a function that could do so out of the box. The following should do what you want (if I understand you correctly).

current.sum <- 0
for (c in 1:nrow(caribou.sub)) {
    current.sum <- current.sum + caribou.sub[c, "difference"]
    carribou.sub[c, "difference_sum"] <- current.sum
    if (current.sum >= 1470) {
        caribou.sub[c, "keep"] <- 1
        current.sum <- 0
    }
}

Feel free to comment if it does not exactly what you want. But as pointed out by alexwhan, your description is not completely clear.

Valentijn answered 17/3, 2013 at 22:22 Comment(0)

Assuming your data.frame is df:

df$difference_sum <- c(0, head(cumsum(df$difference), -1))
# get length of 0's (first keep value gives the actual length)
len <- sum(df$difference_sum %/% 1470 == 0)
df$keep <- (seq_len(nrow(df))-1) %/% len
df <- transform(df, difference_sum = ave(difference, keep, 
          FUN=function(x) c(0, head(cumsum(x), -1))))

#       minutes difference keep difference_sum
# 1  1052991158        180    0              0
# 2  1052991338        180    0            180
# 3  1052991518        180    0            360
# 4  1052991698        180    0            540
# 5  1052991878        180    0            720
# 6  1052992058        180    0            900
# 7  1052992238        180    0           1080
# 8  1052992418        180    0           1260
# 9  1052992598        180    0           1440
# 10 1052992778        180    1              0
# 11 1052992958        180    1            180

Pentastich answered 17/3, 2013 at 22:36 Comment(10)

This is exactly where I was going. @heidelbergslide - this will be markedly faster than the loop – Lipetsk 17/3, 2013 at 22:37

No "repeat from zero" here. It will diverge from the other answer. – Magnetostriction 17/3, 2013 at 22:39

@MatthewLundberg, you mean the op is interested in the right cumsum values as well? I thought it was just to compute keep? – Pentastich 17/3, 2013 at 22:40

In any case, this should do it, incase OP wants the right cumsum values as well. – Pentastich 17/3, 2013 at 22:43

Yeah, what this is a list of GPS location points. I want to select the first point (point B) that is greater than 24.5 hours away from point A, omitting all points before that, and then select the next point 24.5 hours away from point B, omitting all others, then etc etc. – Communication 17/3, 2013 at 22:43

But the dataset is messy, and so points may fall every 3 hours like the example, or once or twice a week, the example here would give me all values greater than 1470. – Communication 17/3, 2013 at 22:45

@Pentastich this looks correct when there is only one "reset to zero", but if you have enough of them, you'll be off by a row. – Magnetostriction 17/3, 2013 at 22:46

@MatthewLundberg, yes, I see the problem. I'll work on a fix. HeidelbergSlide, I've no idea what you're talking about. I'm trying to provide the code for this question that you've posted. – Pentastich 17/3, 2013 at 22:51

I was trying to give it the background on what the dataset is in real life an attempt to help explain it. I'm pretty crappy at explaining these things... – Communication 17/3, 2013 at 22:52

@MatthewLundberg, this should do the trick. HeidelbergSlide, try this out. – Pentastich 17/3, 2013 at 23:7

I still don't understand about when the sum should restart and if it should be zero then. A desired result would help greatly.

Nonetheless, I can't help but think that simply indexing and subtraction would be a straightforward way of doing this. The below code gives the same result as @Henrik's solution.

df$difference_sum <- cumsum(df$difference)
step <- (df$difference_sum %/% 1470) + 1
k <- which(diff(step) > 0) + 1
df$keep <- 0
df$keep[k] <- 1
step[k] <- step[k] - 1
df$difference_sum <- df$difference_sum - c(0, df$difference_sum[k])[step]

Rowel answered 18/3, 2013 at 2:18 Comment(2)

This is really close, but because (as I understand this, and I may be very wrong) you are using the cumsum of the entire difference column the excess from the previous selection is incorporated into the next selection, so it happens one row too soon (the first selected value is at minute 1620 but that leaves 150 minutes that should be ignored but are used for the next selection, so the next selection happens at minute 1440 (because the cumsum says its 150 minutes more than it actually is)). Does that make sense? Thanks for helping out! – Communication 18/3, 2013 at 18:24

Oh, I see. Yes, I think that is correct (meaning that my answer is not quite right). Again, a larger example with desired output would be very helpful, though it sounds like your issue is solved, so may not be worth the bother at this point. – Rowel 18/3, 2013 at 19:22

Recommended topics

Hot tags