number rows by variable, but start over when condition is hit
Asked Answered
T

1

7

I want to number certain combinations of row in a dataframe (which is ordered on ID and on Time)

tc <- textConnection('
id              time       end_yn
abc             10         0
abc             11         0
abc             12         1
abc             13         0
def             10         0
def             15         1
def             16         0
def             17         0
def             18         1
')

test <- read.table(tc, header=TRUE)

The goal is to create a new column ("number") that numbers each row per id from 1 to n until end_yn == 1 is hit. After end_yn == 1, the numbering should start over.

Without taking the end_yn == 1 condition into account the rows can be numbered using:

DT <- data.table(test)
DT[, id := seq_len(.N), by = id]

However the expected outcome should be:

id              time       end_yn   number
abc             10         0        1
abc             11         0        2
abc             12         1        3 
abc             13         0        1 
def             10         0        1
def             15         1        2
def             16         0        1
def             17         0        2
def             18         1        3

How to incorporate the end_yn == 1 condition?

Terramycin answered 19/10, 2012 at 7:57 Comment(0)
G
5

I'm guessing there are different ways to do this, but here's one:

DT[, cEnd := c(0,cumsum(end_yn)[-.N])] # carry the end value forward

DT[, number := seq_len(.N), by = "id,cEnd"] # create your sequence

DT[, cEnd := NULL] # remove the column created above

Setting id as the key for DT might be worth while.

Geist answered 19/10, 2012 at 8:41 Comment(4)
Very smart to carry the end value forward. This is exactly what I need and works much faster than my initial solution. Thanks!Terramycin
What if I would want to give all entries the same number until the end_yn condition is hit? So give number the value 1 until end_yn=1 for the first time, then 2 until end_yn=1 for the second time, then 3 etc. (per coockie_id). seq_len(.N) needs to be replaced I think, but I cannot figure out with whatTerramycin
@MaxvanderHeijden, for that, you could try something like DT[, number := cumsum(end_yn) + 1, by = "id"], but that would start numbering at 2 if end_yn is 1 for the first entry of an id. Try searching the data.table tag on SO, since a similar question may very well have been asked already.Geist
Thank you. Of course I could also use the cEnd variable, since that essentially also makes a unique ID per journey. However not on a per ID basis, but that I don't really need after changing some other commands :)Terramycin

© 2022 - 2024 — McMap. All rights reserved.