Create counter within consecutive runs of values

I wish to create a sequential number within each run of equal values, like a counter of occurrences, which restarts once the value in the current row is different from the previous row.

Please find an example of input and expected output below.

dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$counter <- c(1,1,2,1,2,1,1,2,3,4,1,1)
dataset

#    input counter
# 1      a       1
# 2      b       1
# 3      b       2
# 4      a       1
# 5      a       2
# 6      c       1
# 7      a       1
# 8      a       2
# 9      a       3
# 10     a       4
# 11     b       1
# 12     c       1

My question is very similar to this one: Cumulative sequence of occurrences of values.

Clamatorial answered 15/11, 2013 at 10:25 Comment(0)

You need to use sequence and rle:

> sequence(rle(as.character(dataset$input))$lengths)
 [1] 1 1 2 1 2 1 1 2 3 4 1 1
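
To make the two steps visible, here is a small sketch on the question's input, written as a plain character vector (the variable names `r` and `counter` are only for illustration). `rle` returns a list with `lengths` and `values`, and `sequence` expands each run length `n` into `1..n`. The `as.character` call above matters when the column is a factor, which is what `data.frame` produced by default before R 4.0.

```r
# Question's input as a plain character vector
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

r <- rle(x)          # run-length encoding: a list with $lengths and $values
r$lengths            # 1 2 2 1 4 1 1
r$values             # "a" "b" "a" "c" "a" "b" "c"

# sequence() turns each run length n into 1, 2, ..., n and concatenates them
counter <- sequence(r$lengths)
counter              # 1 1 2 1 2 1 1 2 3 4 1 1
```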
Enterotomy answered 15/11, 2013 at 10:27 Comment(3)
Cheers, that works like a charm! How do you know about the $lengths part? Are there other properties? (I don't see them in the R docs.) - Clamatorial
@Richard, see the "Value" section of the documentation for ?rle. The two values returned (in a list of class "rle") are lengths and values. - Enterotomy
This works nicely with group_by(), too. - Erine

From data.table v1.9.8 (NEWS item 16), you can use rowid together with rleid:

setDT(dataset)[, counter := rowid(rleid(input))]
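
As a rough illustration of why this works, the sketch below applies the two functions to the question's input as a plain vector (`run_id` and `counter` are illustrative names). `rleid` gives every consecutive run its own id, and `rowid` numbers the rows within each id.

```r
library(data.table)

x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# rleid() increments the id each time the value changes from the previous element
run_id <- rleid(x)
run_id               # 1 2 2 3 3 4 5 5 5 5 6 7

# rowid() counts occurrences within each id, i.e. the position inside each run
counter <- rowid(run_id)
counter              # 1 1 2 1 2 1 1 2 3 4 1 1
```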

Timing code:

set.seed(1L)
library(data.table)
DT <- data.table(input=sample(letters, 1e6, TRUE))
DT1 <- copy(DT)

bench::mark(DT[, counter := seq_len(.N), by=rleid(input)], 
    DT1[, counter := rowid(rleid(input))])

Timings:

  expression                                              min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                          <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 DT[, `:=`(counter, seq_len(.N)), by = rleid(input)] 613.8ms 613.8ms      1.63    18.8MB     8.15     1     5      614ms
2 DT1[, `:=`(counter, rowid(rleid(input)))]            60.5ms  71.4ms     12.7     26.4MB    14.5      7     8      553ms

A more efficient and more straightforward version of the function written below is now available in the data.table package as rleid. With it, this is simply:

setDT(dataset)[, counter := seq_len(.N), by=rleid(input)]

See ?rleid for more on usage and examples. Thanks to @Henrik for the suggestion to update this post.
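
For readers without data.table, the same grouping idea can be sketched in base R. This is not how rleid is implemented, only an equivalent for the simple case, and it assumes a non-empty vector with no NA values. The names `run_id` and `counter` are illustrative.

```r
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# TRUE wherever a new run starts: the first element, and every element
# that differs from its predecessor. The cumulative sum is then a run id.
run_id <- cumsum(c(TRUE, x[-1L] != x[-length(x)]))
run_id               # 1 2 2 3 3 4 5 5 5 5 6 7

# Number the positions within each run id
counter <- ave(seq_along(x), run_id, FUN = seq_along)
counter              # 1 1 2 1 2 1 1 2 3 4 1 1
```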


rle is definitely the most convenient way to do it (+1 to @Ananda's answer), but on bigger data you can do better in terms of speed. You can use the duplist and vecseq functions from data.table as follows. Both are internal (not exported), and the comments below report that duplist is missing from later versions:

require(data.table)
arun <- function(y) {
    w = data.table:::duplist(list(y))
    w = c(diff(w), length(y)-tail(w,1L)+1L)
    data.table:::vecseq(rep(1L, length(w)), w, length(y))
}

x <- c("a","b","b","a","a","c","a","a","a","a","b","c")
arun(x)
# [1] 1 1 2 1 2 1 1 2 3 4 1 1

Benchmarking on big data:

set.seed(1)
x <- sample(letters, 1e6, TRUE)
# rle solution
ananda <- function(y) {
    sequence(rle(y)$lengths)
}

require(microbenchmark)
microbenchmark(a1 <- arun(x), a2<-ananda(x), times=100)
Unit: milliseconds
            expr       min        lq    median       uq       max neval
   a1 <- arun(x)  123.2827  132.6777  163.3844  185.439  563.5825   100
 a2 <- ananda(x) 1382.1752 1899.2517 2066.4185 2247.233 3764.0040   100

identical(a1, a2) # [1] TRUE
Contagion answered 15/11, 2013 at 10:48 Comment(5)
@Arun, thanks. The dataset I am working on is somewhat smaller, but this will definitely come in handy in the future! I am sorry, I can only accept one answer! :( - Clamatorial
Hi @Contagion, I think duplist is not there in the latest version of data.table (version 1.9.2). - Forkey
It seems this code no longer works with the most recent version of data.table? Thanks! - Jackscrew
Another method: setDT(dataset)[, cnt := rowid(rleid(input))] - Slub
@Slub much nicer! Feel free to edit it into the answer. - Contagion

The runner package has a dedicated function for this: streak_run. It accepts a vector as input and, in the benchmark below, is considerably faster than the rle solution.

library(microbenchmark)
library(runner)

x      <- sample(letters, 1e6, TRUE)
ananda <- function(y) sequence(rle(y)$lengths)

# Results are assigned to a2 and run so they can be compared afterwards
microbenchmark(
  a2 <- ananda(x), 
  run <- streak_run(x), 
  times = 10
)

#Unit: milliseconds
#                expr     min      lq     mean  median       uq      max neval
#     a2 <- ananda(x) 580.744 718.117 1059.676 944.073 1399.649 1699.293    10
#run <- streak_run(x)  37.682  39.568   42.277  40.591   43.947   52.917    10

identical(a2, run)
#[1] TRUE
Panoptic answered 18/9, 2018 at 10:31 Comment(5)
Is this package still available? I can't seem to download it. - Genovese
Yes, it is. Use install.packages("runner"). Which system are you using? I just checked on Linux and macOS and it works. - Panoptic
I have R 3.4.1 on Windows. When I try to install, it says package ‘runner’ is not available (for R version 3.4.1). - Genovese
Try updating to the newest R version, or install from GitHub with devtools::install_github("gogonzo/runner"). - Panoptic
That must be the problem. It's a work computer and I can't get devtools. I'm waiting for my R to get updated, and then hopefully I can get the package. - Genovese
