Efficiently find the first 1 of the last sequence of 1s

11

10

I have the following vectors with 0s and 1s:

test1 <- c(rep(0,20),rep(1,5),rep(0,10),rep(1,15)) 

test1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
                                                                          ^
test2 <- c(rep(0,8),rep(1,4),rep(0,5),rep(1,5),rep(0,6),rep(1,10),rep(0,2)) 

test2
[1] 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0
                                                            ^

I need to find the index of the first 1 in the last sequence of 1s (indicated by ^ in the output above). I have a solution (below) that doesn't perform well; how could I improve its performance?

For test1 and test2, the expected output is 36 and 29, respectively.

Here is a sub-optimal solution:

temp1 <- cumsum(test1)
which(temp1==max(temp1[duplicated(temp1)&temp1!=max(temp1)]+1))[1]
[1] 36

temp2 <- cumsum(test2)
which(temp2==max(temp2[duplicated(temp2)&temp2!=max(temp2)]+1))[1]
[1] 29
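
For reference, an annotated reading of this approach (the comments are my addition):

temp1 <- cumsum(test1)
# cumsum() stays flat across runs of 0s and strictly increases across runs of 1s,
# so duplicated(temp1) flags the repeated values on each 0-plateau.
# Excluding max(temp1) drops a possible trailing 0-plateau, so the largest remaining
# duplicated value belongs to the plateau just before the last run of 1s; adding 1
# gives the cumsum value of the first 1 in that run, and which(...)[1] maps it back
# to an index.
which(temp1 == max(temp1[duplicated(temp1) & temp1 != max(temp1)] + 1))[1]
# [1] 36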

Note: The length of the actual vectors is ~10k.

Liddie answered 29/6, 2023 at 14:46 Comment(0)
19

The data.table package has a non-exported function called uniqlist(). Use three colons (:::) to access non-exported functions: data.table:::uniqlist(list(x)). This function determines when the columns of a data.frame change value and returns the indices of those changes.

data.table:::uniqlist(list(test1))
# [1]  1 21 26 36

@Arun talks about uniqlist here: https://mcmap.net/q/861298/-determine-when-columns-of-a-data-frame-change-value-and-return-indices-of-the-change

Then I use the y[length(y)] idiom to find the last item in a vector, and base ifelse() to check whether the element at the last index is a 1; otherwise, the second-to-last index must point at a 1.

fx <- function(x) {
    y <- data.table:::uniqlist(list(x))  # start index of each run
    # if the last run consists of 1s, return its start;
    # otherwise the second-to-last run is the last run of 1s
    ifelse(x[y[length(y)]] == 1, y[length(y)], y[length(y) - 1])
}
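
Applied to the question's vectors, this should give (output added by me, not shown in the original answer):

fx(test1)
# [1] 36
fx(test2)
# [1] 29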
Goatsucker answered 29/6, 2023 at 23:12 Comment(1)
uniqlist is so efficient, far beyond my expectation, +1! – Harewood
18

Using rle:

r <- rle(test1)
ix <- max(which(r$values == 1))   # position of the last run of 1s among the runs
sum(r$lengths[ 1:(ix - 1) ]) + 1  # total length of the preceding runs, plus 1
# [1] 36

r <- rle(test2)
ix <- max(which(r$values == 1))
sum(r$lengths[ 1:(ix - 1) ]) + 1
# [1] 29
Blowhole answered 29/6, 2023 at 15:0 Comment(1)
I'm rle's biggest fan :-). So much so that I wrote seqle, a function which finds sequence lengths where you can specify the increment. – Workman
18

Another way with which + diff.

idx <- which(test1 == 1)                   # positions of all 1s
idx[tail(which(c(0, diff(idx)) != 1), 1)]  # last position where a new run of 1s begins
#[1] 36
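
Wrapped as a reusable function, with a guard for vectors that contain no 1s at all (my sketch, not part of the original answer; it keeps the c(0, ...) padding shown above):

last_one_start <- function(x) {
  idx <- which(x == 1)
  if (length(idx) == 0) return(NA_integer_)   # no 1s anywhere
  idx[tail(which(c(0, diff(idx)) != 1), 1)]   # first index of the last consecutive run
}

last_one_start(test1)          # 36
last_one_start(test2)          # 29
last_one_start(c(0, 1, 1, 0))  # 2 (single run of 1s)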
Fumigant answered 29/6, 2023 at 15:27 Comment(1)
This will fail in case there is only one sequence of 1s. Try it with test1 <- c(0,1,1,0), where I would expect 2 as a result, but here I get integer(0). – Neurology
16

Run rle, then use cumsum to calculate the end position of each run; subtracting the lengths and adding 1 gives the start positions. Restrict those to the runs of 1s only and take the last element. This gives the start position of the last run of 1s, but if you wanted (each variant is sketched after the code below):

  • the end position, just omit the - lengths + 1
  • the last run of 0s, replace the == 1 with == 0
  • the first run of 1s, replace tail with head

If there are no 1s it returns a zero-length numeric vector.

with(rle(test1), tail((cumsum(lengths) - lengths + 1)[values == 1], 1))
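
For illustration, a minimal sketch of the variants described above, applied to test1 (my addition, following the recipe in this answer):

# start of the last run of 1s (as above)
with(rle(test1), tail((cumsum(lengths) - lengths + 1)[values == 1], 1))
# [1] 36

# end of the last run of 1s: omit "- lengths + 1"
with(rle(test1), tail(cumsum(lengths)[values == 1], 1))
# [1] 50

# start of the last run of 0s: replace == 1 with == 0
with(rle(test1), tail((cumsum(lengths) - lengths + 1)[values == 0], 1))
# [1] 26

# start of the first run of 1s: replace tail with head
with(rle(test1), head((cumsum(lengths) - lengths + 1)[values == 1], 1))
# [1] 21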
Kryska answered 29/6, 2023 at 15:1 Comment(0)
14

For completeness, here is the benchmark with a vector of size 30001. Feel free to update this if needed.

x <- c(rep(0,14736),rep(1,413),rep(0,830),rep(1,961),rep(0,274),rep(1,12787))


microbenchmark::microbenchmark(rle_zx8754(x),
                               rle_Grothendieck(x),
                               which_diff_Maël(x),
                               uniqlist_Viking(x),
                               while_Ritchie(x),
                               #Position_Ritchie(x),
                               #detect_index_Ritchie(x),
                               diff_Thomas(x),
                               #regex_Thomas(x),
                               #regexpr_Thomas(x),
                               times = 1000, check='equal')



Unit: microseconds
                 expr   min     lq      mean median     uq      max neval  cld
        rle_zx8754(x) 339.5 350.45  783.9827 357.45 375.15 156630.3  1000   cd
  rle_Grothendieck(x) 352.7 364.75  616.2324 372.60 391.75  11196.5  1000  bc 
   which_diff_Maël(x) 264.2 274.60  404.5521 279.50 292.00   7263.2  1000  b  
   uniqlist_Viking(x)  16.7  22.30   32.1502  25.40  30.65   3524.9  1000 a   
     while_Ritchie(x) 777.6 785.60 1021.0738 801.95 847.15   6739.7  1000    d
       diff_Thomas(x) 279.4 286.90  500.6373 291.20 306.35   9435.5  1000  b  

functions:

x <- c(rep(0,14736),rep(1,413),rep(0,830),rep(1,961),rep(0,274),rep(1,12787))


rle_zx8754 <- function(x){
  r <- rle(x)
  ix <- max(which(r$values == 1))
  sum(r$lengths[ 1:(ix - 1) ]) + 1
}

which_diff_Maël <- function(x){
  idx <- which(x == 1)
  idx[tail(which(diff(idx) != 1), 1) + 1]
}

rle_Grothendieck <- function(x){
  with(rle(x), tail((cumsum(lengths) - lengths + 1)[values == 1], 1))
}

uniqlist_Viking <- function(x){
  y <- data.table:::uniqlist(list(x))
  ifelse(x[y[length(y)]] == 1, y[length(y)], y[length(y) - 1])
}

while_Ritchie <- function(x){
  l <- length(x)
  while (x[l] - x[l - 1] != 1) {
    l <- l - 1
  }
  l
}
Position_Ritchie <- function(x){
  Position(isTRUE, diff(x) == 1, right = TRUE) + 1
}

detect_index_Ritchie <- function(x){
  purrr::detect_index(diff(x) == 1, isTRUE, .dir = "backward") + 1
}

diff_Thomas <- function(x){
  max((2:length(x))[diff(x) == 1])
}

regex_Thomas <- function(x){
  nchar(sub("(.*01).*", "\\1", paste0(x, collapse = "")))
}

regexpr_Thomas <- function(x){
  attr(regexpr(".*(?<=0)1", paste0(x,collapse = ""), perl = TRUE), "match.length")
}

Liddie answered 29/6, 2023 at 16:33 Comment(0)
12

A simple while loop will be a (potentially very) fast approach when the sought index is towards the end of the vector.

f <- function(x) {
  l <- length(x)
  # walk backwards until x[l] is a 1 preceded by a 0,
  # i.e. the start of the last run of 1s
  while (x[l] - x[l - 1] != 1) {
    l <- l - 1
  }
  l
}

f(test1)
[1] 36
f(test2)
[1] 29
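
A guarded variant of the loop for edge cases such as a vector that starts with 1s or contains no 1s at all (my sketch, not part of the original answer):

f_safe <- function(x) {
  l <- length(x)
  # walk backwards until x[l] is a 1 preceded by a 0
  while (l > 1 && x[l] - x[l - 1] != 1) {
    l <- l - 1
  }
  if (l > 1) return(l)
  # no 0-to-1 step found: either the vector starts with a run of 1s, or it has no 1s
  if (x[1] == 1) 1L else NA_integer_
}

f_safe(test1)       # 36
f_safe(c(1, 1, 0))  # 1
f_safe(c(0, 0, 0))  # NA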

We could also use Position() or the purrr equivalent detect_index():

Position(isTRUE, diff(test1) == 1, right = TRUE) + 1
[1] 36
purrr::detect_index(diff(test1) == 1, isTRUE, .dir = "backward") + 1
[1] 36
Ensoll answered 30/6, 2023 at 2:30 Comment(2)
I feel like this would be an obvious approach in most other languages, but in R it's really thinking outside the box. Love it. – Antagonism
Yes. Why waste time looking at the start of the vector? – Parcheesi
9

There are many ways to do this; below are some possible approaches:


  • regex approaches

You can try regex, like sub + nchar

f1 <- function(v) nchar(sub("(.*01).*", "\\1", paste0(v, collapse = "")))

or regexpr

f2 <- function(v) attr(regexpr(".*(?<=0)1", paste0(v,collapse = ""), perl = TRUE), "match.length")

  • diff approaches

Or, some other diff options, like

f3 <- function(v) tail(which(diff(v) == 1) + 1, 1)

and

f4 <- function(v) max((2:length(v))[diff(v) == 1])
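
A quick sanity check of f1–f4 on the question's vectors (my addition; each should return 36 for test1 and 29 for test2):

sapply(list(f1, f2, f3, f4), function(f) c(test1 = f(test1), test2 = f(test2)))
#       [,1] [,2] [,3] [,4]
# test1   36   36   36   36
# test2   29   29   29   29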
Harewood answered 29/6, 2023 at 18:9 Comment(0)
9

Another way, using rev and match.
rev reverses the vector, so that match, which returns the first hit, can be used to find the last sequence of 1s.

f <- \(x) {
  . <- rev(x)
  i <- match(1, .)             # position of the last 1, counted from the end
  if(is.na(i)) return(NA)      # no 1s at all
  j <- match(0, tail(., -i))   # j = length of that last run of 1s
  if(is.na(j)) 1               # the run reaches the start of the vector
  else length(.) - i - j + 2 } # start index of the run in the original order

f(test1)
#[1] 36
f(test2)
#[1] 29
f(c(1,1))
#[1] 1
f(c(0,1))
#[1] 2
f(c(1,0))
#[1] 1
f(c(0,0))
#[1] NA

Or write a function using Rcpp that does the same but iterates starting from the end.

Rcpp::cppFunction("int f2(NumericVector x) {
  auto i = x.end();
  while(i != x.begin() && *(--i) != 1.) ;
  while(i != x.begin() && *(--i) == 1.) ;
  if(*i != 1.) ++i;
  return i == x.end() || *i != 1. ? 0 : i - x.begin() + 1;
}")

f2(test1)
#[1] 36
f2(test2)
#[1] 29
f2(c(1,1))
#[1] 1
f2(c(0,1))
#[1] 2
f2(c(1,0))
#[1] 1
f2(c(0,0))
#[1] 0

Or using rev, diff and match.

f3 <- \(x) {
    i <- match(-1, diff(rev(x)))
    if(is.finite(i)) length(x) - i + 1
    else if(x[1] == 1) 1
    else NA
} 

f3(test1)
#[1] 36
f3(test2)
#[1] 29
f3(c(1,1))
#[1] 1
f3(c(0,1))
#[1] 2
f3(c(1,0))
#[1] 1
f3(c(0,0))
#[1] NA

Benchmark

uniqlist <- function(x) {  #M.Viking
  y <- data.table:::uniqlist(list(x))
  ifelse(x[y[length(y)]] == 1, y[length(y)], y[length(y) - 1]) }

which_diff <- function(x) {  #Maël
  idx <- which(x == 1)
  idx[tail(which(c(0, diff(idx)) != 1), 1)] }
# Dataset from question
x <- rep(c(0,1,0,1,0,1), c(14736,413,830,961,274,12787))
bench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),
 uniqlist(x),  f2(x) )
#  expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 f(x)          199.07µs  251.5µs     3412.    1.21MB    76.3   1341    30
#2 f3(x)         218.05µs 319.61µs     3144.    1.76MB   117.    1079    40
#3 which_diff(x) 155.01µs 177.53µs     5518.  954.17KB   103.    2296    43
#4 uniqlist(x)    17.04µs  17.72µs    55386.    1.36MB     4.04 27442     2
#5 f2(x)           5.61µs   6.13µs   161213.    2.49KB     6.16 78462     3

# Data with many changes between 0 and 1 and hit at end
x <- rep(c(0,1), 1e6)
bench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),
 uniqlist(x),  f2(x) )
#  expression         min   median `itr/sec` mem_alloc `gc/sec`  n_itr  n_gc
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>  <int> <dbl>
#1 f(x)           17.97ms  19.86ms      44.6   76.29MB     50.5     23    26
#2 f3(x)          28.77ms  32.78ms      25.6  114.44MB     52.9     14    29
#3 which_diff(x)  14.47ms  16.91ms      52.3   68.67MB     67.8     27    35
#4 uniqlist(x)     2.66ms      3ms     294.     7.63MB     27.8    148    14
#5 f2(x)           1.08µs   1.28µs  701103.     2.49KB     21.0 100000     3

# Data where hit is at beginning
x <- c(0,1,rep(0, 1e6))
bench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),
 uniqlist(x),  f2(x) )
#  expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 f(x)            4.34ms    6.6ms     131.    19.11MB     84.6    71    46
#2 f3(x)           15.1ms  18.73ms      35.9   57.24MB     75.7    18    38
#3 which_diff(x)   1.37ms   1.44ms     529.     7.63MB     93.9   265    47
#4 uniqlist(x)   470.91µs 491.54µs    1994.     1.36MB      0     997     0
#5 f2(x)         364.46µs 375.08µs    2649.     2.49KB      0    1325     0

# Data where hit is at end
x <- c(rep(0, 1e6),1,0)
bench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),
 uniqlist(x),  f2(x) )
#  expression         min   median `itr/sec` mem_alloc `gc/sec`  n_itr  n_gc
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>  <int> <dbl>
#1 f(x)           10.53ms  11.33ms      69.8   38.18MB     91.8     35    46
#2 f3(x)          14.19ms  17.18ms      37.6   57.24MB     69.3     19    35
#3 which_diff(x)   1.38ms   1.49ms     512.     7.63MB     77.9    256    39
#4 uniqlist(x)   479.76µs 491.61µs    1997.     1.36MB      0      999     0
#5 f2(x)           1.08µs   1.28µs  683440.     2.49KB     27.3 100000     4

The Rcpp function is the fastest and allocates the least memory. Its performance depends on where the match is found.

Neurology answered 4/7, 2023 at 9:0 Comment(5)
The Rcpp version is always the winner, +1! One more comment: searching from the tail is especially suitable when we know that the 1s appear in the latter part of the sequence. From a time-complexity perspective, the worst case for the rev approach (or similar alternatives) is having 1s only at the beginning of the sequence. – Harewood
But how would I know, when starting from the beginning, that there is no 0 1 somewhere near the end? When starting from the end I know that it is the last one and don't need to go through the values all the way to the beginning. But yes, rev already goes over all values. I don't know of a function like match that goes from back to front and allows starting from a specific index, so I wrote it in C++. – Neurology
Yes, we don't have prior info on how the 0s and 1s are distributed in the sequence. My point was that, if we start from the tail, it would be less performant on c(1,1,0,0,0,0,0) than on c(0,0,0,0,0,1,1), just for example. – Harewood
I don't see any way to improve this. When starting from the beginning, the time needed is not affected by the arrangement of the data, but then I have to visit every element. So starting from the beginning is equally performant only when the match is at the beginning; in all other cases it is less performant than starting from the end. And yes, the provided data will affect the time taken during the benchmark. – Neurology
No worries, I was just talking about the worst case, but it's a good thing in your answer that you take advantage of some useful features of the sequence. – Harewood
4

This may not be the best, but here is an easy-to-understand alternative:

library(dplyr)
library(data.table)  # for rleid()

data.frame(var1 = c(rep(0,20), rep(1,5), rep(0,10), rep(1,15))) %>% 
  mutate(new = rleid(var1), row = row_number()) %>% 
  filter(var1 == 1 & max(new) == new) %>% 
  slice_head(n = 1) %>% 
  select(row)

# output

  row
1  36

Infantry answered 29/6, 2023 at 15:56 Comment(0)
4

We can also use rleid from data.table:

library(data.table)

i1 <- rleid(test1)
min(which(i1 == max(i1[test1 == 1])))
# [1] 36
i1 <- rleid(test2)
min(which(i1 == max(i1[test2 == 1])))
# [1] 29
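
Wrapped as a function, with an NA guard for vectors containing no 1s (my sketch, not part of the original answer):

last_run_start <- function(x) {
  if (!any(x == 1)) return(NA_integer_)  # no 1s at all
  i <- rleid(x)                          # run id of each element
  min(which(i == max(i[x == 1])))        # first index of the last run of 1s
}

last_run_start(test1)  # 36
last_run_start(test2)  # 29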
Fragonard answered 30/6, 2023 at 7:42 Comment(0)
3

Using data.table::rleidv()

rle_seq <- data.table::rleidv(test2)          # run id for each element
rle_ones <- rle_seq[test2 != 0]               # run ids belonging to 1s
which_id_last <- rle_ones[length(rle_ones)]   # id of the last run of 1s
which(rle_seq == which_id_last)[1L]           # first index within that run
[1] 29
Inferential answered 1/7, 2023 at 19:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.