How to remove rows with zero in ONLY top and bottom of dataset [duplicate]

Asked 13/3 at 8:22 Answered 13/3 at 8:59

Consider the following example data frame

df=data.frame(x=c(0,3,5,0,7,6,0),y=c(0,0,3,0,0,4,0),z=c(8,7,6,8,9,4,3))

I want to remove the first and last rows, where y=0, that is row 1,2 and 7 - without removing row 4 and 5.

I can filter out any row with zero using filter(!y==0) and can see slice_head() and slice_tail() recommended for deleting based on location (n=). I am looking for a way to conditionally remove head and tail based on y.

The full dataset consists of 200,000 rows with data collected across dates and id's. I will be applying this per day and id, using group_by(id,date). The length of the head and tail with zeros varries across dates, thus I cannot use slice_head(n=2).

I am working in tidyverse (mainly/so far).

Thanks in advance :)

Ventricle answered 13/3 at 8:22 Comment(0)

Using Position() to avoid scanning the whole vector:

df[Position(\(x) x!=0, df$y):Position(\(x) x!=0, df$y, right = TRUE), ]
#   x y z
# 3 5 3 6
# 4 0 0 8
# 5 7 0 9
# 6 6 4 4

One dplyr option:

library(dplyr)
 df |>
  #group_by(id, date)
  slice(foo(y))

Where

foo <- function(vec) Position(\(x) x!=0, vec):Position(\(x) x!=0, vec, right = TRUE)

Example of performance gain with Position when a big part of the vector can be skipped (just showing in isolation that there can be gains with Position).

set.seed(10)
x <- sample(c(0,1), prob = c(0.99,0.01), size = 10e4, replace =TRUE)
microbenchmark::microbenchmark(
  head(which(x!=0),1), 
  head(which(cumsum(!!x) > 0), 1),
  Position = Position(\(x) x!=0, x),
  Position2 = (\(pred, x) for (i in seq_along(x)) if (pred(x[i])) return(i))(\(x) x!=0, x)
) 
# Unit: microseconds
#                             expr   min      lq     mean  median      uq    max neval
#           head(which(x != 0), 1) 327.8  332.10  386.538  342.65  405.75  966.7   100
#  head(which(cumsum(!!x) > 0), 1) 993.8 1024.95 1311.003 1077.95 1219.60 8659.3   100
#                         Position  63.1   65.00   78.374   68.15   71.15  719.2   100
#                        Position2  62.4   63.75   97.533   65.40   68.35 2881.3   100

Myoglobin answered 13/3 at 8:33 Comment(5)

wow, this is my first time to see such a fun function Position, cheers! – Teresetereshkova 13/3 at 8:36

Definitely undervalued I would say! – Myoglobin 13/3 at 8:37

Added benchmark answer below, feel free to edit. Position is a winner so far. Good find, first time I see it in use. – Sha 13/3 at 9:19

@Sha Thanks. It's still not winning by much. The inefficiency with my Position solution is creating the indices with :. – Myoglobin 13/3 at 9:21

I see, just noticed your benchmark, you are getting the 1st and not the range. Indeed, ":" is slowing it down by 4-5x. – Sha 13/3 at 9:24

You can try cumsum + rev (for code golfing if you are interested)

> subset(df, cumsum(!!y) & rev(cumsum(rev(!!y))))
  x y z
3 5 3 6
4 0 0 8
5 7 0 9
6 6 4 4

or to avoid the !! ambiguity (thanks for @moodymudskipper's comment)

> subset(df, cumsum(y != 0) & rev(cumsum(rev(y) != 0)))
  x y z
3 5 3 6
4 0 0 8
5 7 0 9
6 6 4 4

where

cumsum indicates the positions from first non-zero value
rev searches from tail to head

Another option is using findInterval + range

> subset(df, findInterval(seq_along(y), range(which(!!y)), rightmost.closed = TRUE) == 1)
  x y z
3 5 3 6
4 0 0 8
5 7 0 9
6 6 4 4

Teresetereshkova answered 13/3 at 8:31 Comment(4)

Why not cumsum(y) != 0 & ... ? – Regretful 13/3 at 9:3

@Regretful I found a counter example that cumsum(y)!=0 won't work: df <- data.frame(x = c(0, 3, 5, 0, 7, 6, 0), y = c(0, 0, 1, -1, 0, 4, 0), z = c(8, 7, 6, 8, 9, 4, 3)) – Teresetereshkova 13/3 at 9:30

ah, yes indeed. We could do this to spare a couple characters and avoid the !! ambiguity : `subset(df, cumsum(y != 0) & rev(cumsum(rev(y) != 0))) – Regretful 13/3 at 14:16

@Regretful yes, that makes sense to avoid !! ambiguity. I added yours in the solution – Teresetereshkova 13/3 at 14:34

Find 1st and last non-zero value for column y, then subset with range first:last:

ix <- which(df$y != 0)
df[ head(ix, 1):tail(ix, 1), ]
#   x y z
# 3 5 3 6
# 4 0 0 8
# 5 7 0 9
# 6 6 4 4

Sha answered 13/3 at 8:59 Comment(0)

Benchmark:

microbenchmark::microbenchmark(
  cumsum1 = { 
    subset(dfbig, cumsum(!!y) > 0 & rev(cumsum(rev(!!y)) > 0)) 
  },
  cumsum2 = { 
    subset(dfbig, cumsum(y) > 0 & rev(cumsum(rev(y)) > 0))
  },
  findInterval ={
    subset(dfbig, findInterval(seq_along(y), range(which(!!y)), rightmost.closed = TRUE) == 1)
  },
  position = {
    dfbig[Position(\(x) x!=0, dfbig$y):Position(\(x) x!=0, dfbig$y, right = TRUE), ]  
  },
  which = {
    ix <- which(dfbig$y != 0)
    dfbig[ head(ix, 1):tail(ix, 1), ]  
  },
  setup = { 
    set.seed(1)
    dfbig <- data.frame(y = sample(0:10, 1e6, replace = TRUE))
  },
  unit = "relative")

Unit: relative
         expr      min       lq     mean   median       uq       max neval
      cumsum1 7.385675 9.118432 7.714525 9.184558 8.294421  7.614359   100
      cumsum2 6.074707 7.346813 6.502870 7.456071 7.017192  7.139040   100
 findInterval 6.272723 7.081025 5.978390 7.425544 6.604673 10.652001   100
     position 1.000000 1.000000 1.000000 1.000000 1.000000  1.000000   100
        which 1.144340 1.155399 1.367923 1.178065 1.102273  5.456092   100

Sha answered 13/3 at 8:22 Comment(0)

Recommended topics

Hot tags