How to remove rows with zero in ONLY top and bottom of dataset [duplicate]
Asked Answered
V

4

7

Consider the following example data frame

df=data.frame(x=c(0,3,5,0,7,6,0),y=c(0,0,3,0,0,4,0),z=c(8,7,6,8,9,4,3))

example dataframe

I want to remove the first and last rows, where y=0, that is row 1,2 and 7 - without removing row 4 and 5.

I can filter out any row with zero using filter(!y==0) and can see slice_head() and slice_tail() recommended for deleting based on location (n=). I am looking for a way to conditionally remove head and tail based on y.

The full dataset consists of 200,000 rows with data collected across dates and id's. I will be applying this per day and id, using group_by(id,date). The length of the head and tail with zeros varries across dates, thus I cannot use slice_head(n=2).

I am working in tidyverse (mainly/so far).

Thanks in advance :)

Ventricle answered 13/3 at 8:22 Comment(0)
M
9

Using Position() to avoid scanning the whole vector:

df[Position(\(x) x!=0, df$y):Position(\(x) x!=0, df$y, right = TRUE), ]
#   x y z
# 3 5 3 6
# 4 0 0 8
# 5 7 0 9
# 6 6 4 4

One dplyr option:

library(dplyr)
 df |>
  #group_by(id, date)
  slice(foo(y))

Where

foo <- function(vec) Position(\(x) x!=0, vec):Position(\(x) x!=0, vec, right = TRUE)

Example of performance gain with Position when a big part of the vector can be skipped (just showing in isolation that there can be gains with Position).

set.seed(10)
x <- sample(c(0,1), prob = c(0.99,0.01), size = 10e4, replace =TRUE)
microbenchmark::microbenchmark(
  head(which(x!=0),1), 
  head(which(cumsum(!!x) > 0), 1),
  Position = Position(\(x) x!=0, x),
  Position2 = (\(pred, x) for (i in seq_along(x)) if (pred(x[i])) return(i))(\(x) x!=0, x)
) 
# Unit: microseconds
#                             expr   min      lq     mean  median      uq    max neval
#           head(which(x != 0), 1) 327.8  332.10  386.538  342.65  405.75  966.7   100
#  head(which(cumsum(!!x) > 0), 1) 993.8 1024.95 1311.003 1077.95 1219.60 8659.3   100
#                         Position  63.1   65.00   78.374   68.15   71.15  719.2   100
#                        Position2  62.4   63.75   97.533   65.40   68.35 2881.3   100 
Myoglobin answered 13/3 at 8:33 Comment(5)
wow, this is my first time to see such a fun function Position, cheers!Teresetereshkova
Definitely undervalued I would say!Myoglobin
Added benchmark answer below, feel free to edit. Position is a winner so far. Good find, first time I see it in use.Sha
@Sha Thanks. It's still not winning by much. The inefficiency with my Position solution is creating the indices with :.Myoglobin
I see, just noticed your benchmark, you are getting the 1st and not the range. Indeed, ":" is slowing it down by 4-5x.Sha
T
3

You can try cumsum + rev (for code golfing if you are interested)

> subset(df, cumsum(!!y) & rev(cumsum(rev(!!y))))
  x y z
3 5 3 6
4 0 0 8
5 7 0 9
6 6 4 4

or to avoid the !! ambiguity (thanks for @moodymudskipper's comment)

> subset(df, cumsum(y != 0) & rev(cumsum(rev(y) != 0)))
  x y z
3 5 3 6
4 0 0 8
5 7 0 9
6 6 4 4

where

  • cumsum indicates the positions from first non-zero value
  • rev searches from tail to head

Another option is using findInterval + range

> subset(df, findInterval(seq_along(y), range(which(!!y)), rightmost.closed = TRUE) == 1)
  x y z
3 5 3 6
4 0 0 8
5 7 0 9
6 6 4 4
Teresetereshkova answered 13/3 at 8:31 Comment(4)
Why not cumsum(y) != 0 & ... ?Regretful
@Regretful I found a counter example that cumsum(y)!=0 won't work: df <- data.frame(x = c(0, 3, 5, 0, 7, 6, 0), y = c(0, 0, 1, -1, 0, 4, 0), z = c(8, 7, 6, 8, 9, 4, 3))Teresetereshkova
ah, yes indeed. We could do this to spare a couple characters and avoid the !! ambiguity : `subset(df, cumsum(y != 0) & rev(cumsum(rev(y) != 0)))Regretful
@Regretful yes, that makes sense to avoid !! ambiguity. I added yours in the solutionTeresetereshkova
S
3

Find 1st and last non-zero value for column y, then subset with range first:last:

ix <- which(df$y != 0)
df[ head(ix, 1):tail(ix, 1), ]
#   x y z
# 3 5 3 6
# 4 0 0 8
# 5 7 0 9
# 6 6 4 4
Sha answered 13/3 at 8:59 Comment(0)
S
2

Benchmark:

microbenchmark::microbenchmark(
  cumsum1 = { 
    subset(dfbig, cumsum(!!y) > 0 & rev(cumsum(rev(!!y)) > 0)) 
  },
  cumsum2 = { 
    subset(dfbig, cumsum(y) > 0 & rev(cumsum(rev(y)) > 0))
  },
  findInterval ={
    subset(dfbig, findInterval(seq_along(y), range(which(!!y)), rightmost.closed = TRUE) == 1)
  },
  position = {
    dfbig[Position(\(x) x!=0, dfbig$y):Position(\(x) x!=0, dfbig$y, right = TRUE), ]  
  },
  which = {
    ix <- which(dfbig$y != 0)
    dfbig[ head(ix, 1):tail(ix, 1), ]  
  },
  setup = { 
    set.seed(1)
    dfbig <- data.frame(y = sample(0:10, 1e6, replace = TRUE))
  },
  unit = "relative")

Unit: relative
         expr      min       lq     mean   median       uq       max neval
      cumsum1 7.385675 9.118432 7.714525 9.184558 8.294421  7.614359   100
      cumsum2 6.074707 7.346813 6.502870 7.456071 7.017192  7.139040   100
 findInterval 6.272723 7.081025 5.978390 7.425544 6.604673 10.652001   100
     position 1.000000 1.000000 1.000000 1.000000 1.000000  1.000000   100
        which 1.144340 1.155399 1.367923 1.178065 1.102273  5.456092   100
Sha answered 13/3 at 8:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.