Filter for complete cases in data.frame using dplyr (case-wise deletion)

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).

library(dplyr)
df = data.frame(
    x1 = c(1,2,3,NA),
    x2 = c(1,2,NA,5)
)

df %>%
  filter(complete.cases(x1, x2))
Lowrey answered 12/3, 2014 at 13:50 Comment(3)
complete.cases doesn't just accept vectors. It takes whole data frames, as well.Labonte
But that doesn't work as part of dplyr's filter function. I guess I wasn't clear enough and updated my question.Lowrey
It would help if you could demonstrate exactly how it doesn't work with dplyr, but when I try it with filter, it works just fine.Labonte
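
As the comments note, complete.cases() accepts a whole data frame and returns one logical value per row, which is exactly the shape filter() expects. A minimal sketch using the df defined above:

complete.cases(df)
## [1]  TRUE  TRUE FALSE FALSE

df %>% filter(complete.cases(df))
##   x1 x2
## 1  1  1
## 2  2  2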

Try this:

df %>% na.omit

or this:

df %>% filter(complete.cases(.))

or this:

library(tidyr)
df %>% drop_na

If you want to filter based on one variable's missingness, use a conditional:

df %>% filter(!is.na(x1))

or

df %>% drop_na(x1)

Other answers indicate that, of the solutions above, na.omit is much slower; but that has to be balanced against the fact that it returns the row indices of the omitted rows in the na.action attribute, whereas the other solutions above do not.

str(df %>% na.omit)
## 'data.frame':   2 obs. of  2 variables:
##  $ x1: num  1 2
##  $ x2: num  1 2
##  - attr(*, "na.action")= 'omit' Named int  3 4
##    ..- attr(*, "names")= chr  "3" "4"
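
If you later need to know which rows were dropped, that attribute can be pulled back out with stats::na.action(); a minimal sketch with the same df:

na.action(na.omit(df))
## 3 4 
## 3 4 
## attr(,"class")
## [1] "omit"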

ADDED: Updated to reflect the latest versions of dplyr and tidyr and the comments.

Shiprigged answered 12/3, 2014 at 16:51 Comment(4)
Just came back to answer and saw your useful answer!Wien
Thanks! I added some benchmark results. na.omit() performs pretty poorly but the other ones are fast.Lowrey
This works now as well: df %>% filter(complete.cases(.)). Not sure whether recent changes in dplyr made this possible.Lowrey
As @jan-katins points out, the Tidyverse function is called drop_na, so you can now do: df %>% drop_na().Dogoodism

This works for me:

df %>%
  filter(complete.cases(df))    

Or a little more general:

library(dplyr) # 0.4
df %>% filter(complete.cases(.))

This has the advantage that the data can be modified earlier in the chain before being passed to the filter.
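
For example (a minimal sketch, using the question's df and a made-up derived column x3), the dot refers to the data as it stands at that point in the chain, not to the original df:

df %>%
  mutate(x3 = x1 + x2) %>%
  filter(complete.cases(.))
#   x1 x2 x3
# 1  1  1  2
# 2  2  2  4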

Another benchmark with more columns:

set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
  na.omit = {df %>% na.omit},
  filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
  rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
  filter = {df %>% filter(complete.cases(.))},
  times = 20L,
  unit = "relative")

#Unit: relative
#             expr       min        lq    median         uq       max neval
#          na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233    20
# filter.anonymous  1.149305  1.022891  1.013779  0.9948659  4.668691    20
#          rowSums  2.281002  2.377807  2.420615  2.3467519  5.223077    20
#           filter  1.000000  1.000000  1.000000  1.0000000  1.000000    20
Bombsight answered 13/5, 2014 at 11:4 Comment(4)
I updated your answer with "." in the complete.cases and added benchmark - hope you don't mind :-)Schell
I found df %>% slice(which(complete.cases(.))) performed ~ 20% faster than the filter-approach in the benchmark above.Schell
It's worth noting that if you are using this filter in a dplyr pipe with other dplyr commands (such as group_by()), you will need to add %>% data.frame() %>% before you try and filter on complete.cases(.) because it won't work on tibbles or grouped tibbles or something. Or at least, that has been the experience I have had.Touchy
I found that filter(complete.cases(.)) works with 'tibbles' but does not work with grouped 'tibbles'. You can use ungroup() if you want to keep your data in a 'tibble' rather than convert to a proper data frame.Scandent
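
Following the comments above, a minimal sketch of the ungroup() workaround (the grouped step here is just a hypothetical example):

df %>%
  group_by(x1) %>%
  mutate(n_in_group = n()) %>%  # some hypothetical grouped computation
  ungroup() %>%                 # drop the grouping before the row-wise filter
  filter(complete.cases(.))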

Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this, maybe as part of filter.

library('rbenchmark')
library('dplyr')

n = 5e6
n.na = 100000
df = data.frame(
    x1 = sample(1:10, n, replace=TRUE),
    x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA


benchmark(
    df %>% filter(complete.cases(x1,x2)),
    df %>% na.omit(),
    df %>% (function(x) filter(x, complete.cases(x)))(),
    replications = 50)

#                                                  test replications elapsed relative
# 3 df %>% (function(x) filter(x, complete.cases(x)))()           50   5.422    1.000
# 1               df %>% filter(complete.cases(x1, x2))           50   6.262    1.155
# 2                                    df %>% na.omit()           50 109.618   20.217
Lowrey answered 13/3, 2014 at 21:35 Comment(0)

This is a short function that lets you specify the columns (basically anything dplyr::select can understand) that should not contain any NA values (modeled after pandas df.dropna()):

drop_na <- function(data, ...){
    if (missing(...)) {
        # no columns given: require complete cases across the whole data frame
        f <- complete.cases(data)
    } else {
        # otherwise only across the selected columns
        f <- complete.cases(select(data, ...))
    }
    filter(data, f)
}

[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]

Examples:

library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
Ijssel answered 4/5, 2016 at 14:47 Comment(3)
Wouldn't it be even more useful to be able to add a cutoff like 0.5 and have it process by columns? Case: eliminate variables with 50% or more missing data. Example: data[, -which(colMeans(is.na(data)) > 0.5)] It would be nice to be able to do this with tidyr.Gripping
@Gripping This would mean that the addition of more data (where a variable then has lots of NA) could fail the next step in the pipeline because a needed variable is now missing...Ijssel
Right, that makes sense.Gripping

Try this:

df[complete.cases(df), ]  # output to console

Or even this:

df.complete <- df[complete.cases(df), ]  # assign to a new data.frame

Both commands check for completeness across all the columns (variables) in your data.frame.
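
If only some columns should count towards completeness, the same base R pattern can be restricted to a column subset (a minimal sketch, assuming you only care about x1):

df[complete.cases(df$x1), ]  # keeps rows 1-3; the NA in x2 on row 3 is ignored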

Wien answered 12/3, 2014 at 13:59 Comment(3)
Thanks. I guess I wasn't clear enough though (question updated). I know about complete.cases(df) but I would like to do it with dplyr as part of the filter function. That would allow a neat integration in dplyr chains etc.Lowrey
In dplyr:::do.data.frame the statement env$. <- .data adds the dot to the environment. There is no such statement in magrittr's %>%.Shiprigged
Sorry must have entered the comment in wrong place.Shiprigged

Just for the sake of completeness, dplyr::filter can be avoided altogether while still composing chains, simply by using magrittr::extract (an alias for [):

library(magrittr)
df = data.frame(
  x1 = c(1,2,3,NA),
  x2 = c(1,2,NA,5))

df %>%
  extract(complete.cases(.), )

An additional bonus is speed: this was the fastest method among the filter and na.omit variants (tested using @Miha Trošt's microbenchmarks).
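
A quick way to check this on your own data is a small microbenchmark (a minimal sketch using the question's df; as the comments below point out, the ranking changes with the size of the data):

library(dplyr)           # for filter(); magrittr and df come from the chunk above
library(microbenchmark)

microbenchmark(
  extract = df %>% extract(complete.cases(.), ),
  filter  = df %>% filter(complete.cases(.)),
  times = 100L
)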

Grewitz answered 23/6, 2016 at 8:9 Comment(2)
When I do the benchmark with the data by Miha Trošt, I find that using extract() is almost ten times slower than filter(). However, when I create a smaller data frame with df <- df[1:100, 1:10], the picture changes and extract() is the fastest.Duumvir
You are correct. It looks like magrittr::extract is the fastest way only when n <= 5e3 in Miha Trošt's benchmark.Grewitz

dplyr >= 1.0.4

if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This can be useful if you have other variables in your data frame that are not part of what you consider a complete case. For example, to keep only rows that are non-missing in the columns that start with "x":

library(dplyr)
df = data.frame(
  x1 = c(1,2,3,NA),
  x2 = c(1,2,NA,5),
  y = c(NA, "A", "B", "C")
)

df %>% 
  dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))

  x1 x2    y
1  1  1 <NA>
2  2  2    A
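
The same rows can also be kept by negating if_any, which some may find reads more naturally (a sketch with the same df):

df %>% 
  dplyr::filter(!if_any(starts_with("x"), is.na))

This returns the same two rows as the if_all() version above.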

For more information on these functions, see the dplyr documentation for across(), if_any(), and if_all().

El answered 8/2, 2021 at 21:48 Comment(0)
