Remove rows with all or some NAs (missing values) in data.frame
Asked Answered
A

20

1080

I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Basically, I'd like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

b) contain NAs in only some columns, so I can also get this result:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2
Antaeus answered 1/2, 2011 at 11:52 Comment(0)
L
1291

Also check complete.cases :

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is quite a lot more clear, and faster.

Lion answered 1/2, 2011 at 12:21 Comment(4)
What is the significance of the trailing comma in final[complete.cases(final),]?Baran
the complete.cases(final) returns a boolean of rows where there are no NA like (TRUE, FALSE, TRUE). The trailing comma says all columns. Thus, before the comma you filter on the rows but after the comma you column you do no filtering and ask for everythingReese
Thanks for this solution, I was not aware that we could specify columns with the complete.cases statement.Butterfly
Notice that na.omit will remove rows which contains at least one NA (i.e., it is sufficient for one column to be NA for the row to be removed). Same for complete.cases: it returns FALSE if one (or more) of the columns you pass are NAPatinated
S
319

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).

Sofar answered 1/2, 2011 at 12:0 Comment(0)
W
232

tidyr has a new function drop_na:

library(tidyr)
df %>% drop_na()
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 6 ENSG00000221312    0    1    2    3    2
df %>% drop_na(rnor, cfam)
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 4 ENSG00000207604    0   NA   NA    1    2
# 6 ENSG00000221312    0    1    2    3    2
Whitehot answered 16/8, 2016 at 8:49 Comment(2)
What are the advantages of drop_na() over na.omit()? Faster?Pa
Notice this removes rows if one or more columns are NA (i.e., it is not necessary that all columns be NA)Patinated
C
105

I prefer following way to check whether rows contain any NAs:

row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns logical vector with values denoting whether there is any NA in a row. You can use it to see how many rows you'll have to drop:

sum(row.has.na)

and eventually drop them

final.filtered <- final[!row.has.na,]

For filtering rows with certain part of NAs it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.

Cristalcristate answered 2/2, 2011 at 21:58 Comment(3)
This is extremely slow. Much slower than e.g. the aforementioned complete.cases() solution. At least, in my case, on xts data.Manage
rowSum(!is.na(final)) seems better suited than apply()Propylite
Shouldn't it be final[rowSum(is.na(final)),]?Althaalthea
F
68

If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

By default, it will eliminate all NAs:

delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
Functionary answered 26/5, 2015 at 14:49 Comment(2)
This is the most reliable method to remove rows, when you need at least a number of NAs to remove that row. Helped me a lot!Flodden
Please see slightly faster version below. On this occasion, thanks very much for sharing this great solution, which I have been using ever since.Latish
H
58

If performance is a priority, use data.table and na.omit() with optional param cols=.

na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

If you don't want to use data.table, use complete.cases().

On a vanilla data.frame, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame does not support cols=.

Benchmark result

Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

Your results may vary based on length, width, and sparsity of your particular dataset.

Note log scale on y axis.

enter image description here

Benchmark script

#-------  Adjust these assumptions for your own use case  ------------
row_size   <- 1e6L 
col_size   <- 20    # not including ID column
p_missing  <- 0.05   # likelihood of missing observation (except ID col)
col_subset <- 18:21  # second part of question: filter on select columns

#-------  System info for benchmark  ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr);      packageVersion('dplyr')      # 0.7.4
library(tidyr);      packageVersion('tidyr')      # 0.8.0
library(microbenchmark)

#-------  Example dataset using above assumptions  --------------------
fakeData <- function(m, n, p){
  set.seed(123)
  m <-  matrix(runif(m*n), nrow=m, ncol=n)
  m[m<p] <- NA
  return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)), 
                        stringsAsFactors = FALSE),
             data.frame(fakeData(row_size, col_size, p_missing) )
             )
dt <- data.table(df)

par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
  microbenchmark(
    df[complete.cases(df), ],
    na.omit(df),
    df %>% drop_na,
    dt[complete.cases(dt), ],
    na.omit(dt)
  ), xlab='', 
  main = 'Performance: Drop any NA observation',
  col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
  microbenchmark(
    df[complete.cases(df[,col_subset]), ],
    #na.omit(df), # col subset not supported in na.omit.data.frame
    df %>% drop_na(col_subset),
    dt[complete.cases(dt[,col_subset,with=FALSE]), ],
    na.omit(dt, cols=col_subset) # see ?na.omit.data.table
  ), xlab='', 
  main = 'Performance: Drop NA obs. in select cols',
  col=c('lightblue','salmon',rep('beige',2))
)
Heavy answered 16/2, 2018 at 15:41 Comment(0)
A
53

Another option if you want greater control over how rows are deemed to be invalid is

final <- final[!(is.na(final$rnor)) | !(is.na(rawdata$cfam)),]

Using the above, this:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Becomes:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.

Andesite answered 5/11, 2013 at 6:30 Comment(1)
but how can you use this if you want to check many columns, without typing each one, can you use a range final[,4:100]?Frady
S
23

Using dplyr package we can filter NA as follows:

dplyr::filter(df,  !is.na(columnname))
Shoulder answered 12/4, 2017 at 5:44 Comment(3)
This performs about 10.000 times slower than drop_na()Adventure
@Adventure Maybe true but for multiple variables drop_na uses "any" logic and filter uses "all" logic. So if you need more flexiblity in expression, filter has more possibilities.Awakening
@Awakening That's absolutely true! It really depends on what you're trying to achieve :)Adventure
I
20

This will return the rows that have at least ONE non-NA value.

final[rowSums(is.na(final))<length(final),]

This will return the rows that have at least TWO non-NA value.

final[rowSums(is.na(final))<(length(final)-1),]
Impossibility answered 19/9, 2014 at 12:36 Comment(1)
This is perfect, I was actually looking for a similar solution that could allow us to keep only those rows (irrespective of the columns) that have one or more non-NA values. However, this code is not working in my case. It retains all rows, even those that have all NAs.Butterfly
B
20

One approach that's both general and yields fairly-readable code is to use the filter() function and the across() helper functions from the {dplyr} package.

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries

df %>% 
  filter(across(one_of(vars_to_check),
                ~ !is.na(.x)))

# Filter all the columns to exclude NA
df %>% 
  filter(across(everything(),
                ~ !is.na(.)))

# Filter only numeric columns
df %>%
  filter(across(where(is.numeric),
                ~ !is.na(.)))

Similarly, there are also the variant functions in the dplyr package (filter_all, filter_at, filter_if) which accomplish the same thing:

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries
df %>% 
  filter_at(.vars = vars(one_of(vars_to_check)),
            ~ !is.na(.))

# Filter all the columns to exclude NA
df %>% 
  filter_all(~ !is.na(.))

# Filter only numeric columns
df %>%
  filter_if(is.numeric,
            ~ !is.na(.))
Bimonthly answered 23/4, 2019 at 17:21 Comment(1)
See here for another example using acrossAwakening
C
19

For your first question, I have a code that I am comfortable with to get rid of all NAs. Thanks for @Gregor to make it simpler.

final[!(rowSums(is.na(final))),]

For the second question, the code is just an alternation from the previous solution.

final[as.logical((rowSums(is.na(final))-5)),]

Notice the -5 is the number of columns in your data. This will eliminate rows with all NAs, since the rowSums adds up to 5 and they become zeroes after subtraction. This time, as.logical is necessary.

Candide answered 9/2, 2016 at 17:52 Comment(1)
final[as.logical((rowSums(is.na(final))-ncol(final))),] for a universal answerCrockett
C
14

We can also use the subset function for this.

finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))

This will give only those rows that do not have NA in both mmul and rnor

Cline answered 11/11, 2014 at 22:15 Comment(0)
D
13

dplyr 1.0.4 introduced two companion functions to filter: they are if_any() and if_all(). The if_all() companion function will be particularly useful in this case:

a) To remove rows that contain NAs across all columns

df %>% 
  filter(if_all(everything(), ~ !is.na(.x)))

This line will keep only those rows where none of the columns have NAs.

b) To remove rows that contain NAs in only some columns

cols_to_check = c("rnor", "cfam")

df %>% 
  filter(if_all(cols_to_check, ~ !is.na(.x)))

This line will check if any of the specified columns (cols_to_check) have NAs, and only keep those rows where this is not the case.

Diella answered 29/6, 2021 at 14:38 Comment(2)
If rows with at least one NA value needs to be removed, one can try df %>% filter(if_any(everything(), ~!is.na(.x))) or na.omit(df).Empiricism
which solution takes less time, the one you suggested or this one: df %>% filter(!if_all(everything(), ~ is.na(.)))Thermophone
A
10

Assuming dat as your dataframe, the expected output can be achieved using

1.rowSums

> dat[!rowSums((is.na(dat))),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

2.lapply

> dat[!Reduce('|',lapply(dat,is.na)),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2
Aqueduct answered 15/3, 2017 at 16:51 Comment(0)
T
9

I am a synthesizer:). Here I combined the answers into one function:

#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
    if (!is.null(col)) {
        # R converts a single row/col to a vector if the parameter col has only one col
        # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
        df.temp = df[,col,drop=FALSE]
    } else {
        df.temp = df
    }

    if (length(n)==1){
        if (n==0) {
            # simply call complete.cases which might be faster
            result = df[complete.cases(df.temp),]
        } else {
            # credit: https://mcmap.net/q/53082/-remove-rows-with-all-or-some-nas-missing-values-in-data-frame
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) sum(x) == n)
            result = df[logindex, ]
        }
    }

    if (length(n)==2){
        min = n[1]; max = n[2]
        log <- apply(df.temp, 2, is.na)
        logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
        result = df[logindex, ]
    }

    return(result)
}
Triple answered 3/2, 2016 at 17:48 Comment(0)
C
4
delete.dirt <- function(DF, dart=c('NA')) {
  dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF <- DF[dirty_rows, ]
}

mydata <- delete.dirt(mydata)

Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA and ? change dart=c('NA') in function param to dart=c('NA', '?')

Corral answered 22/2, 2018 at 22:19 Comment(0)
V
4

My guess is that this could be more elegantly solved in this way:

  m <- matrix(1:25, ncol = 5)
  m[c(1, 6, 13, 25)] <- NA
  df <- data.frame(m)
  library(dplyr) 
  df %>%
  filter_all(any_vars(is.na(.)))
  #>   X1 X2 X3 X4 X5
  #> 1 NA NA 11 16 21
  #> 2  3  8 NA 18 23
  #> 3  5 10 15 20 NA
Vincents answered 8/5, 2018 at 20:35 Comment(1)
this will retain rows with NA. I think what the OP wants is: df %>% filter_all(all_vars(!is.na(.)))Biscay
T
3

In case you want to remove only the rows having NAs in all columns, here is the solution:

df %>%
    filter(!if_all(everything(), ~  is.na(.)))
Thermophone answered 27/11, 2022 at 18:11 Comment(0)
G
0

Another option is the na_omit function of collapse:

na_omit(df)
#              gene hsap mmul mmus rnor cfam
# 1 ENSG00000199674    0    2    2    2    2
# 2 ENSG00000221312    0    1    2    3    2

Or, for selected columns:

na_omit(df, cols = c("rnor", "cfam")) #Alternatively, works with a function, index or logical vector

#              gene hsap mmul mmus rnor cfam
# 1 ENSG00000199674    0    2    2    2    2
# 2 ENSG00000207604    0   NA   NA    1    2
# 3 ENSG00000221312    0    1    2    3    2

collapse::na_omit is faster than any other solutions provided here, including data.table::na.omit and tidyr::drop_na:

#Using data from https://mcmap.net/q/53082/-remove-rows-with-all-or-some-nas-missing-values-in-data-frame
mb <- 
  microbenchmark(
  collapse = na_omit(dt),
  dt = na.omit(dt), 
  base = na.omit(df),
  complete.cases = df[complete.cases(df), ],
  tidyr = drop_na(df)
)

# Unit: milliseconds
#            expr      min       lq     mean   median        uq       max neval
#        collapse  70.6927 130.5431 191.9058 156.6320  217.1957  915.6074   100
#              dt  76.5151 130.7049 195.9737 172.6981  215.2754  735.1839   100
#            base 406.3456 778.5028 900.1791 860.7407 1005.6444 2009.2036   100
#  complete.cases 295.6927 497.8572 649.3397 575.9408  715.9452 2137.3366   100
#           tidyr  83.4586 145.3758 207.3149 180.9264  242.0750  838.3654   100
Groot answered 8/5, 2023 at 9:14 Comment(0)
L
0

For removing rows with some missings, i.e. where the columns are not known beforehand, we have @PierreL's great answer using rowSums().

I would like to quickly point out a slightly faster way using rowSums2 from the matrixStats package. For a 200,000x1,000 matrix, this saves almost a second.

The function is applied in the same way. (Note, that the result of is.na() is already of class "matrix", otherwise rowSums2 would have complained.)

M[matrixStats::rowSums2(is.na(M)) < dim(M)[2]*1e-2, ]  ## allow 1% missings per row

Benchmark

Unit: seconds
     expr      min       lq     mean   median       uq      max neval
  rowSums 3.083927 3.104315 3.135399 3.131235 3.162080 3.209985   100
 rowSums2 2.129368 2.241577 2.347562 2.287717 2.494494 2.683859   100

enter image description here

Code:

m=2e5; n=1e3
set.seed(42)
M <- matrix(rpois(n*m, 2), m, n)
M[sample(seq_along(M), m*n*.01)] <- NA_integer_

microbenchmark::microbenchmark(
  rowSums=M[rowSums(is.na(M)) < dim(M)[2]*1e-2, ],
  rowSums2=M[matrixStats::rowSums2(is.na(M)) < dim(M)[2]*1e-2, ],
  check='identical'
  ) |> print()

Rscript --vanilla ~/is_na_issue.R
Latish answered 23/7, 2023 at 17:5 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.