Remove rows which have all NAs in certain columns
Asked Answered
H

5

14

Suppose you have a dataframe with 9 columns. You want to remove cases which have all NAs in columns 5:9. It's not at all relevant if there are NAs in columns 1:4.

So far I have found functions that allow you to remove rows that have NAs in any of the columns 5:9, but I specifically need to remove only those that have all NAs in columns 5:9.

I wrote my own function to do this, but since I have 300k+ rows, it's very slow. I was wondering is there a more efficient way? This is my code:

remove.select.na<-function(x, cols){
  nrm<-vector("numeric")
  for (i in 1:nrow(x)){
    if (sum(is.na(x[i,cols]))<length(cols)){
      nrm<-c(nrm,i)
    }
    #Console output to track the progress
    cat('\r',paste0('Checking row ',i,' of ',nrow(x),' (', format(round(i/nrow(x)*100,2), nsmall = 2),'%).'))
    flush.console()
  }
  x<-x[nrm,]
  rm(nrm)
  return(x)
}

where x is the dataframe and cols is a vector containing names of the columns that should be checked for NAs.

Harslet answered 30/7, 2018 at 14:47 Comment(0)
F
12

This a one-liner to remove the rows with NA in all columns between 5 and 9. By combining rowSums() with is.na() it is easy to check whether all entries in these 5 columns are NA:

x <- x[rowSums(is.na(x[,5:9]))!=5,]
Felix answered 30/7, 2018 at 18:37 Comment(0)
A
8

You can use all with apply to find rows where all values are NA:

x[!apply(is.na(x[,5:9]), 1, all),]

or negate is.na and test for any:

x[apply(!is.na(x[,5:9]), 1, any),]

or using rowSums like @RHertel wher you dont need to calculate the number of selected rows:

x[rowSums(!is.na(x[,5:9])) > 0,]
Analysand answered 7/1, 2020 at 14:27 Comment(0)
A
6

Here are two dplyr options:

library(dplyr)
df <- data_frame(a = c(0, NA, 0, 4, NA, 0, 6), b = c(1, NA, 0, 4, NA, 0, NA), c = c(1, 0, 1, NA, NA, 0, NA))


# columns b and c would be the columns you don't want all NAs

df %>% 
  filter_at(vars(b, c), any_vars(!is.na(.)))

df %>% 
  filter_at(vars(b, c), any_vars(complete.cases(.)))

# A tibble: 5 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     0     1     1
2    NA    NA     6
3     0     6     1
4     4     4    NA
5     0     0     0

In the newer version of dplyr, use if_any

df %>% 
      filter(if_any(c(b, c), complete.cases))
Arrivederci answered 30/7, 2018 at 18:38 Comment(2)
nice solution! turns out that any_vars() is superseded in favor of across(). However, I wasn't able to translate the solution here to rely on across(). Any hint?Chipper
@M-- Perhaps df %>% filter(if_any(c(b, c), complete.cases))Carabin
A
0
lines=
   'V1   V2   V3   V4  
    A    10   20   NA   
    B    NA   NA   NA   
    C     5   20   3     
    D    15   20   4    
    E    NA   10   5'

df = read.table(textConnection(lines), header = T)

df[is.na(df)] = 'X'


attach(df)

x = subset(df, V2 == 'X'   &   V3 == 'X'   &   V4 == 'X')
df_new = df[-as.numeric(row.names(x)),]
df_new

#  V1 V2 V3 V4
#1  A 10 20  X
#3  C  5 20  3
#4  D 15 20  4
#5  E  X 10  5

detach(df)
Autocrat answered 30/7, 2018 at 15:37 Comment(1)
But this function will delete cases that have ANY missing values in columns 3 & 4. As outlined above, I need to delete only cases that have ALL missing values in columns 3 & 4. In other words: -If there is a missing value in column 3, but not in column 4, the case shouldn't be deleted. - If there is a missing value in column 4, but not in column 3, the case shouldn't be deleted. - Only if there is a missing value BOTH in column 3 and column 4 should the case be deleted. - It's not important at all if there are missing values in columns 1 & 2.Harslet
S
0

I don't know that it's any faster than your function, but maybe you could use !any and is.na for each row of your data frame. With this example data:

set.seed(1234)
x = do.call(cbind, lapply(1:9, function(x) runif(10)))
x[sample(length(x), size = 70)] <- NA
x <- data.frame(x)

> x
     X1 X2   X3   X4   X5   X6   X7   X8  X9
1  0.11 NA   NA 0.46 0.55 0.07   NA   NA  NA
2  0.62 NA   NA   NA   NA   NA 0.04   NA  NA
3    NA NA   NA 0.30   NA   NA   NA 0.01  NA
4  0.62 NA 0.04 0.51   NA   NA   NA   NA  NA
5  0.86 NA   NA 0.18   NA   NA   NA   NA 0.2
6  0.64 NA   NA   NA   NA 0.50   NA 0.52  NA
7    NA NA   NA   NA 0.68   NA   NA   NA  NA
8    NA NA   NA   NA   NA   NA   NA   NA  NA
9    NA NA   NA   NA   NA 0.17   NA   NA  NA
10   NA NA 0.05   NA   NA   NA   NA   NA  NA

Looks like the 4th, 8th, and 10th rows should be dropped. So, you can use apply to iterate over each row to see if the condition is satisfied- any row where with any values other than NA in the 5th to 9th column will return TRUE, so you can use that as an indexer for your data frame.

keep.rows <- apply(x[, 5:9], 1, FUN = function(row){
  any(!is.na(row))
})

> x[keep.rows, ]
    X1 X2 X3   X4   X5   X6   X7   X8  X9
1 0.11 NA NA 0.46 0.55 0.07   NA   NA  NA
2 0.62 NA NA   NA   NA   NA 0.04   NA  NA
3   NA NA NA 0.30   NA   NA   NA 0.01  NA
5 0.86 NA NA 0.18   NA   NA   NA   NA 0.2
6 0.64 NA NA   NA   NA 0.50   NA 0.52  NA
7   NA NA NA   NA 0.68   NA   NA   NA  NA
9   NA NA NA   NA   NA 0.17   NA   NA  NA

Again, not sure that it's faster than your function but... maybe?

Sweeny answered 30/7, 2018 at 17:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.