Remove columns from dataframe where ALL values are NA

Asked 15/4, 2010 at 8:59 Answered 9/5, 2023 at 9:31

210

I have a data frame where some of the columns contain NA values.

How can I remove columns where all rows contain NA values?

Schenck answered 15/4, 2010 at 8:59 Comment(0)

209

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]

Aircraft answered 15/4, 2010 at 9:12 Comment(9)

This creates an object the size of the old object which is a problem with memory on large objects. Better to use a function to reduce the size. The answer bellow using Filter or using data.table will help your memory usage. – Tenantry 9/12, 2015 at 15:21

This does not appear to work with non-numeric columns. – Discriminator 1/2, 2017 at 23:17

It changes column name if they are duplicated – Neoclassicism 26/6, 2018 at 16:42

To do this with non-numeric columns, @mnel's solution using Filter() is a good one. A benchmark of multiple approaches can be found in this post – Gout 24/11, 2020 at 17:47

Doesn't seem to work with single-row data frames. – Rickyrico 26/5, 2022 at 9:32

An alternative might be df[colSums(!is.na(df)) > 0] which will return a data.frame also in case only 1 column is left and uses df only on 2 instead of 3 positions. (Taken from a currently delete post.) – Mastin 10/5, 2023 at 9:38

@Mastin another option is to set drop = FALSE to avoid dropping dimensions: df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE] – Paleontology 10/5, 2023 at 9:42

@Paleontology Yes but the command is longer and df is used on 3 instead of 2 positions. So I would prefer using df[colSums(!is.na(df)) > 0] compared to df[, colSums(is.na(df)) < nrow(df), drop = FALSE]. If you look in the source in Filter it also uses df[ind] and not df[,ind, drop = FALSE]. – Mastin 10/5, 2023 at 9:49

@Mastin I wasn't trying to argue that it is the better option. I just wanted to mention the "built-in" way to keep working with a data.frame as it is a common pitfall (see @gaspar's comment). Cheers and thanks for your contribution! – Paleontology 10/5, 2023 at 9:58

131

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory and time efficient

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]

examples using large data (30 columns, 1e6 rows)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[,colSums(is.na(bd) < nrow(bd))]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user  system elapsed 
## 0.14    0.03    0.18

Emera answered 27/9, 2012 at 5:36 Comment(5)

Very nice. You could do the same with data.frame, though. There's nothing here that really needs data.table. The key is the lapply, which avoids the copy of the whole object done by is.na(df). +10 for pointing that out. – Saltpeter 27/9, 2012 at 9:59

How would you do it with a data.frame? @matt-dowle – Kino 22/5, 2014 at 15:52

@s_a, bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))] – Emera 22/5, 2014 at 22:55

@Emera I think you need to remove the , after function(x) - thanks for the example btw – Unholy 22/9, 2014 at 8:56

Can you do it faster with := or with a set() ? – Rudderhead 18/7, 2016 at 13:50

104

Update

You can now use select with the where selection helper. select_if is superceded, but still functional as of dplyr 1.0.2. (thanks to @mcstrother for bringing this to attention).

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5

Mimesis answered 14/5, 2018 at 16:40 Comment(6)

Came here looking for the dplyr solution. Was not disappointed. Thanks! – Countercurrent 1/8, 2019 at 16:14

I found this had the issue that it would also delete variables with most but not all values as missing – Demagnetize 21/5, 2020 at 13:26

select_if is now superseded in dplyr, so the last two lines would be temp %>% select(where(not_all_na)) in the most recent syntax -- although select_if still works for now as of dplyr 1.0.2. Also temp %>% select(where(~!all(is.na(.x)))) works if you don't feel like defining the function on a separate line. – Zincate 31/12, 2020 at 21:50

@Zincate thank you - that is a very helpful update to my answer. If you'd like to answer it yourself I'll happily roll back the edits. – Mimesis 6/1, 2021 at 19:8

not_any_na is not found for me. where does this come from? I have dplyr loaded..... – Haploid 3/11, 2021 at 15:2

@SkyScraper it's a function defined in the code provided – Mimesis 4/11, 2021 at 0:59

Late to the game but you can also use the janitor package. This function will remove columns which are all NA, and can be changed to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")

Zusman answered 14/5, 2019 at 21:48 Comment(0)

Another options with purrr package:

library(dplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))

Attested answered 26/1, 2020 at 6:46 Comment(1)

purrr version 1.0.2 recommends using a formula (~) only if you require backward compatibility with older versions of R. A variant with an anonymous function looks like purrr::discard(\(x) all(is.na(x))). – Resistencia 4/4, 2024 at 13:27

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame (var1 = c(1:7,NA),
                  var2 = c(1,2,1,3,4,NA,NA,9),
                  var3 = c(NA)
                  )

then you can use apply() to see which columns fulfill your condition and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

Lifetime answered 15/4, 2010 at 12:36 Comment(1)

I expected this to be quicker, as the colSum() solution seemed to be doing more work. But on my test set (213 obs. of 1614 variables before, vs. 1377 variables afterwards) it takes exactly 3 times longer. (But +1 for an interesting approach.) – Barclay 17/2, 2012 at 12:1

df[sapply(df, function(x) all(is.na(x)))] <- NULL

Ophthalmoscope answered 13/4, 2017 at 19:53 Comment(0)

An old question, but I think we can update @mnel's nice answer with a simpler data.table solution:

DT[, .SD, .SDcols = \(x) !all(is.na(x))]

(I'm using the new \(x) lambda function syntax available in R>=4.1, but really the key thing is to pass the logical subsetting through .SDcols.

Speed is equivalent.

microbenchmark::microbenchmark(
  which_unlist = DT[, which(unlist(lapply(DT, \(x) !all(is.na(x))))), with=FALSE],
  sdcols       = DT[, .SD, .SDcols = \(x) !all(is.na(x))],
  times = 2
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>  which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776     2   a
#>        sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621     2   a

Aulos answered 5/11, 2021 at 19:57 Comment(0)

You can use Janitor package remove_empty

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) #select either row or cols or both

Also, Another dplyr approach

 library(dplyr) 
 df %>% select_if(~all(!is.na(.)))

df %>% select_if(colSums(!is.na(.)) == nrow(df))

this is also useful if you want to only exclude / keep column with certain number of missing values e.g.

 df %>% select_if(colSums(!is.na(.))>500)

Mastoidectomy answered 2/6, 2020 at 2:26 Comment(0)

I hope this may also help. It could be made into a single command, but I found it easier for me to read by dividing it in two commands. I made a function with the following instruction and worked lightning fast.

naColsRemoval = function (DataTable) {
     na.cols = DataTable [ , .( which ( apply ( is.na ( .SD ) , 2 , all ) ) )]
     DataTable [ , unlist (na.cols) := NULL , with = F]
     }

.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as

Prankster answered 21/7, 2015 at 12:57 Comment(0)

A handy base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]

Jari answered 26/3, 2020 at 12:4 Comment(0)

From my experience of having trouble applying previous answers, I have found that I needed to modify their approach in order to achieve what the question here is:

How to get rid of columns where for ALL rows the value is NA?

First note that my solution will only work if you do not have duplicate columns (that issue is dealt with here (on stack overflow)

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the "not" symbol "!" needs to be on the outside of the universal quantifier. I.e. the select_if operator acts on columns. In this case, it selects only those that do not satisfy the criterion

every element is equal to "NA"

Gibb answered 20/8, 2020 at 2:29 Comment(0)

library(dplyr)

# create a sample data frame
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(NA, NA, NA, NA),
                 z = c(6, 7, NA, 9))

# remove columns with all NAs
df <- df %>%
  select_if(~!all(is.na(.)))

Tullis answered 9/5, 2023 at 9:31 Comment(0)

janitor::remove_constant()

does this very nicely.

Bellows answered 13/12, 2020 at 15:5 Comment(1)

janitor::remove_empty() would be more appropriate here. ?remove_empty = "Remove empty rows and/or columns from a data.frame or matrix" – Foreconscious 14/8, 2021 at 1:29

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

examples using large data (30 columns, 1e6 rows)

Update

Old Answer

Recommended topics

Hot tags