I have a data frame where some of the columns contain NA values.
How can I remove columns where all rows contain NA values?
I have a data frame where some of the columns contain NA values.
How can I remove columns where all rows contain NA values?
Try this:
df <- df[,colSums(is.na(df))<nrow(df)]
df[colSums(!is.na(df)) > 0]
which will return a data.frame
also in case only 1 column is left and uses df
only on 2 instead of 3 positions. (Taken from a currently delete post.) –
Mastin drop = FALSE
to avoid dropping dimensions: df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]
–
Paleontology df[colSums(!is.na(df)) > 0]
compared to df[, colSums(is.na(df)) < nrow(df), drop = FALSE]
. If you look in the source in Filter
it also uses df[ind]
and not df[,ind, drop = FALSE]
. –
Mastin data.frame
as it is a common pitfall (see @gaspar's comment). Cheers and thanks for your contribution! –
Paleontology The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df)
, which will be an object the same size as df
.
Here are two approaches that are more memory and time efficient
An approach using Filter
Filter(function(x)!all(is.na(x)), df)
and an approach using data.table (for general time and memory efficiency)
library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]
big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)
system.time({df1 <- bd[,colSums(is.na(bd) < nrow(bd))]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user system elapsed
## 0.26 0.03 0.29
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user system elapsed
## 0.14 0.03 0.18
data.frame
, though. There's nothing here that really needs data.table
. The key is the lapply
, which avoids the copy of the whole object done by is.na(df)
. +10 for pointing that out. –
Saltpeter bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))]
–
Emera ,
after function(x)
- thanks for the example btw –
Unholy You can now use select
with the where
selection helper. select_if
is superceded, but still functional as of dplyr 1.0.2. (thanks to @mcstrother for bringing this to attention).
library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))
> temp
x y z
1 1 1 NA
2 2 2 NA
3 3 NA NA
4 4 4 NA
5 5 5 NA
> temp %>% select(where(not_all_na))
x y
1 1 1
2 2 2
3 3 NA
4 4 4
5 5 5
> temp %>% select(where(not_any_na))
x
1 1
2 2
3 3
4 4
5 5
dplyr
now has a select_if
verb that may be helpful here:
> temp
x y z
1 1 1 NA
2 2 2 NA
3 3 NA NA
4 4 4 NA
5 5 5 NA
> temp %>% select_if(not_all_na)
x y
1 1 1
2 2 2
3 3 NA
4 4 4
5 5 5
> temp %>% select_if(not_any_na)
x
1 1
2 2
3 3
4 4
5 5
dplyr
solution. Was not disappointed. Thanks! –
Countercurrent select_if
is now superseded in dplyr, so the last two lines would be temp %>% select(where(not_all_na))
in the most recent syntax -- although select_if
still works for now as of dplyr 1.0.2. Also temp %>% select(where(~!all(is.na(.x))))
works if you don't feel like defining the function on a separate line. –
Zincate not_any_na
is not found for me. where does this come from? I have dplyr
loaded..... –
Haploid Late to the game but you can also use the janitor
package. This function will remove columns which are all NA, and can be changed to remove rows that are all NA as well.
df <- janitor::remove_empty(df, which = "cols")
Another options with purrr
package:
library(dplyr)
df <- data.frame(a = NA,
b = seq(1:5),
c = c(rep(1, 4), NA))
df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))
purrr::discard(\(x) all(is.na(x)))
. –
Resistencia Another way would be to use the apply()
function.
If you have the data.frame
df <- data.frame (var1 = c(1:7,NA),
var2 = c(1,2,1,3,4,NA,NA,9),
var3 = c(NA)
)
then you can use apply()
to see which columns fulfill your condition and so you can simply do the same subsetting as in the answer by Musa, only with an apply
approach.
> !apply (is.na(df), 2, all)
var1 var2 var3
TRUE TRUE FALSE
> df[, !apply(is.na(df), 2, all)]
var1 var2
1 1 1
2 2 2
3 3 1
4 4 3
5 5 4
6 6 NA
7 7 NA
8 NA 9
df[sapply(df, function(x) all(is.na(x)))] <- NULL
An old question, but I think we can update @mnel's nice answer with a simpler data.table solution:
DT[, .SD, .SDcols = \(x) !all(is.na(x))]
(I'm using the new \(x)
lambda function syntax available in R>=4.1, but really the key thing is to pass the logical subsetting through .SDcols
.
Speed is equivalent.
microbenchmark::microbenchmark(
which_unlist = DT[, which(unlist(lapply(DT, \(x) !all(is.na(x))))), with=FALSE],
sdcols = DT[, .SD, .SDcols = \(x) !all(is.na(x))],
times = 2
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776 2 a
#> sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621 2 a
You can use Janitor package remove_empty
library(janitor)
df %>%
remove_empty(c("rows", "cols")) #select either row or cols or both
Also, Another dplyr approach
library(dplyr)
df %>% select_if(~all(!is.na(.)))
OR
df %>% select_if(colSums(!is.na(.)) == nrow(df))
this is also useful if you want to only exclude / keep column with certain number of missing values e.g.
df %>% select_if(colSums(!is.na(.))>500)
I hope this may also help. It could be made into a single command, but I found it easier for me to read by dividing it in two commands. I made a function with the following instruction and worked lightning fast.
naColsRemoval = function (DataTable) {
na.cols = DataTable [ , .( which ( apply ( is.na ( .SD ) , 2 , all ) ) )]
DataTable [ , unlist (na.cols) := NULL , with = F]
}
.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as
A handy base R
option could be colMeans()
:
df[, colMeans(is.na(df)) != 1]
From my experience of having trouble applying previous answers, I have found that I needed to modify their approach in order to achieve what the question here is:
How to get rid of columns where for ALL rows the value is NA?
First note that my solution will only work if you do not have duplicate columns (that issue is dealt with here (on stack overflow)
Second, it uses dplyr
.
Instead of
df <- df %>% select_if(~all(!is.na(.)))
I find that what works is
df <- df %>% select_if(~!all(is.na(.)))
The point is that the "not" symbol "!" needs to be on the outside of the universal quantifier. I.e. the select_if
operator acts on columns. In this case, it selects only those that do not satisfy the criterion
every element is equal to "NA"
library(dplyr)
# create a sample data frame
df <- data.frame(x = c(1, 2, NA, 4),
y = c(NA, NA, NA, NA),
z = c(6, 7, NA, 9))
# remove columns with all NAs
df <- df %>%
select_if(~!all(is.na(.)))
janitor::remove_constant()
does this very nicely.
© 2022 - 2025 — McMap. All rights reserved.