How to find the percentage of NAs in a data.frame?

I am trying to find the percentage of NAs in each column as well as in the whole data frame.

The first method, which I have commented out, gives me zero, and the second method, which is not commented out, gives me a matrix. I am not sure what I am missing. Any hint is truly appreciated!

cp.2006<-read.csv(file="cp2006.csv",head=TRUE)

#countNAs <- function(x) { 
#  sum(is.na(x)) 
#} 
#total=0
#for (i in col(cp.2006)) {
#  total=countNAs(i)+total
#}
#print(total)
count<-apply(cp.2006, 1, function(x) sum(is.na(x)))
dims<-dim(cp.2006)
num<-dims[1]*dims[2]
NApercentage<-(count/num) * 100
print(NApercentage)
Longwise answered 11/5, 2014 at 19:47 Comment(0)
E
32
x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

For the whole dataframe:

sum(is.na(x))/prod(dim(x))

Or

mean(is.na(x))

For columns:

apply(x, 2, function(col)sum(is.na(col))/length(col))

Or

colMeans(is.na(x))
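
The expressions above return proportions between 0 and 1, while the question asks for percentages. As a minimal sketch on the same example data (the variable names `total_pct` and `col_pct` are only illustrative), multiplying by 100 gives the percentage form:

```r
# Example data from this answer: 3 of the 8 cells are NA
x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

# is.na() on a data frame returns a logical matrix, so mean() covers every cell
total_pct <- mean(is.na(x)) * 100     # 37.5

# colMeans() averages each column of that logical matrix
col_pct <- colMeans(is.na(x)) * 100   # x = 25, y = 50

print(total_pct)
print(col_pct)
```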
Entwistle answered 11/5, 2014 at 19:53 Comment(6)
Longwise: I was just working with is.na(X) and realized I don't even need apply, right? sum(is.na(cp.2006)) returns [1] 138.
Unwilling: or just mean(is.na(x))
Longwise: cols.NA <- apply(cp.2006, 2, function(col) sum(is.na(col)) / length(col)) * 100
Longwise: @fernando Why is the second argument to your apply call 2?
Tour: I noticed you edited to prod(dim(x)) after I posted my answer. Nice.
Shutt: Or, in pandas, df.isna().mean()
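
On the question about the second argument to apply: it is the MARGIN, where 1 applies the function to each row and 2 to each column. The sketch below, on the example data from the answer rather than the asker's cp2006.csv, shows the difference. It also suggests why the commented-out loop in the question likely prints zero: col() returns a matrix of column indices, which contains no NA.

```r
x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

# MARGIN = 1: one result per row (this is what the question's code computes)
row_na <- apply(x, 1, function(v) sum(is.na(v)))   # 1 1 1 0

# MARGIN = 2: one result per column
col_na <- apply(x, 2, function(v) sum(is.na(v)))   # x = 1, y = 2

# col(x) holds column index numbers, not the data, so it never contains NA
index_na <- sum(is.na(col(x)))                     # 0

print(row_na)
print(col_na)
print(index_na)
```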
O
8

For newer versions of dplyr, in which funs() is deprecated:

library(dplyr)
x %>% summarise_all(list(name = ~ sum(is.na(.)) / length(.)))
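
In dplyr 1.0.0 and later, the scoped verbs such as summarise_all() are superseded by across(). A minimal sketch of the same column-wise proportion, assuming dplyr >= 1.0.0 is installed (the name `na_prop` is only illustrative):

```r
library(dplyr)

x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

# mean() of a logical vector is the share of TRUE values,
# so this gives the proportion of NA in every column
na_prop <- x %>% summarise(across(everything(), ~ mean(is.na(.x))))

print(na_prop)
```

The result is a one-row data frame that keeps the original column names.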

Ogletree answered 31/5, 2019 at 22:51 Comment(0)
B
5

You could also use dplyr::summarize_all for the column-wise proportions.

library(dplyr)
x %>% summarize_all(funs(sum(is.na(.)) / length(.)))

Which will give

     x   y
1 0.25 0.5
Bissell answered 28/7, 2017 at 12:34 Comment(0)
A
3

If you are interested in the percentage of complete cases (rows without any NA):

Using the same example as above:

x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

Printing x gives:

   x  y
1  1 NA
2  2 NA
3 NA  4
4  3  5

Finding Complete cases:

complete.cases(x)

Output:

[1] FALSE FALSE FALSE  TRUE

Proportion of complete cases:

mean(complete.cases(x))

Output:

[1] 0.25

That means 25% of the rows in the data are complete: only the fourth row is complete, and all the others contain NA values.
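
As a complementary sketch on the same data, the number of NAs per row and the percentage of incomplete rows can be computed in a similar way (the variable names here are only illustrative):

```r
x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))

# Number of NA values in each row
na_per_row <- rowSums(is.na(x))                    # 1 1 1 0

# Percentage of rows that contain at least one NA
incomplete_pct <- mean(!complete.cases(x)) * 100   # 75

print(na_per_row)
print(incomplete_pct)
```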

Cheers!

Anchorage answered 17/3, 2018 at 8:32 Comment(0)
E
0

You can try this:

colMeans(is.na(dataframe_name))
Erato answered 1/8, 2020 at 16:57 Comment(0)
A
0

Try this:

sapply(data, function(y) round(sum(is.na(y)) / nrow(data) * 100, 2))
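
If both the count and the percentage are wanted side by side, one possible sketch wraps the same idea in a small function. Note that `na_summary` is a hypothetical helper name chosen for illustration, not part of base R or any package:

```r
# na_summary() is a hypothetical helper, not an existing R function.
# It returns one row per column with the NA count and NA percentage.
na_summary <- function(df) {
  na_count <- unname(colSums(is.na(df)))
  data.frame(
    column = names(df),
    na_count = as.integer(na_count),
    na_percent = round(na_count / nrow(df) * 100, 2)
  )
}

x <- data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5))
result <- na_summary(x)
print(result)
```

The sketch assumes the data frame has at least one row; with zero rows the percentage would be NaN.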
Azriel answered 27/9, 2020 at 11:34 Comment(0)
