Elegant way to report missing values in a data.frame

Asked 29/11, 2011 at 20:23 Answered 24/5, 2024 at 0:32

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

for (Var in names(airquality)) {
    missing <- sum(is.na(airquality[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

Imine answered 29/11, 2011 at 20:23 Comment(5)

@kohske: that was my first thought, but the results are a table of characters and you would have to parse out the number of NAs. – Pillory 29/11, 2011 at 20:33

I'm reverting your question, since you posted an answer. If you want to comment on an answer, please do so as a comment to that answer. If questions also contains answers, this becomes very confusing. – Moonlight 29/11, 2011 at 20:48

@Andrie: I disagree with your edit, as a key issue I am facing is reporting only variables with missing values. Furthermore, your rollback removed a change I made to the code. I've edited my question to include this information, and added my modified version of Josh's code to a comment. – Imine 29/11, 2011 at 20:57

@Imine Your new edit looks fine to me. I'm not averse to adding additional data / requests in a question once it's live, by the way, if this clarifies the question. – Moonlight 29/11, 2011 at 21:1

There are half a million ways to do this, see CRAN Task View - MissingData – Magnetostriction 20/5, 2019 at 16:40

168

Just use sapply

> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

You could also use apply or colSums on the matrix created by is.na()

> apply(is.na(airquality),2,sum)
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
> colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

Pillory answered 29/11, 2011 at 20:29 Comment(4)

I modified your code slightly, to only report missing value: M <- sapply(airquality, function(x) sum(is.na(x))); M[M>0] – Imine 29/11, 2011 at 20:54

Hi @Joshua Ulrich , thanks a lot for your succinct code. I would like to add a column in the data frame which would show the percentage of na values . Can you please provide some help regarding this ? – Haifa 17/6, 2017 at 6:25

@Imine I use a version of your suggestion for checking that required fields have a value: M <- colSums(is.na(airquality)); M[M <= 0] – Monotonous 3/10, 2017 at 20:12

@Joshua adding an option for %s would be ace too! – Obstruent 30/5, 2019 at 1:3

My new favourite for (not too wide) data are methods from excellent naniar package. Not only you get frequencies but also patterns of missingness:

library(naniar)
library(UpSetR)

riskfactors %>%
  as_shadow_upset() %>%
  upset()

It's often useful to see where the missings are in relation to non missing which can be achieved by plotting scatter plot with missings:

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

Or for categorical variables:

gg_miss_fct(x = riskfactors, fct = marital)

These examples are from package vignette that lists other interesting visualizations.

Obstruent answered 2/7, 2018 at 7:9 Comment(1)

Thanks for posting this! There is now a dedicated gg_miss_upset() function in the latest release, which will be submitted to CRAN once they return from holiday.naniar.njtierney.com/reference/gg_miss_upset.html – Amatol 10/9, 2018 at 6:41

We can use map_df with purrr.

library(mice)
library(purrr)

# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R  Wind  Temp Month   Day
# <int>   <int> <int> <int> <int> <int>
# 1    37       7     0     0     0     0

Windpipe answered 27/1, 2017 at 2:58 Comment(5)

What's the advantage of map_df over sapply? – Imine 27/1, 2017 at 15:8

@Imine I think that there is no big difference, but Hadley said not to use sapply() inside a function. See Exceptions and debugging · Advanced R. adv-r.had.co.nz/Exceptions-Debugging.html. – Windpipe 27/1, 2017 at 17:59

for lazy people like me, you can write the above code in shorter purrr syntax for functions (~) so it looks like this: map_df( air quality, ~sum(is.na(.) ) – Hopper 29/1, 2019 at 9:37

@Imine the advantage of map_df over sapply is only when the result has many rows as the output format of map_df is always a tibble. – Hopper 29/1, 2019 at 9:40

@Zach: it's better to use vapply vs sapply in functions because vapply gives you a known result structure (that you specify). sapply may return an array or a list, depending on the function output. A disadvantage of map_df is that you give it a data.frame as input, and it returns a data.frame subclass, not a data.frame. There's no guarantee that tibbles will behave as data.frames do in all necessary cases in the future. – Pillory 30/5, 2019 at 10:9

summary(airquality)

already gives you this information

The VIM packages also offers some nice missing data plot for data.frame

library("VIM")
aggr(airquality)

Reyesreykjavik answered 10/11, 2016 at 20:8 Comment(2)

Can the VIM package report which specific observations have missing data? – Monotonous 17/11, 2016 at 18:50

don't think so.. but you can get this quite easy (you would have to replace airquality with your own dataframe): res <- airquality[rowSums(is.na(airquality)) > 0,] – Reyesreykjavik 17/11, 2016 at 21:17

Another graphical alternative - plot_missing function from excellent DataExplorer package:

Docs also points out to the fact that you can save this results for additional analysis with missing_data <- plot_missing(data).

Obstruent answered 13/2, 2018 at 2:12 Comment(2)

The plot_missing() function in the DataExplorer package is now PlotMissing(). – Bandylegged 6/3, 2018 at 22:59

@Bandylegged PlotMissing() is deprecated. Please use plot_missing() instead. See #49 for more details. – Boustrophedon 2/8, 2018 at 13:17

More succinct-: sum(is.na(x[1]))

That is

x[1] Look at the first column
is.na() true if it's NA
sum() TRUE is 1, FALSE is 0

Graphite answered 11/4, 2015 at 13:34 Comment(1)

this doesn't answer the original question, which is to find the number of NAs for all columns in the data – Hungerford 11/4, 2015 at 13:37

Another function that would help you look at missing data would be df_status from funModeling library

library(funModeling)

iris.2 is the iris dataset with some added NAs.You can replace this with your dataset.

df_status(iris.2)

This will give you the number and percentage of NAs in each column.

Yingling answered 6/2, 2017 at 6:28 Comment(0)

For one more graphical solution, visdat package offers vis_miss.

library(visdat)
vis_miss(airquality)

Very similar to Amelia output with a small difference of giving %s on missings out of the box.

Obstruent answered 3/12, 2017 at 23:37 Comment(0)

I think the Amelia library does a nice job in handling missing data also includes a map for visualizing the missing rows.

install.packages("Amelia")
library(Amelia)
missmap(airquality)

You can also run the following code will return the logic values of na

row.has.na <- apply(training, 1, function(x){any(is.na(x))})

Ulrikeulster answered 19/4, 2017 at 9:44 Comment(0)

Another graphical and interactive way is to use is.na10 function from heatmaply library:

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1, 
          showticklabels = c(T,F),
            k_col =3, k_row = 3,
            margins = c(55, 30), 
            colors = c("grey80", "grey20"))

Probably won't work well with large datasets..

Obstruent answered 27/8, 2018 at 4:5 Comment(0)

A dplyr solution to get the count could be:

summarise_all(df, ~sum(is.na(.)))

Or to get a percentage:

summarise_all(df, ~(sum(is_missing(.) / nrow(df))))

Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:

is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}

# sample ugly data
df <- data.frame(a = c(NA, '1', '  ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))

# counts:
> summarise_all(df, ~sum(is_missing(.)))
  a b c d e
1 3 1 3 0 2

# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
     a    b    c d   e
1 0.75 0.25 0.75 0 0.5

Narcotism answered 14/3, 2020 at 12:45 Comment(0)

If you want to do it for particular column, then you can also use this

length(which(is.na(airquality[1])==T))

Endgame answered 9/6, 2014 at 6:24 Comment(1)

You don't need to compare a logical vector to T. You can also count the number of TRUE elements in a logical vector by summing it. – Anabasis 1/12, 2014 at 18:3

ExPanDaR’s package function prepare_missing_values_graph can be used to explore panel data:

Obstruent answered 5/4, 2019 at 4:11 Comment(0)

For piping you could write:

# Counts 
df %>% is.na() %>% colSums()

# % of missing rounded to 2 decimals 
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))

Neoterize answered 10/12, 2021 at 9:46 Comment(0)

summary(airquality) shows NAs by default, unlike table() for vectors which requires useNA = "ifany". (Bug: don't try table() on a dataframe or you may get a memory leak.)

My new favorite way to summarize dataframe values, with n_missing and complete_rate for all column types, is with skimr:

> skim(airquality)
── Data Summary ────────────────────────
                           Values    
Name                       airquality
Number of rows             153       
Number of columns          6         
_______________________              
Column type frequency:               
  numeric                  6         
________________________             
Group variables            None      

── Variable type: numeric ────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate   mean    sd   p0   p25   p50   p75  p100 hist 
1 Ozone                37         0.758  42.1  33.0   1    18    31.5  63.2 168   ▇▃▂▁▁
2 Solar.R               7         0.954 186.   90.1   7   116.  205   259.  334   ▅▃▅▇▅
3 Wind                  0         1       9.96  3.52  1.7   7.4   9.7  11.5  20.7 ▂▇▇▃▁
4 Temp                  0         1      77.9   9.47 56    72    79    85    97   ▂▃▇▇▃
5 Month                 0         1       6.99  1.42  5     6     7     8     9   ▇▇▇▇▇
6 Day                   0         1      15.8   8.86  1     8    16    23    31   ▇▇▇▇▆

Aside from the printed summary, you can also get summary statistics as a dataframe returned from skim(). You can also customize the statistics reported with sfl().

Intern answered 24/5, 2024 at 0:32 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags