Subset of rows containing NA (missing) values in a chosen column of a data frame
Asked Answered
M

8

131

We have a data frame from a CSV file. The data frame DF has columns that contain observed values and a column (VaR2) that contains the date at which a measurement has been taken. If the date was not recorded, the CSV file contains the value NA, for missing data.

Var1  Var2 
10    2010/01/01
20    NA
30    2010/03/01

We would like to use the subset command to define a new data frame new_DF such that it only contains rows that have an NA' value from the column (VaR2). In the example given, only Row 2 will be contained in the new DF.

The command

new_DF<-subset(DF, DF$Var2=="NA") 

does not work, the resulting data frame has no row entries.

If in the original CSV file the Value NA are exchanged with NULL, the same command produces the desired result:

new_DF <- subset(DF, DF$Var2=="NULL")

How can I get this method working, if for the character string the value NA is provided in the original CSV file?

Macdonald answered 2/11, 2011 at 12:53 Comment(0)
S
189

Never use =='NA' to test for missing values. Use is.na() instead. This should do it:

new_DF <- DF[rowSums(is.na(DF)) > 0,]

or in case you want to check a particular column, you can also use

new_DF <- DF[is.na(DF$Var),]

In case you have NA character values, first run

Df[Df=='NA'] <- NA

to replace them with missing values.

Statutable answered 2/11, 2011 at 13:2 Comment(3)
Thanks for your fast answer (this was quick)! Indeed, due to the csv-delivery of the data, the 'NA' are character values and your second statement might be very useful. Can you also clarify your first statement? The use of rowSums() is not clear for me, since I will only check a particular column (there are plenty of columns). If that particular column (in the example it would be column Var2) has there an 'NA' character string (I will replace it with your second statement), then I would like to choose the entire row to be part of the new data frame.Macdonald
@Macdonald : updated. THe point is to use is.na, I wrongly interpreted you wanted to check all variables.Statutable
should that be new_DF <- DF[is.na(DF$Var),], i.e. there appears to be an extra ( bracket after DF[ ?Scleroma
V
66

complete.cases gives TRUE when all values in a row are not NA

DF[!complete.cases(DF), ]
Vasos answered 3/11, 2017 at 9:35 Comment(1)
Great option if one prefers base R! @user3226167Inartistic
G
42

NA is a special value in R, do not mix up the NA value with the "NA" string. Depending on the way the data was imported, your "NA" and "NULL" cells may be of various type (the default behavior is to convert "NA" strings to NA values, and let "NULL" strings as is).

If using read.table() or read.csv(), you should consider the "na.strings" argument to do clean data import, and always work with real R NA values.

An example, working in both cases "NULL" and "NA" cells :

DF <- read.csv("file.csv", na.strings=c("NA", "NULL"))
new_DF <- subset(DF, is.na(DF$Var2))
Goddamn answered 2/11, 2011 at 13:32 Comment(4)
Thanks for your answer. If I understand it correctly the first statement would do the same as Df[Df=='NA']<-NA in the example of Joris? The (small) difference then would be that it is done in your statment direcly in the beginning, when the data frame is created (this is a very clean programming method and I therefore like it).Macdonald
Exactly. Joris suggested to replace "NA" strings by NA values manually, here i only suggest to use the "na.strings" feature of read.table() to achieve the same purpose.Goddamn
Joris' answer is actually the "preferred" way to accomplish this feat (if you are writing this in a script). See: #9860590Uralian
@Uralian : Two distinct ideas here, the topic you cite says "[" should be prefered on "subset", but we were talking about the "na.strings" argument in read.table(), my subset was here only to visualize the effects.Goddamn
S
28
new_data <- data %>% filter_all(any_vars(is.na(.))) 

This should create a new data frame (new_data) with only the missing values in it.

Works best to keep a track of values that you might later drop because they had some columns with missing observations (NA).

Sherylsheryle answered 27/7, 2018 at 12:1 Comment(0)
H
7

Try changing this:

new_DF<-dplyr::filter(DF,is.na(Var2)) 
Hanna answered 21/11, 2017 at 23:57 Comment(3)
Could you explain why this works, what this does etc.?Racquelracquet
new_DF<-dplyr::filter(DF,is.na(Var2)) it basically use the filter function of dplyr package and filter out any observation in Var2 column which satisfy the condition is.na ie they pick all the observation with NAHanna
More nicely expressed as DF %>% filter(is.na(Var2)) after library(dplyr).Equimolecular
H
7

Since dplyrs filter_all has been superseded

Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb.

and the usage of across() in filter() is deprecated, Ronak Pol's answer needs a small update. To find all rows with an NA anywhere, we could use

library(dplyr)

DF %>% 
  filter(if_any(everything(), is.na))

to get

# A tibble: 1 x 2
   Var1 Var2  
  <dbl> <date>
1    20 NA   
Heaume answered 13/4, 2022 at 22:57 Comment(0)
C
0

In addition to the main answer, if you want all rows with one or more NAs:

DF_rows_with_NAs = DF[rowSums(is.na(DF)) > 0, ]

DF_rows_with_NAs

Verification :

sum(is.na(DF_rows_with_NAs)) 

This should equal the total NAs of the whole DF.

If you want all columns with NAs:

DF_columns_with_NAs=DF[ ,colSums(is.na(DF)) > 0]

DF_columns_with_NAs

Verification:

sum(is.na(ts_columns_with_NAs))

This should equal the total NAs of the whole DF.

Chandless answered 6/2, 2024 at 4:10 Comment(0)
M
-1

Prints all the rows with NA data:

tmp <- data.frame(c(1,2,3),c(4,NA,5));
tmp[round(which(is.na(tmp))/ncol(tmp)),]
Magulac answered 29/5, 2016 at 6:28 Comment(1)
@ZheyuanLi If you don't like the answer, simply down-vote it. Editing the answer to recommend flagging is NOT the appropiate action. Leave a comment if you feel the need to.Pileup

© 2022 - 2025 — McMap. All rights reserved.