Find NA values after using addNA()

Asked 25/6, 2013 at 15:58 Answered 25/2, 2024 at 2:19

I have a data frame with a bunch of categorical variables. Some of them contain NAs, and I use the addNA function to convert them to an explicit factor level. My problem is that when I then try to treat them as NAs, they don't seem to register.

Here's my example data set and attempts to 'find' NA's:

df1 <- data.frame(id = 1:200, y =rbinom(200, 1, .5),
                  var1 = factor(rep(c('abc','def','ghi','jkl'),50)))
df1$var2 <- factor(rep(c('ab c','ghi','jkl','def'),50))
df1$var3 <- factor(rep(c('abc','ghi','nop','xyz'),50))

df1[df1$var1 == 'abc','var1'] <- NA

df1$var1 <- addNA(df1$var1)

df1$isNaCol <- ifelse(df1$var1 == NA, 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(is.na(df1$var1), 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(df1$var1 == 'NA', 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(df1$var1 == '<NA>', 1, 0);summary(df1$isNaCol)

Also when I type ??addNA I don't get any matches. Is this a gray-market function or something? Any suggestions would be appreciated.

Dehorn answered 25/6, 2013 at 15:58 Comment(2)

FWIW, any(is.na(as.character(df1$var1))) returns TRUE... But I'm not sure why it isn't working directly... x=factor('a'); x[1]=NA; addNA(x); is.na(x) returns TRUE as it should... – Cordalia 25/6, 2013 at 16:7

@Cordalia when you set x[1]=NA, you're setting the level index to NA, not the value. See as.numeric(x) vs as.numeric(df1$var1). is.na looks at the level indexes. – Olnay 25/6, 2013 at 16:15

Testing equality to NA with the usual comparison operators always yields NA; you want is.na. Additionally, calling is.na on a factor tests each level index (not the value associated with that index), so you want to convert the factor to a character vector first.

df1$isNaCol <- ifelse(is.na(as.character(df1$var1)), 1, 0);summary(df1$isNaCol)
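As a minimal sketch of the three behaviours described above, the following uses a small made-up factor (f is a hypothetical name, not part of the OP's data) in which NA has already been turned into a level with addNA():

```r
# Hypothetical toy factor (not the OP's data); addNA() makes NA an explicit level
f <- addNA(factor(c("abc", NA, "def", NA)))

# Comparing with NA via == gives NA for every element
f == NA

# is.na() inspects the integer codes, none of which is missing: all FALSE
is.na(f)

# Converting to character first exposes the NA values: FALSE TRUE FALSE TRUE
is.na(as.character(f))
```

The conversion works because as.character() looks each code up in levels(f), and the label of the added level is itself NA.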

Olnay answered 25/6, 2013 at 16:10 Comment(1)

@GavinSimpson see my comment on the question. Something is different about our setups. I don't get the output in your answer. – Olnay 25/6, 2013 at 16:26

Note that this is done with the OP's data before the call to addNA().

It is instructive to see what addNA() does with this data.

> head(df1$var1)
[1] <NA> def  ghi  jkl  <NA> def 
Levels: abc def ghi jkl
> levels(df1$var1)
[1] "abc" "def" "ghi" "jkl"
> head(addNA(df1$var1))
[1] <NA> def  ghi  jkl  <NA> def 
Levels: abc def ghi jkl <NA>
> levels(addNA(df1$var1))
[1] "abc" "def" "ghi" "jkl" NA

addNA alters the levels of the factor so that missingness (NA) becomes a level of its own. By default R leaves NA out of the levels because the level an NA value would take is, of course, unknown. addNA also strips out the NA information: in a sense the value is no longer unknown but belongs to a category "missing".

To look at the help for addNA use ?addNA.

If we look at the definition of addNA we see that all it is doing is altering the levels of the factor, not changing the data any:

> addNA
function (x, ifany = FALSE) 
{
    if (!is.factor(x)) 
        x <- factor(x)
    if (ifany & !any(is.na(x))) 
        return(x)
    ll <- levels(x)
    if (!any(is.na(ll))) 
        ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)
}

Note that it doesn't otherwise change the data - the NA are still there in the factor. We can replicate most of the behaviour of addNA via:

with(df1, factor(var1, levels = c(levels(var1), NA), exclude = NULL))

> head(with(df1, factor(var1, levels = c(levels(var1), NA), exclude = NULL)))
[1] <NA> def  ghi  jkl  <NA> def 
Levels: abc def ghi jkl <NA>

However, because NA is now a level, is.na() no longer flags those entries as missing. That explains why your second comparison (the one using is.na()) does not work.
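One way to see this is to look at the integer codes stored inside the factor before and after addNA(), as suggested in the comments. This is only a sketch on a made-up factor (f and g are hypothetical names):

```r
# Hypothetical toy factor (not the OP's data)
f <- factor(c("abc", NA, "def", NA))
g <- addNA(f)

# Before addNA(): missing entries have a missing code
as.integer(f)   # 1 NA 2 NA

# After addNA(): the same entries point at the new third level
as.integer(g)   # 1 3 2 3

# Looking the codes up in is.na(levels(g)) recovers the missing positions
is.na(levels(g))[as.integer(g)]   # FALSE TRUE FALSE TRUE
```

Since is.na() on a factor only checks the codes, it has nothing to report once every code points at a real level, even when that level is labelled NA.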

The only nicety you get from addNA is that it doesn't add NA as a level if it already exists as one. Also, via the ifany you can stop it adding NA as a level if there are no NAs in the data.

Where you are going wrong is in attempting to compare an NA with something using the usual comparison operators (your second example excepted). If we don't know what value an NA observation takes, how can we compare it with something? Well, we can't, other than by checking the internal representation of NA. This is what the is.na() function does:

> with(df1, head(is.na(var1), 10))
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

Hence I would do (without using addNA at all)

df1 <- transform(df1, isNaCol = is.na(var1))

> head(df1)
  id y var1 var2 var3 isNaCol
1  1 1 <NA> ab c  abc    TRUE
2  2 0  def  ghi  ghi   FALSE
3  3 0  ghi  jkl  nop   FALSE
4  4 0  jkl  def  xyz   FALSE
5  5 0 <NA> ab c  abc    TRUE
6  6 1  def  ghi  ghi   FALSE

If you want that as a 1, 0, variable, just add as.numeric() as in

df1 <- transform(df1, isNaCol = as.numeric(is.na(var1)))

Where I think you are really going wrong is in wanting to attach an NA level to the factor. I see addNA() as a convenience function for use in things like table(), and even that has arguments to not need the prior use of addNA(), e.g.:

> with(df1, table(var1, useNA = "ifany"))
var1
 abc  def  ghi  jkl <NA> 
   0   50   50   50   50

Sestet answered 25/6, 2013 at 16:18 Comment(11)

I get different output for with(df1, head(is.na(var1), 10)) (all FALSE) – Olnay 25/6, 2013 at 16:22

@MatthewPlourde Not here. I just rebuilt from the OP's data to check. And restarted R to check. Also this is what is.na() has been doing for as long as I can remember. There is no labels vs levels issue here - the value is NA and is.na() informs you of that. – Sestet 25/6, 2013 at 16:25

hmm...what version of R (3.0.1 here)? I just restarted vanilla and get the same output as before. – Olnay 25/6, 2013 at 16:29

@MatthewPlourde R version 3.0.1 (2013-05-16) – Sestet 25/6, 2013 at 16:31

@GavingSimpson well that makes a lot of sense. I assume you tried it with a clean session? – Olnay 25/6, 2013 at 16:34

@MatthewPlourde Yes, restarted to check. Are you running this after addNA has been called? or before? I am not running the addNA step as that seems superfluous to what the OP wants. – Sestet 25/6, 2013 at 16:35

@MatthewPlourde And yes, I see now: Try is.na(addNA(df1$var1)) and you'll see the issue. addNA is getting rid of the NA information. – Sestet 25/6, 2013 at 16:36

after. It's not superfluous. It adds NA as a level, replacing the NA indexes with the index of the NA level. See as.numeric(df1$var1) before and after the addNA step. So if you want to have NA as a level, you'd need to convert to character before doing what OP wants. – Olnay 25/6, 2013 at 16:40

@MatthewPlourde Right, by superfluous I really don't see why addNA needs to come into this. By making NA a level you effectively said that it isn't NA but part of a category NA (or missing - not stated say). In which case is.na() is right, if confusing, when it states that no values are NA in the factor. I don't see the point in adding NA in a persistent manner here - perhaps this is just an artificial example or one small part of a larger analysis, but it seems to me that addNA is perhaps best used on the fly where you want to include NA as a level and hence counted. – Sestet 25/6, 2013 at 16:49

agreed. Well, if OP needs NA as a level or not, he has an answer for each case. – Olnay 25/6, 2013 at 17:6

@MatthewPlourde Indeed. The discussion is perhaps one of semantics. I feel you have answered the question as posed though, whereas I answer a different question, one which seems more natural to me, but perhaps not to the OP :-) – Sestet 25/6, 2013 at 17:10

Anything compared to NA is NA; this is why your first summary is all NA.

The addNA function changes any NA observations in your factor to a new level. This level is then given the label NA (of character mode). The underlying variable itself no longer has any NAs. This is why your second summary is all 0.

To see how many observations have the NA level, use what Matthew Plourde posted.

Demeter answered 25/6, 2013 at 16:16 Comment(0)

I'm amazed such a simple question doesn't have a simple answer. I ran into the same situation: I needed NA levels for a subset of my data pipeline. It turns out is.na() works on the levels but not on the factor variable itself, so my solution is based on that.

# create a factor variable with two levels and missing values
set.seed(1)
x <- factor(sample(c(0,1,NA), size = 10, replace = TRUE))

x
#[1] 0    <NA> 0    1    0    <NA> <NA> 1    1    <NA>
#Levels: 0 1

# is.na works...
is.na(x)
#[1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE

# add NA as a level
x <- addNA(x)
x
#[1] 0    <NA> 0    1    0    <NA> <NA> 1    1    <NA>
#Levels: 0 1 <NA>

# is.na doesn't work...
is.na(x)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# get the level that is NA
na_level <- which(is.na(levels(x))) # 3

# Same as if using is.na() before using addNA()
!x %in% (levels(x)[-na_level])
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
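One caveat with the %in% approach: if the factor has no NA level, na_level is integer(0), levels(x)[-na_level] is empty, and the expression returns TRUE for every element. A possible variant that covers both cases is sketched below; is_na_factor is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: TRUE where an element is a true NA or sits in the NA level
is_na_factor <- function(f) {
  stopifnot(is.factor(f))
  # is.na(f) catches missing codes; the lookup catches codes pointing at an NA level
  is.na(f) | is.na(levels(f))[as.integer(f)]
}

y <- factor(c(0, NA, 1))
is_na_factor(y)          # FALSE TRUE FALSE (no NA level, true NA)
is_na_factor(addNA(y))   # FALSE TRUE FALSE (NA is a level)
```

On factors this should give the same result as is.na(as.character(f)); the explicit version just makes the two kinds of missingness visible.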

Appose answered 25/2, 2024 at 2:19 Comment(0)
