Specify different types of missing values (NAs)

I'm interested in specifying different types of missing values. I have data with several kinds of missing values, and I am trying to code these values as missing in R, but I am looking for a solution where I can still distinguish between them.

Say I have some data that looks like this,

set.seed(667)
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", "Refused",
                              "Blue", "Red", "Green"), 20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
                 g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE))
df
#                      a  b    f g
# 1              Unknown  2 0.78 M
# 2              Refused  2 0.87 M
# 3                  Red 77 0.82 Y
# 4                  Red 99 0.78 Y
# 5                Green 77 0.97 M
# 6                Green  3 0.99 K
# 7                  Red  3 0.99 Y
# 8                Green 88 0.84 C
# 9              Unknown 99 1.08 M
# 10             Refused 99 0.81 C
# 11                Blue  2 0.78 M
# 12               Green  2 0.87 M
# 13                Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15             Unknown 77 0.97 M
# 16             Refused  3 0.99 K
# 17                Blue  3 0.99 Y
# 18               Green 88 0.84 C
# 19             Refused 99 1.08 M
# 20                 Red 99 0.81 C

If I now make two tables my missing values ("Don't know/Not sure","Unknown","Refused" and 77, 88, 99) are included as regular data,

table(df$a,df$g)
#                     C K M Y
# Blue                0 0 1 2
# Don't know/Not sure 0 0 0 1
# Green               2 1 2 0
# Red                 1 0 0 3
# Refused             1 1 2 0
# Unknown             0 0 3 0

and

table(df$b,df$g)
#    C K M Y
# 2  0 0 4 0
# 3  0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2

I now recode the three factor levels "Don't know/Not sure","Unknown","Refused" into <NA>

is.na(df$a) <- df$a %in% c("Don't know/Not sure", "Unknown", "Refused")

and remove the empty levels

df$a <- factor(df$a) 

and the same is done with the numeric values 77, 88, and 99

is.na(df$b) <- df$b %in% c(77, 88, 99)

table(df$a, df$g, useNA = "always")       
#       C K M Y <NA>
# Blue  0 0 1 2    0
# Green 2 1 2 0    0
# Red   1 0 0 3    0
# <NA>  1 1 5 1    0

table(df$b,df$g, useNA = "always")
#      C K M Y <NA>
# 2    0 0 4 0    0
# 3    0 2 0 2    0
# <NA> 4 0 4 4    0

Now the missing categories are recoded into NA, but they are all lumped together. Is there a way in R to recode something as missing but retain the original values? I want R to treat "Don't know/Not sure","Unknown","Refused" and 77, 88, 99 as missing, but I still want to have the information in the variable.

Philanthropy answered 18/4, 2013 at 4:16 Comment(5)
How about adding another column to the df called isNA which will hold true if the value is missing? or isNA column can directly hold NA and 0. It depends on rest of your code.Phenix
That would probably work, but it's more of a workaround than a solution that would work seamlessly with the rest of my code, as you also point out. Would you care to demonstrate it in an example?Philanthropy
It is difficult to predict the effect on the rest of the code. Maybe you can write your own my.table that uses my.is.na, which returns TRUE for "Don't know/Not sure","Unknown","Refused"Phenix
It looks like you've provided us with summarized data. Do you have the data in a format that is a step before this one? If so it would just be a matter of factoring.Kimono
@BrandonBertelsen, thank you for your question (and your answer). The dummy data I've provided is quite close to how my real data looks. As I mentioned in my comment to @Maxim.K I could have been a bit more precise about the variable a, but aside from that the data I provided in the question is quite close to how my real data looks.Philanthropy
H
23

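To my knowledge, base R doesn't have a built-in way to distinguish different kinds of missingness. (editor: It does have typed NAs: NA_integer_, NA_real_, NA_complex_, and NA_character_; see ?base::NA. These differ only in storage type, though, so they cannot tag different reasons for a value being missing.)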
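As a quick illustration of the editor's note, here is a minimal base R sketch (the names typed_nas, types and all_missing are mine, not from the question). It suggests that the typed NA constants differ in storage type but look the same to is.na(), so on their own they can't record why a value is missing:

```r
# The five NA constants that base R provides, one per atomic storage type.
typed_nas <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# Each constant has a different storage type ...
types <- vapply(typed_nas, typeof, character(1))
print(types)

# ... but is.na() treats them all alike.
all_missing <- vapply(typed_nas, is.na, logical(1))
print(all_missing)
```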

One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", 
                              "Refused", "Blue", "Red", "Green"),
                            20, replace = TRUE), 
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, 
                            replace = TRUE), 
                 f = round(rnorm(n = 10, mean = .90, sd = .08), 
                           digits = 2), 
                 g = sample(c("C", "M", "Y", "K"), 10, 
                            replace = TRUE))
df2 <- df

Let's factor variable "a":

df2$a <- factor(df2$a, 
                levels = c("Blue", "Red", "Green", 
                           "Don't know/Not sure",
                           "Refused", "Unknown"),
                labels = c(1, 2, 3, 77, 88, 99))

Load the "memisc" library:

library(memisc)

Now, convert variables "a" and "b" to items in "memisc":

df2$a <- as.item(as.character(df2$a), 
                  labels = structure(c(1, 2, 3, 77, 88, 99),
                                     names = c("Blue", "Red", "Green", 
                                               "Don't know/Not sure",
                                               "Refused", "Unknown")),
                  missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b, 
                 labels = c(1, 2, 3, 77, 88, 99), 
                 missing.values = c(77, 88, 99))

By doing this, we have a new data type. Compare the following:

as.factor(df2$a)
#  [1] <NA>  <NA>  Red   Red   Green Green Red   Green <NA>  <NA>  Blue 
# [12] Green Blue  <NA>  <NA>  <NA>  Blue  Green <NA>  Red  
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
#  [1] *Unknown             *Refused             Red                 
#  [4] Red                  Green                Green               
#  [7] Red                  Green                *Unknown            
# [10] *Refused             Blue                 Green               
# [13] Blue                 *Don't know/Not sure *Unknown            
# [16] *Refused             Blue                 Green               
# [19] *Refused             Red                 
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown

We can use this information to create tables behaving the way you describe, while retaining all the original information.

table(as.factor(include.missings(df2$a)), df2$g)
#                       
#                        C K M Y
#   Blue                 0 0 1 2
#   Red                  1 0 0 3
#   Green                2 1 2 0
#   *Don't know/Not sure 0 0 0 1
#   *Refused             1 1 2 0
#   *Unknown             0 0 3 0
table(as.factor(df2$a), df2$g)
#        
#         C K M Y
#   Blue  0 0 1 2
#   Red   1 0 0 3
#   Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#        
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Red   1 0 0 3    0
#   Green 2 1 2 0    0
#   <NA>  1 1 5 1    0

The tables for the numeric column with missing data behave the same way.

table(as.factor(include.missings(df2$b)), df2$g)
#      
#       C K M Y
#   1   0 0 0 0
#   2   0 0 4 0
#   3   0 2 0 2
#   *77 0 0 2 2
#   *88 2 0 0 0
#   *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#       
#        C K M Y <NA>
#   1    0 0 0 0    0
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0

As a bonus, you get the facility to generate nice codebooks:

> codebook(df2$a)
========================================================================

   df2$a

------------------------------------------------------------------------

   Storage mode: character
   Measurement: nominal
   Missing values: 77, 88, 99

            Values and labels    N    Percent 

    1   'Blue'                   3   25.0 15.0
    2   'Red'                    4   33.3 20.0
    3   'Green'                  5   41.7 25.0
   77 M 'Don't know/Not sure'    1         5.0
   88 M 'Refused'                4        20.0
   99 M 'Unknown'                3        15.0

However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.

Harar answered 21/4, 2013 at 10:55 Comment(4)
+1 very good detailed answer! I like the '*' in the rownames when include.missings :)Paleozoology
Thank you for a good detailed answer, as @Paleozoology also points out.Philanthropy
+1 really detailed, nice. R does have a way to handle different NA types, but I don't know if you can make use of it. It must do to be able to do class( c(1,2,NA) ) which is "numeric" and class( c("a","b",NA) ) which is "character"?Porshaport
What other packages let you use different kind of missings simultaneously? I have a dataset with many variables, some numeric, some dates, and I want to code three different kind of missings: errors, unknown and missings generated because of the reshaping of the data.Evadnee
P
5

To retain the original values, you can create new columns in which you code the NA information, for example:

df <- transform(df, b.na = ifelse(b %in% c(77, 88, 99), NA, b))
# as.character() keeps the labels when a is a factor; ifelse() would otherwise return its integer codes
df <- transform(df, a.na = ifelse(a %in% 
                        c("Don't know/Not sure","Unknown","Refused"), NA, as.character(a)))

Then you can do something like this:

   table(df$b.na , df$g)
    C K M Y
  2 0 0 4 0
  3 0 2 0 2

Another option, without creating new columns, is to use the exclude argument of table() to drop the unwanted values from the table entirely (which is different from treating them as missing values):

table(df$a,df$g,
      exclude=c('77','88','99',"Don't know/Not sure","Unknown","Refused")) 
       C K M Y
  Blue  0 0 1 2
  Green 2 1 2 0
  Red   1 0 0 3
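To make the behaviour of exclude concrete, here is a small self-contained sketch on a made-up vector (answers is a hypothetical name, not part of the question's data). The excluded categories disappear from the table, while the data itself stays untouched:

```r
# Made-up responses mixing real answers and "missing" categories.
answers <- c("Red", "Refused", "Blue", "Unknown", "Red")

# exclude drops the listed categories from the tabulation only.
tab <- table(answers, exclude = c("Refused", "Unknown"))
print(tab)

# The original vector still holds every response.
print(answers)
```

Because nothing is recoded, the distinction between "Refused" and "Unknown" is still available whenever you need it.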

You can define some global constants (even though it is not recommended) to group your "missing values", and use them in the rest of your program. Something like this:

B_MISSING <- c('77','88','99')
A_MISSING <- c("Don't know/Not sure","Unknown","Refused")
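One way such a set of codes might be used, sketched in base R under my own assumptions: split_missing() is a hypothetical helper (not from any package), and the vector b below is made up. It splits a coded vector into the values with the codes set to NA and a factor recording which code each missing entry had:

```r
# Hypothetical helper: separate real values from missing-value codes.
split_missing <- function(x, codes) {
  is_code <- x %in% codes
  list(
    value  = replace(x, is_code, NA),                       # codes become NA
    reason = factor(ifelse(is_code, as.character(x), NA),   # which code it was
                    levels = as.character(codes))
  )
}

b <- c(2, 77, 3, 99, 88, 2)             # made-up data
res <- split_missing(b, c(77, 88, 99))

print(res$value)                        # usable with na.rm = TRUE
print(table(res$reason))                # the kinds of missingness are kept
```

The value part can go straight into functions that understand NA, while the reason part keeps the information the question asks to preserve.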
Paleozoology answered 18/4, 2013 at 6:46 Comment(8)
Thank you for responding to my question. I didn't know about the exclude option. That is an interesting solution. I'm still somewhat surprised that R only has one category of missing values.Philanthropy
@EricFail R has a single missing value, NA, which is basically a logical value but also comes in other types: NA_integer_, NA_real_, NA_complex_ and NA_character_. You can see my edit for a "global" solution.Paleozoology
Strictly speaking, these are not (all) missings. "Don't know" is not a missing, it is a valid answer category, and in many cases should be treated as such. "Refused" also contains information, whereas "Unknown" is probably a true missing. I would just create an additional column with these three subcategories and refer to them whenever I needed, while using regular NA for statistical techniques that don't differentiate.Colver
@Maxim.K, your comment made me realize that I could have been more precise in my question. The variable a in my example should have been more like this c("Unknown", "Refused", 1, 1, 2, 2, 1, 2, "Unknown", "Refused", 3, 2, 3, "Don't know/Not sure", "Unknown", "Refused", 3, 2, "Refused", 1) and what I am interested in is storing a in a way where I can summarize it, but without losing the distinction between "Don't know/Not sure","Unknown","Refused." Does that make sense?Philanthropy
@agstudy, regarding the global constants, would this be part of my .Rprofile?Philanthropy
@EricFail, should variable "a" be numeric? categorical? factor?Harar
@AnandaMahto, in the example in the initial question a is a factor. In the comment above it's a character variable. It can be anything, if it helps answer the question.Philanthropy
@EricFail, in that case, you can try modifying what I've shared as follows: df2$a <- factor(df2$a, levels = c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown"), labels = c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown")); df2$a <- as.item(as.character(df2$a), labels = structure(c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown"), names = c(1, 2, 3, "Don't know/Not sure", "Refused", "Unknown")), missing.values = c("Don't know/Not sure", "Refused", "Unknown")). Hope that helps.Harar
L
5

If you are willing to stick to numeric values then NA, Inf, -Inf, and NaN could be used for different missing values. You can then use is.finite to distinguish between them and normal values:

x <- c(NA, Inf, -Inf, NaN, 1)
is.finite(x)
## [1] FALSE FALSE FALSE FALSE  TRUE

is.infinite, is.nan and is.na are also useful here.
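A rough sketch of how those predicates could be combined to label each element (the name kind is mine). Note that is.na() is also TRUE for NaN, so NaN has to be tested first:

```r
x <- c(NA, Inf, -Inf, NaN, 1)

# Test NaN before NA, because is.na(NaN) is TRUE as well.
kind <- ifelse(is.nan(x), "NaN",
        ifelse(is.na(x), "NA",
        ifelse(is.infinite(x) & x > 0, "Inf",
        ifelse(is.infinite(x), "-Inf", "value"))))
print(kind)

# Caveat: na.rm = TRUE drops NA and NaN but keeps Inf,
# so infinite "missing" codes still affect summaries.
print(mean(c(1, 2, Inf), na.rm = TRUE))
```

Because of that caveat, summaries would need something like x[is.finite(x)] rather than relying on na.rm alone.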

We could have a special print function that displays them in a more meaningful way or even create a special class but even without that the above would divide the data into finite and multiple non-finite values.

Luralurch answered 24/4, 2013 at 15:7 Comment(0)
