Replace <NA> in a factor column

6

48

I want to replace <NA> values in a factor column with a valid value, but I cannot find a way. This example is for demonstration only; the original data comes from a foreign CSV file that I have to deal with.

df <- data.frame(a=sample(0:10, size=10, replace=TRUE),
                 b=sample(20:30, size=10, replace=TRUE))
df[df$a==0,'a'] <- NA
df$a <- as.factor(df$a)

The data could look like this:

      a  b
1     1 29
2     2 23
3     3 23
4     3 22
5     4 28
6  <NA> 24
7     2 21
8     4 25
9  <NA> 29
10    3 24

Now I want to replace the <NA> values with a number.

df[is.na(df$a), 'a'] <- 88
In `[<-.factor`(`*tmp*`, iseq, value = c(88, 88)) :
  invalid factor level, NA generated

I think I am missing a fundamental R concept about factors. Am I? I cannot understand why this doesn't work. I think "invalid factor level" means that 88 is not a valid level of that factor, right? So do I have to tell the factor column that there is another level?

Millisent answered 24/8, 2016 at 14:46 Comment(6)
I don't understand why you have the line df$a <- as.factor(df$a). Why do you want that column to be a factor?Lyophobic
@buhtz: if the data.frame call does not happen to sample a 0, others will not be able to replicate your problem; it may be better to call set.seed() first.Jar
@000andy8484 Thanks for that hint. I will pin that to my notes for the next time.Millisent
@user1945827 It is just to imitate my real data (coming from a foreign CSV file) and my real situation, while providing a minimal example.Millisent
I would suggest that the factor is a red herring. When you import the data with read.csv(), set stringsAsFactors=F; this keeps text columns from being converted to factors in the resulting data.frame.Lyophobic
@user1945827 Awesome! Thanks.Millisent
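A minimal sketch of the import-side suggestion from the comments, assuming a small inline CSV (the `csv_text` content is made up for illustration and stands in for the real file). With `stringsAsFactors = FALSE`, which is already the default from R 4.0.0 onwards, the text column stays character, so the NA can be replaced by plain assignment:

```r
# Hypothetical CSV content standing in for the real file
csv_text <- "a,b\nx,29\nNA,23\ny,22"

# Read without converting text columns to factors
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The column is character, so any replacement value is accepted
df$a[is.na(df$a)] <- "88"
str(df)
```

If a factor is needed later, `factor(df$a)` can be called once the replacement is done.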
S
79

1) addNA If fac is a factor, addNA(fac) is the same factor but with NA added as a level. See ?addNA

To force the NA level to be 88:

facna <- addNA(fac)
levels(facna) <- c(levels(fac), 88)

giving:

> facna
 [1] 1  2  3  3  4  88 2  4  88 3 
Levels: 1 2 3 4 88

1a) This can be written in a single line as follows:

`levels<-`(addNA(fac), c(levels(fac), 88))

2) factor It can also be done in one line using the various arguments of factor like this:

factor(fac, levels = levels(addNA(fac)), labels = c(levels(fac), 88), exclude = NULL)

2a) or equivalently:

factor(fac, levels = c(levels(fac), NA), labels = c(levels(fac), 88), exclude = NULL)

3) ifelse Another approach is:

factor(ifelse(is.na(fac), 88, paste(fac)), levels = c(levels(fac), 88))

4) forcats The forcats package has a function for this:

library(forcats)

fct_na_value_to_level(fac, "88")
## [1] 1  2  3  3  4  88 2  4  88 3 
## Levels: 1 2 3 4 88

Note: We used the following for input fac

fac <- structure(c(1L, 2L, 3L, 3L, 4L, NA, 2L, 4L, NA, 3L), .Label = c("1", 
"2", "3", "4"), class = "factor")

Update: Have improved (1) and added (1a). Later added (4).
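As a quick sanity check, the base R variants (1), (2) and (3) can be compared side by side. This is only a sketch: the names `m1`, `m2` and `m3` are introduced here for illustration, and `fac` is rebuilt with `factor()` rather than `structure()`, which should give an equivalent input.

```r
# Input factor, equivalent to the `fac` used in this answer
fac <- factor(c(1, 2, 3, 3, 4, NA, 2, 4, NA, 3))

# (1) addNA, then relabel the NA level as 88
m1 <- addNA(fac)
levels(m1) <- c(levels(fac), 88)

# (2) factor() with an explicit NA level and exclude = NULL
m2 <- factor(fac, levels = levels(addNA(fac)),
             labels = c(levels(fac), 88), exclude = NULL)

# (3) rebuild the factor from character values
m3 <- factor(ifelse(is.na(fac), 88, paste(fac)),
             levels = c(levels(fac), 88))

# All three agree on both values and levels
stopifnot(identical(as.character(m1), as.character(m2)),
          identical(as.character(m1), as.character(m3)),
          identical(levels(m1), levels(m2)),
          identical(levels(m1), levels(m3)))
```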

Soundproof answered 24/8, 2016 at 14:55 Comment(3)
Hey :) I did 1a for a column in my data.frame. The level appears, but if I want to calculate means for specific conditions, let's say for all b in the above example that have the level NA, I get NaN. I tried mean(df$b[df$a==NA]). Also, str(df) gives me: Factor w/ 3 levels "1", "2", "3", NA:... I think what I need is "1", "2", "3", "NA", right?Beginning
Option 3) worked for me and I could correctly apply it with a pipe. I tested with and without paste(fac) inside the ifelse statement and both worked fine for me. Any specific reason for why the paste needs to be included?Ddene
So that the factor is rebuilt from scratch.Soundproof
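To illustrate why `paste()` matters in option 3, here is a small sketch using a made-up factor `f` whose labels differ from its internal integer codes. Without `paste()`, `ifelse()` drops the factor attributes and returns the codes rather than the labels; when the labels happen to be "1", "2", "3", ... the two coincide, which is presumably why both versions appeared to work in the comment above.

```r
# Hypothetical factor whose labels ("3", "5", "8") differ from its codes (1, 2, 3)
f <- factor(c("3", "5", NA, "8"))

# Without paste(): the integer codes leak through
without_paste <- ifelse(is.na(f), 88, f)

# With paste(): the labels are used, as intended
with_paste <- ifelse(is.na(f), 88, paste(f))

print(without_paste)
print(with_paste)
```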
S
9

I had similar issues and I want to add what I consider the most pragmatic (and also tidy) solution:

Convert the column to a character column, use mutate and a simple ifelse statement to change the NA values to what you want the factor level to be (I have chosen "None"), then convert it back to a factor column:

library(dplyr)

df %>% mutate(
  a = as.character(a),
  a = ifelse(is.na(a), "None", a),
  a = as.factor(a)
)

Clean and painless because you do not actually have to dabble with NA values when they occur in a factor column. You bypass the weirdness and end up with a clean factor variable.

Also, in response to the comment made below regarding multiple columns: You can wrap the statements in a function and use mutate_if to select all factor variables or, if you know the names of the columns of concern, mutate_at to apply that function:

replace_factor_na <- function(x){
  x <- as.character(x)
  x <- if_else(is.na(x), "None", x)
  as.factor(x)  # return the rebuilt factor visibly
}

df <- df %>%
  mutate_if(is.factor, replace_factor_na)
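For readers without dplyr, the same idea can be sketched in base R. This is a swapped-in alternative, not the method above: the helper name `replace_factor_na_base`, its `replacement` argument and the example data frame are all made up for illustration.

```r
# Hypothetical base R helper: character round trip, then back to factor
replace_factor_na_base <- function(x, replacement = "None") {
  x <- as.character(x)
  x[is.na(x)] <- replacement
  factor(x)
}

# Made-up example data with two factor columns and one numeric column
df <- data.frame(a = factor(c("1", NA, "3")),
                 b = factor(c(NA, "x", "y")),
                 n = c(10, 20, 30))

# Apply the helper to factor columns only
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], replace_factor_na_base)
str(df)
```

One thing to keep in mind with any character round trip: the rebuilt factor gets its levels in sorted order, so a custom level order on the original column is not preserved.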
Stanwood answered 25/4, 2020 at 13:19 Comment(3)
it worked and i think this is the best answer tidywise.Dovekie
how do you do it with mutate_at. imagine one wants to do it for multiple columnsRadarman
Moj's question was valid, especially for large datasets, so I extended my answer to be more flexible and to fix several columns in one go.Stanwood
O
8

Another way to do it is:

#check levels
levels(df$a)
#[1] "3"  "4"  "7"  "9"  "10"

#add a new factor level, i.e. 88 in our example
df$a = factor(df$a, levels=c(levels(df$a), 88))

#convert all NA's to 88
df$a[is.na(df$a)] = 88

#check levels again
levels(df$a)
#[1] "3"  "4"  "7"  "9"  "10" "88"
Oina answered 30/9, 2017 at 6:20 Comment(0)
A
6

My approach is a more traditional one, using the factor function:

a <- factor(a, 
            exclude = NULL, 
            levels = c(levels(a), NA),
            labels = c(levels(a), "None"))

You can replace "None" with whatever replacement you want (0L, for example).

Aboveboard answered 6/11, 2019 at 15:29 Comment(2)
I think this is the neatest answer of all, done within just one basic function. This should be upvoted more.Nonattendance
I'm glad to hear that, thanks so muchAboveboard
I
5

The basic concept of a factor variable is that it can only take specific values, i.e., the levels. A value not in the levels is invalid.

You have two possibilities:

If you have a variable that follows this concept, make sure to define all levels when you create it, even those without corresponding values.

Or make the variable a character variable and work with that.
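Both possibilities can be sketched in a few lines. The vectors below are made up for illustration; the first part also reproduces the failure from the question, where assigning a value that is not a level yields NA with a warning.

```r
# Assigning a value that is not among the levels fails: the NA stays
g <- factor(c("1", "2", NA))
suppressWarnings(g[is.na(g)] <- "88")   # warns "invalid factor level, NA generated"

# Possibility 1: declare every level up front, including "88"
f <- factor(c("1", "2", NA), levels = c("1", "2", "88"))
f[is.na(f)] <- "88"                     # accepted, "88" is a known level

# Possibility 2: work with a character vector instead
ch <- c("1", "2", NA)
ch[is.na(ch)] <- "88"

print(g)
print(f)
print(ch)
```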

PS: Often these problems result from data import. For instance, what you show there looks like it should be a numeric variable and not a factor variable.

Iodize answered 24/8, 2016 at 14:53 Comment(1)
It is hard to decide where to put the green mark here! ;) Your answer provided me the background info about the basic concept I missed before. Thank you very much.Millisent
J
4

The problem is that 88 is not a level of that factor:

> levels(df$a)
[1] "2"  "4"  "5"  "9"  "10"

You can't change it straight away, but the following will do the trick:

df$a <- as.numeric(as.character(df$a))
df[is.na(df$a),1] <- 88
df$a <- as.factor(df$a)
> df$a
 [1] 9  88 3  9  5  9  88 8  3  9 
Levels: 3 5 8 9 88
> levels(df$a)
[1] "3"  "5"  "8"  "9"  "88"
Jar answered 24/8, 2016 at 15:10 Comment(1)
df$a <- as.numeric(levels(df$a))[df$a] is a slightly more efficient variant for as.numeric(as.character()).Jar
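A short sketch of why the conversion goes through `as.character()` (or through `levels()`), using a made-up factor `f`: calling `as.numeric()` directly on a factor returns the internal integer codes, not the values shown in the labels.

```r
# Hypothetical factor; its levels are sorted as "3", "5", "9"
f <- factor(c("9", "3", "5"))

codes    <- as.numeric(f)                 # internal codes, not the values
via_char <- as.numeric(as.character(f))   # values recovered from the labels
via_lev  <- as.numeric(levels(f))[f]      # same values, converting each level only once

print(codes)
print(via_char)
print(via_lev)
```

The `levels()` variant converts each distinct level once and then indexes by code, which is where the efficiency gain on long vectors with few levels comes from.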

© 2022 - 2024 — McMap. All rights reserved.