Replace all NA with FALSE in selected columns in R

Asked 2/9, 2011 at 3:59 Answered 28/2, 2023 at 8:17

Solved r dataframe na missing-data imputation

I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with one column as a UID and the other columns carrying either TRUE or NA. I want to change all the NAs to FALSE, but I don't want to use an explicit loop.

Can plyr do the trick? Thanks.

UPDATE #1

Thanks for the quick reply, but what if my dataset looks like the one below?

df <- data.frame(
  id = c(rep(1:19),NA),
  x1 = sample(c(NA,TRUE), 20, replace = TRUE),
  x2 = sample(c(NA,TRUE), 20, replace = TRUE)
)

I only want x1 and x2 to be processed. How can this be done?

Extensile answered 2/9, 2011 at 3:59 Comment(0)

If you want to do the replacement for a subset of variables, you can still use the x[is.na(x)] <- trick, as follows:

df[c("x1", "x2")][is.na(df[c("x1", "x2")])] <- FALSE

IMO using temporary variables makes the logic easier to follow:

vars.to.replace <- c("x1", "x2")
df2 <- df[vars.to.replace]
df2[is.na(df2)] <- FALSE
df[vars.to.replace] <- df2
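The temporary-variable version can also be wrapped in a small helper when the column list is long. This is only a minimal sketch: the helper name na_to_false is hypothetical (it does not come from any package), and a small fixed data frame stands in for the random one so the result is easy to check.

```r
# Hypothetical helper (not from any package): replace NA with FALSE
# in the named columns only, leaving all other columns untouched.
na_to_false <- function(data, cols) {
  sub <- data[cols]          # data frame holding just the target columns
  sub[is.na(sub)] <- FALSE   # is.na() on a data frame gives a logical matrix
  data[cols] <- sub          # write the cleaned columns back
  data
}

# Small fixed example; the NA in id must survive.
df <- data.frame(
  id = c(1, 2, NA),
  x1 = c(TRUE, NA, TRUE),
  x2 = c(NA, NA, TRUE)
)

out <- na_to_false(df, c("x1", "x2"))
```

With 50 columns, the column vector could be built as setdiff(names(df), "id") instead of typing every name, assuming the ID column is called id.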

Rhine answered 2/9, 2011 at 4:46 Comment(3)

I know this is an old post, but would you explain the first line to me? I get the logic when you break it down using temp variables, but I'd like to understand the one line form. I thought I was familiar with subsetting but I don't understand the [][]. I searched "double brackets" but that turned up something different. – Talkingto 2/5, 2013 at 13:57

@Talkingto You just have to read the double brackets as different subsets from left to right. For example, if x <- 1:10, then x[5:10][1:4] will give you the vector 5 6 7 8. In multiple steps, you could take the first subset and call it y, y <- x[5:10] which is 5 6 7 8 9 10. And then subset that vector y[1:4], which gives you 5 6 7 8 again. – Cymograph 7/10, 2014 at 14:39

You can also use column positions instead of explicit names, which is useful when you have a lot of variables to convert or when they have long names: df2[,14:16][is.na(df2[,14:16])] <- 0, for instance, replaces NA with 0 in columns 14, 15, and 16 of the data frame df2. – Congress 7/5, 2015 at 14:54

tidyr::replace_na is an excellent function for this.

df %>%
  replace_na(list(x1 = FALSE, x2 = FALSE))

This is such a great quick fix. The only trick is that you have to make a list of the columns you want to change.
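For a wide data frame like the 50-column one in the question, the replacement list does not have to be typed by hand. A minimal sketch, assuming the ID column is named id and every other column should have its NAs filled with FALSE (the names cols and fill are illustrative only):

```r
library(tidyr)

# Small fixed example standing in for the real data.
df <- data.frame(
  id = c(1, 2, NA),
  x1 = c(TRUE, NA, TRUE),
  x2 = c(NA, NA, TRUE)
)

# Every column except the ID column (assumes it is named "id").
cols <- setdiff(names(df), "id")

# Named list of replacements, equivalent to list(x1 = FALSE, x2 = FALSE).
fill <- setNames(rep(list(FALSE), length(cols)), cols)

# Columns not named in the list (here: id) are left alone.
out <- replace_na(df, fill)
```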

Cawley answered 19/9, 2016 at 13:55 Comment(0)

Try this code:

df <- data.frame(
  id = c(rep(1:19), NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
replace(df, is.na(df), FALSE)

UPDATE: another solution.

df2 <- df <- data.frame(
  id = c(rep(1:19), NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
df2[names(df) == "id"] <- FALSE
df2[names(df) != "id"] <- TRUE
replace(df, is.na(df) & df2, FALSE)

Nail answered 2/9, 2011 at 4:8 Comment(0)

You can use the NAToUnknown function in the gdata package:

df[,c('x1', 'x2')] = gdata::NAToUnknown(df[,c('x1', 'x2')], unknown = 'FALSE')

Marnimarnia answered 2/9, 2011 at 13:53 Comment(1)

Excellent function except for one snag - if I want to change unknowns to 0, and I already have some NAs and zeroes in the vector, then I receive the error message Error in NAToUnknown.default(x = dots[[1L]][[1L]], unknown = dots[[2L]][[1L]], : 'x' already has value “0”. – Casaleggio 1/3, 2012 at 19:22

With dplyr you could also do:

df %>% mutate_each(funs(replace(., is.na(.), F)), x1, x2)

It is a bit less readable than just using replace(), but more generic, because it lets you select the columns to be transformed. This solution is especially useful if you want to keep NAs in some columns but get rid of them in others.
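Note that mutate_each() and funs() are deprecated in current dplyr releases. A sketch of the same idea written with across(), assuming dplyr 1.0.0 or later is installed:

```r
library(dplyr)

# Small fixed example standing in for the real data.
df <- data.frame(
  id = c(1, 2, NA),
  x1 = c(TRUE, NA, TRUE),
  x2 = c(NA, NA, TRUE)
)

# Apply the replacement only to x1 and x2; id keeps its NA.
out <- df %>%
  mutate(across(c(x1, x2), ~ replace(.x, is.na(.x), FALSE)))
```

Tidyselect helpers also work inside across(), for example across(-id, ...) to target every column except id.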

Mountainous answered 27/3, 2015 at 15:31 Comment(0)

An option would be to use a for loop.

for(i in c("x1", "x2")) df[[i]][is.na(df[[i]])] <- FALSE
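Since the question asked to avoid an explicit loop, the same per-column replacement can be sketched with lapply() in base R. This variant is not part of the benchmark below, so no claim is made about its speed.

```r
# Small fixed example standing in for the real data.
df <- data.frame(
  id = c(1, 2, NA),
  x1 = c(TRUE, NA, TRUE),
  x2 = c(NA, NA, TRUE)
)

cols <- c("x1", "x2")

# lapply() returns one cleaned vector per column; assigning the list
# back to df[cols] overwrites just those columns.
df[cols] <- lapply(df[cols], function(x) replace(x, is.na(x), FALSE))
```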

Benchmark

set.seed(42)
df <- data.frame(
  id = c(rep(1:19),NA),
  x1 = sample(c(NA,TRUE), 20, replace = TRUE),
  x2 = sample(c(NA,TRUE), 20, replace = TRUE)
)

bench::mark(check=FALSE,
"Holger Brandl" = local(dplyr::mutate_each(df, dplyr::funs(replace(., is.na(.), F)), x1, x2)),
"mtelesha" = local(df <- tidyr::replace_na(df, list(x1 = FALSE, x2 = FALSE))),
Ramnath = local(df[,c('x1', 'x2')] <- gdata::NAToUnknown(df[,c('x1', 'x2')], unknown = 'FALSE')),
"Hong Ooi" = local(df[c("x1", "x2")][is.na(df[c("x1", "x2")])] <- FALSE),
GKi = local(for(i in c("x1", "x2")) df[[i]][is.na(df[[i]])] <- FALSE) )
#  expression         min   median `itr/sec` mem_al…¹ gc/se…² n_itr  n_gc total…³
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t>
#1 Holger Brandl  16.93ms  17.33ms      57.6  34.43KB    19.2    21     7   365ms
#2 mtelesha        3.94ms   4.39ms     226.    8.15KB    13.1   103     6   456ms
#3 Ramnath       400.28µs 415.44µs    2381.    1.55KB    16.7  1142     8   480ms
#4 Hong Ooi      196.87µs 206.72µs    4755.      488B    18.8  2276     9   479ms
#5 GKi             61.8µs  66.16µs   14808.      280B    20.9  7076    10   478ms

The for loop is about 3 times faster than Hong Ooi's approach, which is the second fastest, and it allocates the least memory.

Saporific answered 28/2, 2023 at 8:17 Comment(0)
