Replace all particular values in a data frame

Asked 21/10, 2013 at 19:41 Answered 29/2 at 9:47

113

Having a data frame, how do I go about replacing all particular values along all rows and columns. Say for example I want to replace all empty records with NA's (without typing the positions):

df <- data.frame(list(A=c("", "xyz", "jkl"), B=c(12, "", 100)))

    A   B
1      12
2  xyz    
3  jkl 100

Expected result:

    A   B
1  NA   12
2  xyz  NA  
3  jkl  100

Nutrient answered 21/10, 2013 at 19:41 Comment(0)

169

Like this:

> df[df==""]<-NA
> df
     A    B
1 <NA>   12
2  xyz <NA>
3  jkl  100

Nihi answered 21/10, 2013 at 19:44 Comment(5)

is there a way to do this efficiently for more than 1 value!? – Delindadelineate 11/3, 2015 at 10:23

This doesn't work for factors, df[df=="xyz"]<-"abc" will error with "invalid factor level." Is there a more general solution? – Irita 2/9, 2015 at 4:22

not working for me. I tried this: dfSmallDiscreteCustomSalary[dfSmallDiscreteCustomSalary$salary=="<=50K"] <- "49K". Still for unique(dfSmallDiscreteCustomSalary$salary) i get: [1] >50K <=50K – Georgetown 5/11, 2015 at 12:24

glallen ... if you're trying to modify a factor column with a new value that already a factor, there are probably more clever ways that what I'm about to suggest, but you could df$factorcolumn <- as.character(df$factorcolumn), then make your modification, and finish off by turning it back into a factor again... df$factorcolumn <- as.factor(df$factorcolumn); it'll be complete with your new level and desired value. – Spiro 25/10, 2017 at 22:54

Found it out: df.na.replace(df.columns, Map("" -> "NA")).show. Interestingly I am not able to replace with null as value. I am getting : java.lang.IllegalArgumentException: Unsupported value type java.lang.String (null). at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$convertToDouble(DataFrameNaFunctions.scala:434) – Ruberta 27/10, 2017 at 20:24

Since PikkuKatja and glallen asked for a more general solution and I cannot comment yet, I'll write an answer. You can combine statements as in:

> df[df=="" | df==12] <- NA
> df
     A    B
1  <NA> <NA>
2  xyz  <NA>
3  jkl  100

For factors, zxzak's code already yields factors:

> df <- data.frame(list(A=c("","xyz","jkl"), B=c(12,"",100)))
> str(df)
'data.frame':   3 obs. of  2 variables:
 $ A: Factor w/ 3 levels "","jkl","xyz": 1 3 2
 $ B: Factor w/ 3 levels "","100","12": 3 1 2

If in trouble, I'd suggest to temporarily drop the factors.

df[] <- lapply(df, as.character)

Jarman answered 8/12, 2015 at 1:12 Comment(0)

Here are a couple dplyr options:

library(dplyr)

# all columns:
df %>% 
  mutate_all(~na_if(., ''))

# specific column types:
df %>% 
  mutate_if(is.factor, ~na_if(., ''))

# specific columns:  
df %>% 
  mutate_at(vars(A, B), ~na_if(., ''))

# or:
df %>% 
  mutate(A = replace(A, A == '', NA))

# replace can be used if you want something other than NA:
df %>% 
  mutate(A = as.character(A)) %>% 
  mutate(A = replace(A, A == '', 'used to be empty'))

Annelleannemarie answered 22/3, 2019 at 1:48 Comment(2)

How would you go about using the all columns solution to replace several strings by NAs in the whole dataset? – Hoahoactzin 20/9, 2019 at 16:19

These options are still perfectly valid, just note that the the "mutate_at" and "mutate_all" functions have been superseded with the "across()." They're still supported, but R recommends "across()" instead. More details here: dplyr.tidyverse.org/reference/across.html – Tubb 17/6, 2021 at 21:10

We can use data.table to get it quickly. First create df without factors,

df <- data.frame(list(A=c("","xyz","jkl"), B=c(12,"",100)), stringsAsFactors=F)

Now you can use

setDT(df)
for (jj in 1:ncol(df)) set(df, i = which(df[[jj]]==""), j = jj, v = NA)

and you can convert it back to a data.frame

setDF(df)

If you only want to use data.frame and keep factors it's more difficult, you need to work with

levels(df$value)[levels(df$value)==""] <- NA

where value is the name of every column. You need to insert it in a loop.

Abbreviate answered 28/11, 2016 at 19:28 Comment(2)

Why would you use an external library for this use case? Why a loop if this can be solved with one line? How does your answer add value beyond the answers already present? I don't intend to be harsh, I think I am missing something, hence the questions. – Jarman 21/6, 2017 at 23:34

It's much faster for large datasets. It adds an alternative so that the user can choose the best for him. – Abbreviate 22/6, 2017 at 9:12

If you want to replace multiple values in a data frame, looping through all columns might help.

Say you want to replace "" and 100:

na_codes <- c(100, "")
for (i in seq_along(df)) {
    df[[i]][df[[i]] %in% na_codes] <- NA
}

Compact answered 7/4, 2017 at 2:11 Comment(0)

It appears that a solution is missing for multiple values to be replaced and for factors, so I will add one.

Consider a data frame dat with various classes.

dat
#    character integer       Date factor               POSIX
# 1                  4 2022-07-10      B 2022-07-10 20:08:10
# 2                  1 2022-07-11    FOO 2022-07-10 21:08:10
# 3                 -2 2022-07-12        2022-07-10 22:08:10
# 4                  2 2022-07-13      B 2022-07-10 23:08:10
# 5          a       3 2022-07-14        2022-07-11 00:08:10
# 6          c       1 2022-07-15        2022-07-11 01:08:10
# 7          a      -1 2022-07-16    FOO 2022-07-11 02:08:10
# 8          a      -1 2022-07-17      A 2022-07-11 03:08:10
# 9                  4 2022-07-18    FOO 2022-07-11 04:08:10
# 10         c       0 2022-07-19    FOO 2022-07-11 05:08:10
# 11         b      -2 2022-07-20      B 2022-07-11 06:08:10
# 12         c      -2 2022-07-21      A 2022-07-11 07:08:10

We may put everything we want to convert to NA on a list to_na,

To_NA <- list('', -1, -2, 'c', 'FOO', as.Date('2022-07-17'), as.POSIXct('2022-07-11 00:08:10'))

and use it in a small function make_na based on replace. if the respective variable is.factor we may want to droplevels of values that have just been deleted.

make_na <- \(x, z) {x <- replace(x, x %in% z, NA); if (is.factor(x)) droplevels(x) else x}

We can apply it on a vector,

make_na(dat$character, To_NA)
# [1] NA  NA  NA  NA  "a" NA  "a" "a" NA  NA  "b" NA

or loop over the columns using lapply.

dat[] <- lapply(dat, make_na, To_NA)

Gives

dat
#    character integer       Date factor               POSIX
# 1       <NA>       4 2022-07-10      B 2022-07-10 20:08:10
# 2       <NA>       1 2022-07-11   <NA> 2022-07-10 21:08:10
# 3       <NA>      NA 2022-07-12   <NA> 2022-07-10 22:08:10
# 4       <NA>       2 2022-07-13      B 2022-07-10 23:08:10
# 5          a       3 2022-07-14   <NA>                <NA>
# 6       <NA>       1 2022-07-15   <NA> 2022-07-11 01:08:10
# 7          a      NA 2022-07-16   <NA> 2022-07-11 02:08:10
# 8          a      NA       <NA>      A 2022-07-11 03:08:10
# 9       <NA>       4 2022-07-18   <NA> 2022-07-11 04:08:10
# 10      <NA>       0 2022-07-19   <NA> 2022-07-11 05:08:10
# 11         b      NA 2022-07-20      B 2022-07-11 06:08:10
# 12      <NA>      NA 2022-07-21      A 2022-07-11 07:08:10

Where:

str(dat)
# 'data.frame': 12 obs. of  5 variables:
#  $ character: chr  NA NA NA NA ...
#  $ integer  : int  4 1 NA 2 3 1 NA NA 4 0 ...
#  $ Date     : Date, format: "2022-07-10" "2022-07-11" "2022-07-12" ...
#  $ factor   : Factor w/ 2 levels "A","B": 2 NA NA 2 NA NA NA 1 NA NA ...
#  $ POSIX    : POSIXct, format: "2022-07-10 20:08:10" "2022-07-10 21:08:10" "2022-07-10 22:08:10" ...

Data:

dat <- structure(list(character = c("", "", "", "", "a", "c", "a", "a", 
"", "c", "b", "c"), integer = c(4L, 1L, -2L, 2L, 3L, 1L, -1L, 
-1L, 4L, 0L, -2L, -2L), Date = structure(c(19183, 19184, 19185, 
19186, 19187, 19188, 19189, 19190, 19191, 19192, 19193, 19194
), class = "Date"), factor = structure(c(3L, 4L, 1L, 3L, 1L, 
1L, 4L, 2L, 4L, 4L, 3L, 2L), levels = c("", "A", "B", "FOO"), class = "factor"), 
    POSIX = structure(c(1657476490L, 1657480090L, 1657483690L, 
    1657487290L, 1657490890L, 1657494490L, 1657498090L, 1657501690L, 
    1657505290L, 1657508890L, 1657512490L, 1657516090L), class = c("POSIXct", 
    "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA, 
-12L))

Zaporozhye answered 10/7, 2022 at 18:52 Comment(0)

Another option could be using sapply. Here is a reproducible example:

df <- data.frame(list(A=c("", "xyz", "jkl"), B=c(12, "", 100)))
df[sapply(df, \(x) x == "")] <- NA
df
#>      A    B
#> 1 <NA>   12
#> 2  xyz <NA>
#> 3  jkl  100

^{Created on 2023-01-15 with reprex v2.0.2}

Please note: with R 4.1.0 and later you could use \(x) instead of function(x)

Catalan answered 15/1, 2023 at 10:18 Comment(0)

Another option is to use replace_with_na_all() from the naniar package, which allows you to replace all the values meeting a condition in the entire dataframe.

library(naniar)
library(dplyr)
    
df %>% 
  replace_with_na_all(condition = ~.x == "")

Output

  A     B    
  <chr> <chr>
1 NA    12   
2 xyz   NA   
3 jkl   100

The upside to this method is that if you also had some cells that also had spaces included, then we could provide both in the conditions argument. Although it would be better to first just trim the whitespace, then use the function above (i.e., adding mutate(across(everything(), ~ trimws(.x))) to the pipe).

df <- data.frame(list(A=c("", "xyz", "  "), B=c(12, "   ", 100)))

df %>%
  replace_with_na_all(condition = ~.x %in% c("", "  ", "   "))

#  A     B    
#  <chr> <chr>
#1 NA    12   
#2 xyz   NA   
#3 NA    100

Erlking answered 10/7, 2022 at 21:14 Comment(0)

It´s also possible to use the gsub function in combination with lapply.

df[] <- lapply(df, function(x) (gsub("", NA, x)))

Tsarina answered 29/2 at 9:47 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Gives

Recommended topics

Hot tags