How to identify/delete non-UTF-8 characters in R

When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).

How can I identify invalid UTF-8 characters in a string and then delete them?
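
For context, here is a minimal sketch of the pipeline that breaks (the file name is made up):

require(foreign)
require(rjson)

dat <- read.dta("survey.dta")   ## hypothetical Stata file; some strings may carry invalid bytes
json <- toJSON(dat)             ## this step fails once a string is not valid UTF-8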

Stoops answered 25/6, 2013 at 7:13

Another solution uses iconv and its sub argument: a character string which, if not NA (here I set it to ''), is used to replace any non-convertible bytes in the input.

x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
"faile"

Note that if we instead declare the right encoding, the character is kept:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8",sub='')
facile
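
If you only need the identification step, base R (3.3.0 and later) also provides validUTF8(), which reports per element whether a string is valid UTF-8:

x <- "fa\xE7ile"
validUTF8(x)          ## FALSE: the \xE7 byte is not valid UTF-8
validUTF8("façile")   ## TRUE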
Venture answered 25/6, 2013 at 8:1

Yihui's xfun package has a function, read_utf8, that attempts to read a file and assumes it is encoded as UTF-8. If the file contains non-UTF-8 lines, a warning is triggered, letting you know which line(s) contain non-UTF-8 characters. Under the hood it uses a non-exported function, xfun:::invalid_utf8(), which is simply the following: which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))).
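
Applied directly to a character vector (a quick sketch with made-up strings), that expression returns the positions of the invalid elements:

x <- c("ok", "fa\xE7ile", NA)
which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8")))
## [1] 2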

To detect specific non-UTF-8 words in a string, you could modify the above slightly and do something like:

## flag elements that are not valid UTF-8
invalid_utf8_ <- function(x) {
  !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
}

## split a string on a separator and report which words contain invalid UTF-8
detect_invalid_utf8 <- function(string, separator) {
  stringSplit <- unlist(strsplit(string, separator))
  invalidIndex <- unlist(lapply(stringSplit, invalid_utf8_))
  data.frame(
    word = stringSplit[invalidIndex],
    stringIndex = which(invalidIndex)
  )
}

x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"

detect_invalid_utf8(x, " ")

#     word stringIndex
# 1 façile           5
# 2 façade           9
Crevice answered 29/7, 2019 at 21:5

Another approach is to remove the bad characters with dplyr across the whole dataset:

library(dplyr)

MyData %>%
  mutate_at(vars(MyTextVar1, MyTextVar2), function(x) gsub('[^ -~]', '', x))

Where MyData is the data set and MyTextVar1 and MyTextVar2 are the text variables to remove the bad apples from. This may be less robust than fixing the encoding, but often it is fine, and it is easier to simply remove them.
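
To see what that pattern does (a quick sketch with a made-up string), [^ -~] matches every character outside the printable ASCII range from space to tilde:

gsub('[^ -~]', '', "façade")   ## the non-ASCII ç is outside that range, so it is dropped
## [1] "faade"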

Ce answered 15/8, 2018 at 13:23 Comment(2)
Building on Tyler's answer, you could also consider MyData %>% mutate_if(is.character, ~gsub('[^ -~]', '', .)), which targets all character columns, or MyData %>% mutate_all(~gsub('[^ -~]', '', .)), which targets all columns. (Fungible)
This removes far more characters than needed; the question asked about non-UTF-8 characters, not non-ASCII ones. (Forgive)
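
A narrower variant along the same lines (a sketch that reuses the iconv(sub = '') idea from the accepted answer) drops only the bytes that are not valid UTF-8 and leaves legitimate non-ASCII characters alone:

library(dplyr)

MyData %>%
  mutate_if(is.character, ~ iconv(., "UTF-8", "UTF-8", sub = ''))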

Instead of deleting them, you can try to convert them into UTF-8 strings using iconv.

require(foreign)
dat <- read.dta("data.dta")

## re-encode the levels of every factor column from latin1 to UTF-8
for (j in seq_len(ncol(dat))) {
  if (is.factor(dat[, j]))
    levels(dat[, j]) <- iconv(levels(dat[, j]), from = "latin1", to = "UTF-8")
}

You can replace latin1 with a more suitable encoding for your case. Since we don't have access to your data, it is difficult to know which one will be most suitable.
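
If you are not sure which encoding names are even available to try, base R's iconvlist() enumerates the conversions iconv() supports on your system (a quick sketch):

head(iconvlist())
grep("latin", iconvlist(), value = TRUE, ignore.case = TRUE)   ## candidate Latin encodings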

Superable answered 25/6, 2013 at 7:53
