Yihui's xfun package has a function, read_utf8(), that reads a file assuming it is encoded as UTF-8. If the file contains non-UTF-8 lines, a warning is triggered, telling you which line(s) contain non-UTF-8 characters. Under the hood it uses the non-exported function xfun:::invalid_utf8(), which is simply the following:

which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8")))
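As a quick illustration of how that one-liner behaves (the vector x below is hypothetical), note that iconv() returns NA for strings that fail the UTF-8 round trip, while genuine NA elements are excluded:

```r
# Hypothetical vector: the second element contains the latin-1 byte \xE7,
# which is not a valid UTF-8 sequence, so iconv() returns NA for it
x <- c("hello", "fa\xE7ade", NA)

# NA elements are skipped; only genuinely invalid strings are flagged
which(!is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8")))
# → 2
```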
To detect specific non-UTF-8 words in a string, you could modify the above slightly and do something like:
invalid_utf8_ <- function(x) {
  !is.na(x) & is.na(iconv(x, "UTF-8", "UTF-8"))
}

detect_invalid_utf8 <- function(string, separator) {
  stringSplit <- unlist(strsplit(string, separator))
  invalidIndex <- unlist(lapply(stringSplit, invalid_utf8_))
  data.frame(
    word = stringSplit[invalidIndex],
    stringIndex = which(invalidIndex)
  )
}
x <- "This is a string fa\xE7ile blah blah blah fa\xE7ade"
detect_invalid_utf8(x, " ")
# word stringIndex
# 1 façile 5
# 2 façade 9
To simply strip non-ASCII characters from a data frame instead (using dplyr), MyData %>% mutate_if(is.character, ~gsub('[^ -~]', '', .)) targets all character columns, or MyData %>% mutate_all(~gsub('[^ -~]', '', .)) targets all columns. – Fungible
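The same regex works in base R on a single string: [^ -~] matches anything outside the printable ASCII range (space through tilde). useBytes = TRUE makes gsub() operate byte-wise, avoiding an encoding error when the input is not valid UTF-8 (the string below is hypothetical):

```r
x <- "This is a string fa\xE7ade"

# Remove every byte outside printable ASCII (0x20 " " through 0x7E "~")
gsub("[^ -~]", "", x, useBytes = TRUE)
# → "This is a string faade"
```

Note that this deletes the offending bytes outright; if you want to keep accented characters, converting with iconv() to declare the correct source encoding is the less destructive option.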