Removing non-ASCII characters from data files

I've got a bunch of csv files that I'm reading into R and including in a package/data folder in .rdata format. Unfortunately the non-ASCII characters in the data fail the check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile) but I can't seem to locate one to remove/clean them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?

Sabatier answered 29/3, 2012 at 23:6 Comment(2)
Try with regular expressions, for instance the function gsub. Check ?regexp. (Striated)
You are aware that read.csv() takes an encoding argument, so you can handle these, at least in R? What specific check do the non-ASCII characters fail: is it in R (if so, post it here), or external? (Allmon)

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII")))
# [1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
# [1] 1 2 3
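
The same approach extends from vectors to whole files. A minimal sketch of a file-level clean-up, assuming latin1-encoded input (the example writes its own temporary file, so the file contents here are made up):

```r
## Create a small latin1-encoded example file (stands in for a raw csv).
raw_file   <- tempfile(fileext = ".csv")
clean_file <- tempfile(fileext = ".csv")
con <- file(raw_file, open = "w", encoding = "latin1")
writeLines(c("name,score", "Ekstr\u00f8m,10", "J\u00f6reskog,12"), con)
close(con)

## Read it back, strip the non-ASCII characters, and write a clean copy.
lines <- readLines(raw_file, encoding = "latin1")
clean <- iconv(lines, "latin1", "ASCII", sub = "")
writeLines(clean, clean_file)

readLines(clean_file)
# values: "name,score", "Ekstrm,10", "Jreskog,12"
```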
Forceps answered 29/3, 2012 at 23:50 Comment(1)
Saved my day - conversion to UTF-8 did not work for me, but conversion to ASCII did. (Luminiferous)

These days, a slightly better approach is to use the stringi package, which provides a function for general Unicode transliteration. This allows you to preserve as much of the original text as possible, converting accented characters to their closest ASCII equivalents rather than dropping them:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"
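
Note that stri_trans_general() only transliterates characters that have a Latin-ASCII mapping; symbols such as € or ★ pass through untouched. If the goal is strictly ASCII output, one possible sketch is to transliterate first and then drop whatever remains (this two-step combination is my suggestion, not part of the original answer):

```r
x <- c("Ekstr\u00f8m \u2605", "J\u00f6reskog")  # the star has no ASCII equivalent

# Step 1: transliterate accented letters to their ASCII counterparts.
y <- stringi::stri_trans_general(x, "latin-ascii")

# Step 2: drop any character still outside the ASCII range.
iconv(y, "UTF-8", "ASCII", sub = "")
# result: "Ekstrom " (star removed) and "Joreskog"
```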
Temporal answered 16/5, 2016 at 13:17 Comment(2)
Any thoughts how I can make it work with stringi? iconv("Klinik. der Univ. zu K_ln (AA\u0090R)","latin1","ASCII",sub="") => [1] "Klinik. der Univ. zu K_ln (AAR)" but stringi::stri_trans_general("Klinik. der Univ. zu K_ln (AA\u0090R)", "latin-ascii") => [1] "Klinik. der Univ. zu K_ln (AA\u0090R)" (Overvalue)
stringi::stri_trans_general(x, "latin-ascii") removes some of the non-ASCII characters in my text, but not others. tools::showNonASCII reveals the non-removed characters are: zero width space, trademark sign, Euro sign, narrow no-break space. Does this mean "latin-ascii" is the wrong transform identifier for my string? Is there a straightforward way to figure out the correct transform identifier? Thanks. (Vibratile)

I often have trouble with iconv, and I'm a base R fan.

So instead, to remove Unicode or other non-ASCII characters, I use gsub(), with lapply() to apply it to an entire data frame.

gsub("[^\u0001-\u007F]+|<U\\+\\w+>","", string)

The benefit of this gsub is that it will match a range of notation formats. Below I show the individual matches for the two patterns.

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
gsub("[^\u0001-\u007F]+","", x1)
## "Ekstrm"        "Jreskog"       "bichen Zrcher"
x2 <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
gsub("[^\u0001-\u007F]+","", x2)
## Same as x1
## "Ekstrm"        "Jreskog"       "bichen Zrcher"
x3 <- c("<U+FDFA>", "1<U+2009>00", "X<U+203E>")
gsub("<U\\+\\w+>","", x3)
## ""    "100" "X"
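
And, as mentioned above, lapply() lets you run the same gsub() over every character column of a data frame. A small sketch (the data frame df and its columns are invented for illustration):

```r
df <- data.frame(name = c("Ekstr\u00f8m", "J\u00f6reskog"),
                 city = c("Z\u00fcrich", "Malm\u00f6"),
                 n    = 1:2,
                 stringsAsFactors = FALSE)

# Strip non-ASCII characters from character columns; leave other types alone.
df[] <- lapply(df, function(col) {
  if (is.character(col)) gsub("[^\u0001-\u007F]+", "", col) else col
})

df$name  # now "Ekstrm"  "Jreskog"
df$city  # now "Zrich"   "Malm"
```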
Actinopod answered 26/1, 2021 at 10:2 Comment(0)

To drop all values containing non-ASCII characters (borrowing code from @Hadley), you can use xfun::is_ascii() together with filter() from dplyr:

library(dplyr)  # for %>%, tibble(), and filter()

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")

x %>%
  tibble(name = .) %>%
  filter(xfun::is_ascii(name))  # keeps only "alex"
Oto answered 7/5, 2019 at 18:41 Comment(0)

textclean::replace_non_ascii() did the job for me. This function removes not only special letters, but also euro, trademark, and other symbols:

x <- c("Ekstr\u00f8m \u2605", "J\u00f6reskog \u20ac", "bi\u00dfchen Z\u00fcrcher \u2122")

stringi::stri_trans_general(x, "latin-ascii")
# [1] "Ekstrom ★"          "Joreskog €"         "bisschen Zurcher ™"

textclean::replace_non_ascii(x)
# [1] "Ekstrom"               "Joreskog"              "bisschen Zurcher cent"
Orthotropic answered 21/3, 2021 at 16:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.