Removing non-ASCII characters from data files

I've got a bunch of csv files that I'm reading into R and including in a package/data folder in .rdata format. Unfortunately the non-ASCII characters in the data fail the check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile) but I can't seem to locate one to remove/clean them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?

Sabatier answered 29/3, 2012 at 23:6 Comment(2)
Try with regular expressions, for instance the function gsub. Check ?regexp. (Striated)
You are aware that read.csv() takes an encoding argument, so you can handle these, at least in R? What specific check do the non-ASCII characters fail: is it in R (if so, post it here), or external? (Allmon)

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII")))
# [1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII"))
# [1] 1 2 3
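
The same approach extends from vectors to whole files. A minimal sketch of a file-level clean-up, assuming latin1-encoded input (the example writes its own temporary file, so the file contents here are made up):

```r
## Create a small latin1-encoded example file (stands in for a raw csv).
raw_file   <- tempfile(fileext = ".csv")
clean_file <- tempfile(fileext = ".csv")
con <- file(raw_file, open = "w", encoding = "latin1")
writeLines(c("name,score", "Ekstr\u00f8m,10", "J\u00f6reskog,12"), con)
close(con)

## Read it back, strip the non-ASCII characters, and write a clean copy.
lines <- readLines(raw_file, encoding = "latin1")
clean <- iconv(lines, "latin1", "ASCII", sub = "")
writeLines(clean, clean_file)

readLines(clean_file)
# values: "name,score", "Ekstrm,10", "Jreskog,12"
```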
Forceps answered 29/3, 2012 at 23:50 Comment(1)
Saved my day - conversion to UTF-8 did not work for me, but conversion to ASCII did. (Luminiferous)

These days, a slightly better approach is to use the stringi package, which provides a function for general Unicode transliteration. This allows you to preserve as much of the original text as possible, converting accented characters to their closest ASCII equivalents rather than dropping them:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"
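
Note that stri_trans_general() only transliterates characters that have a Latin-ASCII mapping; symbols such as € or ★ pass through untouched. If the goal is strictly ASCII output, one possible sketch is to transliterate first and then drop whatever remains (this two-step combination is my suggestion, not part of the original answer):

```r
x <- c("Ekstr\u00f8m \u2605", "J\u00f6reskog")  # the star has no ASCII equivalent

# Step 1: transliterate accented letters to their ASCII counterparts.
y <- stringi::stri_trans_general(x, "latin-ascii")

# Step 2: drop any character still outside the ASCII range.
iconv(y, "UTF-8", "ASCII", sub = "")
# result: "Ekstrom " (star removed) and "Joreskog"
```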
Temporal answered 16/5, 2016 at 13:17 Comment(2)
Any thoughts how I can make it work with stringi? iconv("Klinik. der Univ. zu K_ln (AA\u0090R)","latin1","ASCII",sub="") => [1] "Klinik. der Univ. zu K_ln (AAR)" but stringi::stri_trans_general("Klinik. der Univ. zu K_ln (AA\u0090R)", "latin-ascii") => [1] "Klinik. der Univ. zu K_ln (AA\u0090R)" (Overvalue)
stringi::stri_trans_general(x, "latin-ascii") removes some of the non-ASCII characters in my text, but not others. tools::showNonASCII reveals the non-removed characters are: zero width space, trademark sign, Euro sign, narrow no-break space. Does this mean "latin-ascii" is the wrong transform identifier for my string? Is there a straightforward way to figure out the correct transform identifier? Thanks. (Vibratile)

I often have trouble with iconv, and I'm a base R fan.

So instead, to remove Unicode or other non-ASCII characters, I use gsub(), with lapply() to apply it to an entire data frame.

gsub("[^\u0001-\u007F]+|<U\\+\\w+>","", string)

The benefit of this gsub is that it will match a range of notation formats. Below I show the individual matches for the two patterns.

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
gsub("[^\u0001-\u007F]+","", x1)
## "Ekstrm"        "Jreskog"       "bichen Zrcher"
x2 <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
gsub("[^\u0001-\u007F]+","", x2)
## Same as x1
## "Ekstrm"        "Jreskog"       "bichen Zrcher"
x3 <- c("<U+FDFA>", "1<U+2009>00", "X<U+203E>")
gsub("<U\\+\\w+>","", x3)
## ""    "100" "X"
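
And, as mentioned above, lapply() lets you run the same gsub() over every character column of a data frame. A small sketch (the data frame df and its columns are invented for illustration):

```r
df <- data.frame(name = c("Ekstr\u00f8m", "J\u00f6reskog"),
                 city = c("Z\u00fcrich", "Malm\u00f6"),
                 n    = 1:2,
                 stringsAsFactors = FALSE)

# Strip non-ASCII characters from character columns; leave other types alone.
df[] <- lapply(df, function(col) {
  if (is.character(col)) gsub("[^\u0001-\u007F]+", "", col) else col
})

df$name  # now "Ekstrm"  "Jreskog"
df$city  # now "Zrich"   "Malm"
```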
Actinopod answered 26/1, 2021 at 10:2 Comment(0)

To drop all values containing non-ASCII characters (borrowing code from @Hadley), you can use xfun::is_ascii() together with filter() from dplyr:

library(dplyr)  # for %>%, tibble(), and filter()

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher", "alex")

x %>%
  tibble(name = .) %>%
  filter(xfun::is_ascii(name))  # keeps only "alex"
Oto answered 7/5, 2019 at 18:41 Comment(0)

textclean::replace_non_ascii() did the job for me. This function removes not only special letters, but also euro, trademark, and other symbols:

x <- c("Ekstr\u00f8m \u2605", "J\u00f6reskog \u20ac", "bi\u00dfchen Z\u00fcrcher \u2122")

stringi::stri_trans_general(x, "latin-ascii")
# [1] "Ekstrom ★"          "Joreskog €"         "bisschen Zurcher ™"

textclean::replace_non_ascii(x)
# [1] "Ekstrom"               "Joreskog"              "bisschen Zurcher cent"
Orthotropic answered 21/3, 2021 at 16:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.