Making non-ASCII data suitable for CRAN

I have some data containing non-ASCII characters that I want to include as an rda file in an R package. When I run R CMD check on the package, I get a warning:

Warning: found non-ASCII strings

which is blocking it being allowed on CRAN.

There's a similar question about removing non-ASCII characters from data files, but I want to keep the non-ASCII characters.

You can grab the CSV data here. I'm reading it into R and resaving as rda with this code:

english_monarchs <- read.csv(
  wherever_you_downloaded_the_file_to, 
  fileEncoding     = "utf8",
  na.strings       = ""
)
save(english_monarchs, file = "english_monarchs.rda")

It's the name column of the dataset that contains non-ASCII values.

head(levels(english_monarchs$name))
## [1] "Adda"                                "Æðelbehrt"                          
## [3] "Æðelberht I"                         "Æðelberht II and Eardwulf"          
## [5] "Æðelberht II, Ælfric and Eadberht I" "Æðelberht III"

Based upon the (not very clear) guidance in the Encoding Issues section of Writing R Extensions I think I ought to be encoding the factor levels as UTF-8, but the obvious method doesn't work:

Encoding(levels(english_monarchs$name)) <- "utf8"  #each encoding still "unknown"

How can I make the data portable enough to be accepted on CRAN?

Stores answered 16/9, 2013 at 21:46 Comment(6)
Not sure it makes any difference, but isn't it supposed to be "UTF-8"? - Megavolt
@JoshuaUlrich R understands the encoding with or without the dash. iconvlist() contains both strings. - Stores
Odd, because some encodings are changed when I use "UTF-8" on 64-bit Ubuntu and Windows 7. - Megavolt
From ?Encoding: Character strings in R can be declared to be in ‘"latin1"’ or ‘"UTF-8"’ or ‘"bytes"’. You can't label strings with arbitrary encodings, as from iconv. - Composer
OK, UTF-8 does correctly change encodings on my machine too. If you want to write it up as an answer, I'll accept it. Thanks. - Stores
This looks like a cool package -- what is it? - Swaine
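
The point settled in the comments can be checked directly. This is a minimal sketch, not part of the original thread: the sample string `x` is a hypothetical value built from raw bytes, purely so that it starts out with no declared encoding.

```r
# Hypothetical sample: the UTF-8 bytes for "Æð", built from raw bytes so the
# string starts out with no declared encoding.
x <- rawToChar(as.raw(c(0xc3, 0x86, 0xc3, 0xb0)))
Encoding(x)                          # "unknown"

# "utf8" is not one of the labels that Encoding<- recognises
# ("latin1", "UTF-8", "bytes"), so the string stays unmarked.
Encoding(x) <- "utf8"
label_without_dash <- Encoding(x)    # "unknown"

# "UTF-8" is recognised, so the string is now marked.
Encoding(x) <- "UTF-8"
label_with_dash <- Encoding(x)       # "UTF-8"
```

So iconv() accepting both spellings says nothing about which labels Encoding<- will accept.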

The thing that worked for me was to declare the encoding as "latin1", and then use iconv to convert to UTF-8.

Encoding(levels(english_monarchs$name)) <- "latin1"
levels(english_monarchs$name) <- iconv(
  levels(english_monarchs$name), 
  "latin1", 
  "UTF-8"
)
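
To see what these two steps do, here is a small sketch. It assumes the strings really hold Latin-1 bytes, as can happen when the file is read on a system whose native encoding is Latin-1-like. The vector `x` is a made-up stand-in for the factor levels, not the real data.

```r
# Hypothetical stand-in for the factor levels: "Æð" as Latin-1 bytes,
# with no declared encoding.
x <- rawToChar(as.raw(c(0xc6, 0xf0)))

# Declare what the bytes actually are, then convert them to UTF-8.
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")

Encoding(y)     # "UTF-8"
charToRaw(y)    # c3 86 c3 b0
validUTF8(y)    # TRUE
```

If the bytes are already UTF-8 and merely unlabelled, converting them from Latin-1 would double-encode them. In that case, marking them with Encoding(x) <- "UTF-8" alone should be enough.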
Stores answered 20/10, 2013 at 17:36 Comment(2)
Seems like I'm hitting every one of your CRAN questions tonight! Have all the same issues! - Epidemiology
...or "latin2" ;) - Towering
