I have some data containing non-ASCII characters that I want to include as an rda file in an R package. When I run R CMD check on the package, I get a warning:
Warning: found non-ASCII strings
which is blocking the package from being accepted on CRAN.
There's a similar question about removing non-ASCII characters from data files, but I want to keep the non-ASCII characters.
You can grab the CSV data here. I'm reading it into R and resaving it as rda with this code:
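english_monarchs <- read.csv(
  wherever_you_downloaded_the_file_to,
  fileEncoding = "utf8",
  na.strings = "",
  stringsAsFactors = TRUE  # the default before R 4.0.0; keeps name a factor
)
# save() needs the file argument named, and the target is an rda file
save(english_monarchs, file = "english_monarchs.rda")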
It's the name column of the dataset that contains non-ASCII values.
head(levels(english_monarchs$name))
## [1] "Adda" "Æðelbehrt"
## [3] "Æðelberht I" "Æðelberht II and Eardwulf"
## [5] "Æðelberht II, Ælfric and Eadberht I" "Æðelberht III"
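To see which levels are affected and how R currently labels them, something like the following may help. This is only a sketch: lv is a hypothetical stand-in for levels(english_monarchs$name), and the byte test assumes that "non-ASCII" means any byte above 0x7f.

```r
# Hypothetical stand-in for levels(english_monarchs$name); the \u escapes
# keep this snippet itself ASCII-only.
lv <- c("Adda", "\u00c6\u00f0elbehrt", "\u00c6\u00f0elberht I")

# A string counts as non-ASCII here if any of its bytes is above 0x7f.
has_non_ascii <- vapply(
  lv,
  function(s) any(charToRaw(s) > as.raw(0x7f)),
  logical(1),
  USE.NAMES = FALSE
)

# Show each level next to the flag and its declared encoding.
print(data.frame(level = lv, non_ascii = has_non_ascii, declared = Encoding(lv)))
```

The declared column is what Encoding() reports for each string; the non_ascii column depends only on the bytes, not on the label.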
Based upon the (not very clear) guidance in the Encoding Issues section of Writing R Extensions, I think I ought to be encoding the factor levels as UTF-8, but the obvious method doesn't work:
Encoding(levels(english_monarchs$name)) <- "utf8" #each encoding still "unknown"
How can I make the data portable enough to be accepted on CRAN?
"UTF-8"? – Megavolt
iconvlist() contains both strings. – Stores
"UTF-8" on 64-bit Ubuntu and Windows 7. – Megavolt
?Encoding: Character strings in R can be declared to be in ‘"latin1"’ or ‘"UTF-8"’ or ‘"bytes"’. You can't label strings with arbitrary encodings, as from iconv. – Composer
UTF-8 does correctly change encodings on my machine too. If you want to write it up as an answer, I'll accept it. Thanks. – Stores
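For anyone landing here later, a minimal sketch of what the comments point at: Encoding<- only recognises the labels "latin1", "UTF-8" and "bytes", so "utf8" is silently treated as unknown. The vector lv is a hypothetical stand-in for levels(english_monarchs$name), and the sketch assumes the underlying bytes really are UTF-8.

```r
# Hypothetical stand-in for levels(english_monarchs$name); the \u escapes
# keep this snippet itself ASCII-only.
lv <- c("Adda", "\u00c6\u00f0elbehrt", "\u00c6\u00f0elberht I")

# Drop the automatic UTF-8 mark so the strings start out as "unknown",
# as in the question. The bytes themselves are not changed.
Encoding(lv) <- "unknown"

# "utf8" is not a label that Encoding<- recognises, so nothing gets marked.
Encoding(lv) <- "utf8"
enc_wrong_label <- Encoding(lv)

# "UTF-8" is recognised. Pure-ASCII strings such as "Adda" are never marked.
Encoding(lv) <- "UTF-8"
enc_right_label <- Encoding(lv)

print(enc_wrong_label)  # "unknown" "unknown" "unknown"
print(enc_right_label)  # "unknown" "UTF-8"   "UTF-8"
```

On the real data the equivalent would presumably be Encoding(levels(english_monarchs$name)) <- "UTF-8" before calling save(). Note that this only relabels the strings; it does not convert them, so it is only appropriate when the bytes are already UTF-8.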