Making non-ASCII data suitable for CRAN - McMap

Asked 16/9, 2013 at 21:46 Answered 20/10, 2013 at 17:36

Solved r portability cran

I have some data containing non-ASCII characters that I want to include as an rda file in an R package. When I run R CMD check on the package, I get a warning:

Warning: found non-ASCII strings

which is blocking it being allowed on CRAN.

There's a similar question about removing non-ASCII characters from data files, but I want to keep the non-ASCII characters.

You can grab the CSV data here. I'm reading it into R and resaving as rda with this code:

english_monarchs <- read.csv(
  wherever_you_downloaded_the_file_to,
  fileEncoding = "utf8",
  na.strings   = ""
)
# save() needs the output path passed as the named `file` argument
save(english_monarchs, file = "english_monarchs.rda")

It's the name column of the dataset that contains non-ASCII values.

head(levels(english_monarchs$name))
## [1] "Adda"                                "Æðelbehrt"                          
## [3] "Æðelberht I"                         "Æðelberht II and Eardwulf"          
## [5] "Æðelberht II, Ælfric and Eadberht I" "Æðelberht III"

Based upon the (not very clear) guidance in the Encoding Issues section of Writing R Extensions, I think I ought to be encoding the factor levels as UTF-8, but the obvious method doesn't work:

Encoding(levels(english_monarchs$name)) <- "utf8"  #each encoding still "unknown"

How can I make the data portable enough to be accepted on CRAN?

Stores asked 16/9, 2013 at 21:46 Comment(6)

Not sure it makes any difference, but isn't it supposed to be "UTF-8"? – Megavolt 16/9, 2013 at 21:58

@JoshuaUlrich R understands the encoding with or without the dash. iconvlist() contains both strings. – Stores 17/9, 2013 at 6:20

Odd, because some encodings are changed when I use "UTF-8" on 64-bit Ubuntu and Windows 7. – Megavolt 17/9, 2013 at 12:13

From ?Encoding: Character strings in R can be declared to be in ‘"latin1"’ or ‘"UTF-8"’ or ‘"bytes"’. You can't label strings with arbitrary encodings, as from iconv. – Composer 17/9, 2013 at 20:46

OK, UTF-8 does correctly change encodings on my machine too. If you want to write it up as an answer, I'll accept it. Thanks. – Stores 18/9, 2013 at 13:25

This looks like a cool package -- what is it? – Swaine 10/3, 2016 at 15:28
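The point made in the comments above can be checked in isolation. This is only a small sketch, not part of the original thread: the string `x` is a made-up stand-in for one factor level, built from raw UTF-8 bytes so that it starts with no declared encoding. It suggests that `Encoding<-` silently treats any label other than "latin1", "UTF-8" or "bytes" as "unknown", which would explain why "utf8" appeared to do nothing.

```r
# Hypothetical test string: "Æðelberht" as UTF-8 bytes (Æ = C3 86, ð = C3 B0).
# rawToChar() does not declare an encoding, so the string starts as "unknown".
x <- rawToChar(c(as.raw(c(0xC3, 0x86, 0xC3, 0xB0)), charToRaw("elberht")))
Encoding(x)                      # "unknown"

# "utf8" is not one of the labels Encoding<- recognises, so nothing is marked.
Encoding(x) <- "utf8"
enc_after_utf8 <- Encoding(x)    # "unknown"

# The exact spelling "UTF-8" is recognised.
Encoding(x) <- "UTF-8"
enc_after_UTF8 <- Encoding(x)    # "UTF-8"

# Pure ASCII strings are never marked, whatever label is requested.
a <- "Adda"
Encoding(a) <- "UTF-8"
enc_ascii <- Encoding(a)         # "unknown"
```

That last case may also be why only some of the encodings change when "UTF-8" is used on a whole vector of levels.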

The thing that worked for me was to declare the encoding as "latin1", and then use iconv to convert to UTF-8.

Encoding(levels(english_monarchs$name)) <- "latin1"
levels(english_monarchs$name) <- iconv(
  levels(english_monarchs$name), 
  "latin1", 
  "UTF-8"
)
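For anyone wanting to confirm what those two steps do, here is a minimal sketch on a made-up vector `lv` standing in for `levels(english_monarchs$name)`. It assumes the stored bytes really are Latin-1, as they would be after reading the file on a Western European Windows locale. If the bytes are already UTF-8 (for example after reading on a UTF-8 locale), declaring them "latin1" and converting would double-encode them, and marking them with `Encoding(x) <- "UTF-8"` would be the thing to try instead.

```r
# Hypothetical stand-in for the factor levels: "Æðelberht" stored as Latin-1
# bytes (Æ = 0xC6, ð = 0xF0) with no declared encoding, plus one ASCII name.
lv <- c("Adda", rawToChar(c(as.raw(c(0xC6, 0xF0)), charToRaw("elberht"))))
Encoding(lv)                       # "unknown" "unknown"

# Step 1: tell R how to interpret the existing bytes.
Encoding(lv) <- "latin1"

# Step 2: re-encode to UTF-8; iconv() marks the non-ASCII result as UTF-8.
lv_utf8 <- iconv(lv, "latin1", "UTF-8")
Encoding(lv_utf8)                  # "unknown" "UTF-8"

# Sanity checks: every element is valid UTF-8, and the two Latin-1 bytes
# have become the two-byte UTF-8 sequences C3 86 and C3 B0.
validUTF8(lv_utf8)                 # TRUE TRUE
charToRaw(lv_utf8[2])[1:4]         # c3 86 c3 b0
```

The ASCII level stays "unknown" after conversion, which is expected and not something R CMD check complains about.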

Stores answered 20/10, 2013 at 17:36 Comment(2)

seems like I'm hitting every one of your CRAN questions tonight! Have all the same issues! – Epidemiology 15/5, 2016 at 2:55

...or "latin2" ;) – Towering 9/4, 2017 at 16:47
