Convert a file encoding using R? (ANSI to UTF-8)

2

10

I wish to convert an HTML file encoded in ANSI to UTF-8, using R.

Is there a tool, or a combination of tools, that can make this work?

Thanks.

Edit: OK, I've narrowed my problem down to a different one. It is re-posted here: Using "cat" to write non-English characters into a .html file (in R)

Dysteleology answered 20/9, 2011 at 7:52 Comment(0)
23

You can use iconv:

writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF-8"), "tmp2.html")

tmp2.html should then be encoded in UTF-8.


Edit by Henrik in June 2015:
A working solution for Windows distilled from the comments is as follows:

writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF-8"), 
           file("tmp2.html", encoding="UTF-8"))

Update 2021: If ANSI is the encoding of the current locale, the following works as well, because from = "" tells iconv to use the locale's encoding as the source:

writeLines(iconv(readLines("tmp.html"), from = "", to = "UTF-8"), 
           file("tmp2.html", encoding="UTF-8"))
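
Not part of the original answer: a minimal sketch for checking the result at the byte level, assuming the source file is actually CP1252 (as the comments below suggest). The temporary file names and the sample bytes are made up for illustration.

```r
# Hypothetical sample: the bytes below are "café\n" encoded in CP1252.
src <- tempfile(fileext = ".html")
dst <- tempfile(fileext = ".html")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), src)

# Read the lines as they are, then convert explicitly to UTF-8.
lines_in   <- readLines(src, warn = FALSE)
lines_utf8 <- iconv(lines_in, from = "CP1252", to = "UTF-8")

# useBytes = TRUE writes the UTF-8 bytes unchanged instead of
# translating them back to the native encoding on output.
writeLines(lines_utf8, dst, useBytes = TRUE)

# Inspect the raw bytes: the single byte e9 should now be c3 a9.
out_bytes <- readBin(dst, what = "raw", n = file.size(dst))
print(out_bytes)
```

Looking at the raw bytes avoids relying on an editor's guess about the encoding.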
Statocyst answered 20/9, 2011 at 8:33 Comment(7)
But what about the HTML headers? Shouldn't they be changed as well? - Destine
Thanks Kohske, but this doesn't work for me. It converts the text in the file in some odd way, but not the file itself. When I check the encoding in Notepad++, it is still ANSI, and only through Notepad++ can I change it to UTF-8 (your code won't do it). Any suggestions? :) - Dysteleology
How about changing to from = "CP1252"? - Statocyst
Kohske - this is indeed the correct encoding to use. But when I read the file into R, it interprets the text correctly. I'll try to update my question to explain better... - Dysteleology
@TalGalili You need to define the file connection with the proper encoding (see ?file). Something like f <- file("tmp2.html", encoding = "UTF-8") and then writeLines(....., f). - Destine
Thanks Marek. This looks to be in the right direction, but no success yet. Please continue this on the new thread I started (which has an updated question): #7484242 - Dysteleology
What does your test HTML file contain? From ?Encoding: "ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings." Also try useBytes = TRUE in the call to writeLines. - Hornbeck
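
A small sketch of the point quoted from ?Encoding in the last comment, not taken from the thread itself. A pure-ASCII string is byte-for-byte identical in CP1252 and UTF-8, so converting it changes nothing, and a file with only ASCII content can still be reported as "ANSI" by an editor. The sample strings are made up for illustration.

```r
# ASCII-only text: conversion leaves the bytes untouched and unmarked.
ascii_text <- "plain ascii"
converted  <- iconv(ascii_text, from = "CP1252", to = "UTF-8")
print(identical(charToRaw(converted), charToRaw(ascii_text)))
print(Encoding(converted))

# Text with a non-ASCII character ("café" as CP1252 bytes) is
# re-encoded and marked as UTF-8 by iconv.
cp1252_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
accented    <- iconv(cp1252_text, from = "CP1252", to = "UTF-8")
print(Encoding(accented))
print(charToRaw(accented))
```

If a test file contains no non-ASCII characters, it cannot show whether the conversion worked.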
0

I had some problems with the solutions proposed above, especially with the TAB character. This alternative has never let me down. Unfortunately, it only works on UNIX-like systems, because it calls the iconv command-line tool through the shell.

system('iconv -f CP1252 -t UTF-8 < tmp.html > tmp2.html')
Boor answered 9/11, 2018 at 14:40 Comment(0)
