R: can't read unicode text files even when specifying the encoding
I'm using R 3.1.1 on 32-bit Windows 7. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)

The problem is that I can't seem to read the file even when specifying that encoding. (The characters are from the standard Spanish Latin set (ñ, á, ó) and should be handled easily by CP1252 or anything like that.)

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
 [1] "ÿþE" ""    ""    ""    ""   ...
> readLines("filename.txt",encoding="UTF-8")
 [1] "\xff\xfeE" ""          ""          ""          ""    ...
> readLines("filename.txt",encoding="UCS2LE")
 [1] "ÿþE" ""    ""    ""    ""    ""    ""     ...
> readLines("filename.txt",encoding="UCS2")
 [1] "ÿþE" ""    ""    ""    ""    ...

Any ideas?

Thanks!!


edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly

Standard answered 10/10, 2014 at 18:34 Comment(3)
'\xff\xfe' is the UTF-16LE encoding of the byte order mark (BOM) character. Decoding with UTF-8 should fail as FFh is an invalid start byte, but I'm not familiar with R.Frigid
I've had similar struggles with encoding. Had more success with scan than I did readLines. Try scan("filename.txt", fileEncoding="UCS-2LE", sep="\n")Accentuate
Thanks for answering. I think I should report this as a bug, right? scan does read the file (and I don't understand the difference between the fileEncoding and encoding params), but it creates other problems. First, it only accepts single-byte separators, and if you pass an unusable separator it falls back to space. Also, it strips the \r\n that I need to preserve. And finally, for some reason paste fails to concatenate the strings (it just returns the original vector).Standard
After reading the documentation more closely, I found the answer to my question.

The encoding parameter of readLines only marks how the input strings should be interpreted; it does not re-encode them. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also ‘Details’.

The proper way of reading a file with an uncommon encoding is, then, to specify the encoding on the connection itself:

con <- file("UnicodeFile.txt", encoding = "UCS-2LE")
filetext <- readLines(con)
close(con)
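For completeness, here is a self-contained sketch of the same pattern using a temporary file (the round trip and the "UTF-16LE" encoding name passed to iconv are assumptions; on some platforms you may need "UCS-2LE" instead): write a few Spanish strings out in UTF-16LE, then read them back through a connection that declares the encoding.

```r
# Write a small UTF-16LE test file, then read it back through
# a connection that declares the encoding (rather than using
# readLines()'s encoding argument, which only marks strings).
tmp <- tempfile(fileext = ".txt")

out <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(c("España", "mañana", "canción"), out)
close(out)

con <- file(tmp, encoding = "UTF-16LE")
filetext <- readLines(con)
close(con)

print(filetext)  # the Spanish strings, correctly decoded
```

Note that the connection is assigned to a variable before reading so it can be closed reliably; if readLines() is called on an inline `file(...)` expression that errors, the connection is left open.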
Standard answered 14/10, 2014 at 13:09 Comment(1)
Thanks, this worked for me. I used: hht9aa <- read.csv(file("hht9aa_aa.txt", encoding = "UCS-2LE")) and finally got it to read UTF-16 Little Endian files correctly. But I did not have to close(con); in fact I got an error when I did, and eventually left it out.Brittneybrittni