I'm using R 3.1.1 on Windows 7 32-bit. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)
The problem is that I can't seem to read the file even when specifying that encoding. (The characters are from the standard Spanish Latin set, e.g. ñ, á, ó, and should be handled easily by CP1252 or anything like that.)
> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
[1] "ÿþE" "" "" "" "" ...
> readLines("filename.txt",encoding="UTF-8")
[1] "\xff\xfeE" "" "" "" "" ...
> readLines("filename.txt",encoding="UCS2LE")
[1] "ÿþE" "" "" "" "" "" "" ...
> readLines("filename.txt",encoding="UCS2")
[1] "ÿþE" "" "" "" "" ...
Any ideas?
Thanks!!
Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly.
'\xff\xfe' is the UTF-16LE encoding of the byte order mark (BOM) character. Decoding with UTF-8 should fail, as FFh is an invalid start byte, but I'm not familiar with R. – Frigid

I had better luck with scan than I did readLines. Try scan("filename.txt", fileEncoding="UCS-2LE", sep="\n") – Accentuate

scan does read the file (and I don't understand the difference between the fileEncoding and encoding params), but it creates other problems. First, it only takes one-byte separators, and if you use an absurd separator it falls back to space as the sep. Also, it strips the \r\n that I need to preserve. And finally, for some reason paste fails to concatenate the strings (it just returns the original vector). – Standard