UTF-8 file output in R

I'm using R 2.15.0 on Windows 7 64-bit. I would like to write Unicode (CJK) text to a file.

The following code shows how a Unicode character sent to write on a UTF-8 file connection does not work as (I) expected:

rty <- file("test.txt",encoding="UTF-8")
write("在", file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
scan(rty,what=character())
close(rty)

As shown by the output of scan:

Read 1 item 
[1] "<U+5728>"

The file was not written with the UTF character itself, but some kind of ANSI-compliant fallback. Can I make it work right the first time (i.e. with a text file that has "在" in it instead), or can I work some extra magic to convert the output to Unicode with the proper character replacing the code string?

Thanks.

[More info: the same code behaves properly in Cygwin, R 2.14.2, while 2.14.2 on Win7 is also broken. Is this on my end somewhere?]

Hercules answered 20/5, 2012 at 16:56 Comment(2)
[Belated update] The issues tend to be with locale rather than encoding. I have resolved gibberish output issues by temporarily changing locale to something "appropriate." God help you if you have language data from more than one locale. - Hercules
Maybe this post will help. - Sloat
F
24

The problem is due to Windows-specific behaviour in R: output is converted to the default system (native) encoding, or goes through system write functions. I do not know the specifics, but the behaviour is known.
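To check whether the native encoding is involved, it can help to look at the session locale and at how R has marked the string. A minimal sketch using only base R; the variable names are illustrative and not part of the original answer:

```r
# Which locale and native encoding is this session using?
Sys.getlocale("LC_CTYPE")
info <- l10n_info()        # list with elements such as MBCS, `UTF-8`, `Latin-1`
info[["UTF-8"]]            # TRUE if the native encoding is already UTF-8

# "\u5728" is the character from the question, written as an escape so the
# example does not depend on the encoding of the script file.
txt <- "\u5728"
Encoding(txt)              # declared encoding of the string
charToRaw(enc2utf8(txt))   # the bytes a UTF-8 file should contain: e5 9c a8
```

If `info[["UTF-8"]]` is FALSE, any output path that translates strings to the native encoding can lose characters that encoding cannot represent.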

To write UTF-8 text on Windows, pass useBytes=TRUE to writeLines, and declare the encoding when reading the file back with readLines:

txt <- "在"
# useBytes=TRUE writes the string's bytes as-is instead of re-encoding them
writeLines(txt, "test.txt", useBytes=TRUE)

readLines("test.txt", encoding="UTF-8")
# [1] "在"

For much more detail, see this well-written article by Kevin Ushey: http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
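One way to confirm that the file really contains UTF-8 is to inspect its raw bytes. The sketch below wraps the approach in a helper; `write_utf8` is a hypothetical name, and the added `enc2utf8()` call is an assumption on my part to cover strings that are still in the native encoding:

```r
# Hypothetical helper: write a character vector as UTF-8 regardless of locale.
write_utf8 <- function(lines, path) {
  # enc2utf8() converts to UTF-8 and marks the result; useBytes = TRUE then
  # tells writeLines() to emit those bytes without re-encoding them.
  writeLines(enc2utf8(lines), path, useBytes = TRUE)
}

path <- tempfile(fileext = ".txt")
write_utf8("\u5728", path)

# U+5728 is e5 9c a8 in UTF-8; the line ending follows those three bytes.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

# Read back, declaring the encoding so the string is marked as UTF-8.
back <- readLines(path, encoding = "UTF-8")
identical(back, "\u5728")
```

If the first three bytes are not e5 9c a8, the string was converted somewhere before it reached the file.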

Fantinlatour answered 26/8, 2014 at 13:25 Comment(1)
Thanks! This worked for me. Lots of incomplete advice out there. - Forensic
S
9

For anyone coming upon this question later, see the stringi package (https://cran.r-project.org/web/packages/stringi/index.html). It includes numerous functions to enable consistent, cross-platform UTF-8 string support in R. Most relevant to this thread, the stri_read_lines(), stri_read_raw(), and stri_write_lines() functions can consistently input/output UTF-8, even on Windows.
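A minimal round-trip sketch with those functions, assuming stringi is installed and that the temporary file path is acceptable for the test; the encoding is passed explicitly rather than relying on defaults:

```r
# Assumes the stringi package is installed: install.packages("stringi")
library(stringi)

path <- tempfile(fileext = ".txt")
txt  <- c("\u5728", "plain ASCII line")

# Write the lines as UTF-8, then read them back declaring the same encoding.
stri_write_lines(txt, path, encoding = "UTF-8")
back <- stri_read_lines(path, encoding = "UTF-8")

identical(back, txt)
```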

Satterfield answered 30/5, 2018 at 14:23 Comment(0)
O
8

This appends UTF-8 strings to a text file:

kLogFileName <- "parser.log"
log <- function(msg="") {
  con <- file(kLogFileName, "a")
  tryCatch({
    cat(iconv(msg, to="UTF-8"), file=con, sep="\n")
  },
  finally = {
    close(con)
  })
}
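If the function above still produces native-encoded output (cat() may translate marked strings back to the native encoding on some Windows setups), a variant that writes the bytes untouched may be worth trying. This is a sketch rather than a fix verified on every R version; `append_utf8` and its arguments are hypothetical names:

```r
# Hypothetical helper: append messages to a log file as UTF-8 bytes.
append_utf8 <- function(msg = "", path = "parser.log") {
  con <- file(path, open = "ab")   # binary append: the connection does not re-encode
  on.exit(close(con))              # always close, even on error
  writeLines(enc2utf8(msg), con, sep = "\n", useBytes = TRUE)
  invisible(path)
}

log_path <- tempfile(fileext = ".log")
append_utf8("first line: \u5728", log_path)
append_utf8("second line: \u00e0", log_path)
lines <- readLines(log_path, encoding = "UTF-8")
print(lines)
```

Opening the connection in binary mode also keeps the line ending as a plain "\n" on Windows.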
Ovarian answered 5/5, 2013 at 15:11 Comment(2)
Did this break in more recent R versions? When I write files this way, I still have to set the encoding parameter of readLines to "ANSI" to get the correct file content. For example, "à" comes out as "\xe0" under UTF-8 encoding, but reads correctly under ANSI encoding when using readLines on the created file. - Allomorph
@Curious - No, I ended up doing it manually using Notepad++. I only needed to do it once for the files in one dataset, and it was faster just to bite the bullet and do it manually than to keep messing with R file encodings. - Allomorph
V
0

I think you are having problems because write is constructed so that it takes the name of an object, and you do not appear to have built such a named object. Try this instead:

txt <- "在"
rty <- file("test.txt",encoding="UTF-8")
write(txt, file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
inp <- scan(rty,what=character())
#Read 1 item
close(rty)
inp
#[1] "在"
Viscosity answered 20/5, 2012 at 21:31 Comment(1)
Hm, the original application that inspired the minimal snippet above used named objects. Moreover, the code you provide produces the same result for me as above. Perhaps I have a native encoding issue? - Hercules
O
0

I have the same problem with UTF-8 strings that come from a database.

The only way I've found to save them properly is to write the file in binary mode.

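con <- file(file.name, "wb")   # binary mode: bytes are written unchanged
tryCatch({
  # charToRaw() expects a single string and returns its bytes
  writeBin(charToRaw(the_utf8_str), con)
},
finally = {
  close(con)
})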
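A self-contained version of the same idea, including reading the bytes back, might look like the sketch below. The helper names `write_utf8_bin` and `read_utf8_bin` are hypothetical, and the `enc2utf8()` call is an added assumption to guard against input that is not yet UTF-8:

```r
# Hypothetical helpers for a binary-mode round trip of a single UTF-8 string.
write_utf8_bin <- function(x, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  # enc2utf8() guards against input that is still in the native encoding.
  writeBin(charToRaw(enc2utf8(x)), con)
}

read_utf8_bin <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  out <- rawToChar(bytes)
  Encoding(out) <- "UTF-8"   # declare, not convert: the bytes are already UTF-8
  out
}

path <- tempfile(fileext = ".txt")
write_utf8_bin("\u5728\n", path)
round_trip <- read_utf8_bin(path)
identical(round_trip, "\u5728\n")
```

Because nothing in this path translates the string, the result does not depend on the session locale.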
Ovarian answered 21/4, 2013 at 10:41 Comment(0)
