Export UTF-8 BOM to .csv in R

Asked 13/9, 2011 at 13:2 Answered 31/12, 2016 at 11:56

Solved r utf-8 byte-order-mark export-to-csv

I am reading data through RJDBC from a MySQL database, and R displays all letters correctly (e.g., נווה שאנן). However, even when I export it using write.csv with fileEncoding="UTF-8", the output looks like <U+0436>.<U+043A>. <U+041B><U+043E><U+0437><U+0435><U+043D><U+0435><U+0446> (this example is a Bulgarian string, not the Hebrew one above). The same happens for Bulgarian, Hebrew, Chinese and so on. Other special characters such as ã and ç work fine.

I suspect this is because of the UTF-8 BOM, but I did not find a solution online.

My OS is a German Windows 7.

edit: I tried

con<-file("file.csv",encoding="UTF-8")
write.csv(x,con,row.names=FALSE)

and the (afaik) equivalent write.csv(x, file="file.csv",fileEncoding="UTF-8",row.names=FALSE).

Shornick answered 13/9, 2011 at 13:2 Comment(3)

Are you saying that when you open the exported file, you see "U+0436" instead of "ж"? If so that's no BOM issue, just an issue of the Unicode code points not being encoded into a UTF encoding, but output as code points. Maybe show us some code how exactly you're exporting the file? – Polymorphism 13/9, 2011 at 13:29

I added information on how I exported the file. And yes, I see "<U+0436>" instead of "ж" – Shornick 14/9, 2011 at 8:29

Seeing "<U+0436>" in the file is ambiguous (it could mean that those characters are literally written into the file, or that your editor just cannot display them). You could either write the "ж" to a file and tell us the hex values of all the bytes the generated file contains (open it in a hex editor), or give us code that reproduces your problem (of course we don't have your DB, so create a vector with the sample data). – Slave 14/9, 2011 at 10:19
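
The check suggested in the comments above can also be done from R without a hex editor. The following is only a minimal sketch: the temporary file and the variable names (sample_path, zhe, file_bytes) are made up for illustration and are not from the original post.

```r
# Hypothetical temp file, used only for this illustration.
sample_path <- tempfile(fileext = ".txt")

# "\u0436" is the Cyrillic letter "ж"; enc2utf8() ensures it is held as UTF-8.
zhe <- enc2utf8("\u0436")

# Open in binary mode and pass useBytes = TRUE so the string's bytes are
# written as they are, without re-encoding to the session's native encoding.
con <- file(sample_path, open = "wb")
writeLines(zhe, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes, as a hex editor would show them.
file_bytes <- readBin(sample_path, what = "raw", n = file.size(sample_path))
print(file_bytes)
# [1] d0 b6 0a
```

Here d0 b6 is the UTF-8 encoding of "ж" and 0a is the trailing newline. If the file instead contained the literal text <U+0436>, the bytes would be 3c 55 2b 30 34 33 36 3e, which means the character was escaped before it was written.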

The help page for Encoding (help("Encoding")) describes a special encoding, "bytes".

Using it, I was able to generate the CSV file with:

v <- "נווה שאנן"
X <- data.frame(v1=rep(v,3), v2=LETTERS[1:3], v3=0, stringsAsFactors=FALSE)

Encoding(X$v1) <- "bytes"
write.csv(X, "test.csv", row.names=FALSE)

Mind the difference between factor and character columns. The following should work:

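# any() collapses the per-element comparison to a single TRUE/FALSE,
# which && requires (a longer vector is an error in recent R versions)
id_characters <- which(sapply(X,
    function(x) is.character(x) && any(Encoding(x) == "UTF-8")))
for (i in id_characters) Encoding(X[[i]]) <- "bytes"

# For factors, the text lives in the levels
id_factors <- which(sapply(X,
    function(x) is.factor(x) && any(Encoding(levels(x)) == "UTF-8")))
for (i in id_factors) Encoding(levels(X[[i]])) <- "bytes"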

write.csv(X, "test.csv", row.names=FALSE)

Sarilda answered 14/9, 2011 at 12:12 Comment(0)

The accepted answer did not help me in a similar application (R 3.1 on Windows, where I was trying to open the file in Excel). Anyway, based on this part of the documentation for file:

If a BOM is required (it is not recommended) when writing it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xbf)), binary_con)
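
As a rough illustration of the writeBin variant mentioned in that quote, the sketch below prepends the three BOM bytes to a file that has already been written. The helper name add_utf8_bom and the sample data are hypothetical, not part of R or of the original answer. It also assumes the CSV was written as valid UTF-8 in the first place: it only adds the BOM and does not repair text that was already escaped as <U+XXXX>.

```r
# Hypothetical helper (not part of R): prepend a UTF-8 BOM to an existing
# file unless the file already starts with one.
add_utf8_bom <- function(path) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  bytes <- readBin(path, what = "raw", n = file.size(path))
  has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)
  if (!has_bom) {
    # writeBin() accepts a file name and opens it in binary mode itself
    writeBin(c(bom, bytes), path)
  }
  invisible(path)
}

# Usage: write the CSV as UTF-8 first, then add the BOM.
df <- data.frame(v1 = enc2utf8("\u0436"), v2 = "A", stringsAsFactors = FALSE)
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE, fileEncoding = "UTF-8")
add_utf8_bom(csv_path)

# The file should now start with the bytes ef bb bf.
first_bytes <- readBin(csv_path, what = "raw", n = 3)
print(first_bytes)
```

Working on raw bytes sidesteps any re-encoding by a text-mode connection, at the cost of reading the whole file into memory once.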

I came up with the following workaround:

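write.csv.utf8.BOM <- function(df, filename)
{
    con <- file(filename, "w")
    tryCatch({
        # Convert text columns to UTF-8; other columns (e.g. numeric) are left as they are
        for (i in seq_len(ncol(df)))
            if (is.character(df[[i]]) || is.factor(df[[i]]))
                df[[i]] <- iconv(df[[i]], to = "UTF-8")
        # Write the BOM explicitly, as the documentation suggests, then the data
        writeChar(iconv("\ufeff", to = "UTF-8"), con, eos = NULL)
        write.csv(df, file = con)
    }, finally = {close(con)})
}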

Note that df is the data.frame and filename is the path to the csv file.

Momentum answered 31/12, 2016 at 11:56 Comment(2)

This is great. This should be the accepted answer (Windows 7, R version 3.4.2) – Rufescent 20/6, 2018 at 16:10

Still working fine on R 3.5.3. Just two small remarks: instead of the tryCatch() construct you could just use on.exit(close(con)). It might also be useful to pass fileEncoding = "utf-8" to write.csv() for best results. – Ptyalin 30/4, 2019 at 11:40
