Encoding issue with write.xlsx (openxlsx)

I use the write.xlsx() function (from the openxlsx package) to turn a list object into an Excel spreadsheet, where each element of the list is converted into a "sheet" of the Excel file. In the past, this function has been incredibly useful, and I have never encountered any issues. It is my understanding that this package, and this function in particular, does not require any particular Java update on the computer in order to work.
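
For context, the normal usage looks something like this (a self-contained toy example; the sheet names and data are made up):

library(openxlsx)

# Each named list element becomes its own sheet in the workbook
sheets <- list(cars = head(mtcars), flowers = head(iris))
write.xlsx(sheets, "report.xlsx")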

However, I recently discovered that the function produces an error. This is what appears in the console when I run write.xlsx() on the list:

Error in gsub("&", "&amp;", v, fixed = TRUE) : 
  input string 5107 is invalid UTF-8

I've identified the data frames that are the cause of the issue, but I am not sure how to identify which part of each data frame is causing the error.

I even went ahead and used the enc2utf8() function on all of the columns of this particular data frame, but I still encounter the error. I've also used the substr() function on the data frame itself to look at the first n characters of each column, though I do not see any obvious issues in the output.
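
In essence, this is roughly what I ran (df stands in for the problem data frame):

# Convert/declare every character column as UTF-8
df[] <- lapply(df, function(x) if (is.character(x)) enc2utf8(x) else x)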

I also went ahead and reinstalled the openxlsx package with install.packages(), in case there had been any updates.

Does anyone know how I would go about identifying the cause of the error? Is it the function as it is written in the package? If the problem is in the encoding of the data itself, why does enc2utf8() not suffice to resolve it?

Thanks!

Inquisition answered 12/9, 2018 at 16:24 Comment(3)
This isn't a java issue. There's a non-UTF8 character string somewhere in your data frame that's causing the error. I'm not sure how to determine which specific row and column has the offending string, but I've found that the iconv function resolves these issues when writing data frames with openxlsx::write.xlsx. For example, you can run the following on the character columns in your data frame: dat$col = iconv(dat$col, to="UTF-8").Edible
If you need to do this for a list of data frames, you can use lapply to iterate over the list, and within each data frame use lapply again to run iconv on each of the character columns (a sketch follows these comments).Edible
@Edible Thank you so much! My script finally works. Apparently, iconv() is the function that I needed. I read in all of my text files with sapply(..., iconv, to = "UTF-8"), so the lapply() step was not necessary. Issue resolved!Inquisition
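
A minimal sketch of the iconv() approach from the comments above (df_list stands in for the list of data frames; by default unconvertible strings become NA, and passing sub = "" would drop the offending bytes instead):

# Re-encode the character columns of every data frame in the list to UTF-8
df_list <- lapply(df_list, function(df) {
  df[] <- lapply(df, function(x) {
    if (is.character(x)) iconv(x, to = "UTF-8") else x
  })
  df
})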

I just had this same problem. Building on this question, you could strip all of the bad characters from the data frame with either of the following:

library(dplyr)

# Strip every character outside the printable ASCII range (space through tilde)
# from the character columns only
df %>%
  mutate_if(is.character, ~gsub('[^ -~]', '', .))

for only character columns, or:

# The same replacement applied to every column, regardless of type
df %>%
  mutate_all(~gsub('[^ -~]', '', .))

for all columns, and then export to XLSX with write.xlsx().

Hunsaker answered 14/9, 2018 at 13:35 Comment(1)
Odd to me that stringi::stri_enc_toutf8() does not work but simply removing these characters does.Embrocation

As far as finding the error goes, the number in the message points you to the problem (in your case, string 5107). It appears to count the strings that are written to the file. To find the particular data point that is the issue, this approach worked for me:

Let's assume our data frame has 20 variables and 10 of them are character type.

  • Subtract the number of variables if you are writing the column headers (because each header is a string): 5107 - 20 = 5087.
  • Divide the remainder by the number of character variables per observation (5087/10 = 508.7); that means the problem is in row 509 (because the headers plus the first 508 rows account for 20 + 5080 = 5100 strings).
  • The 7th character variable in the 509th row will be your problem child (see the helper below).
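
The same arithmetic as a small helper (assuming, as above, that the headers are written first and each character column contributes one string per row):

# Map the string index from the error message back to a row and character column
locate_bad_string <- function(idx, n_vars, n_char_vars) {
  idx <- idx - n_vars                    # skip the header strings
  row <- ceiling(idx / n_char_vars)      # data row containing the bad string
  col <- idx - (row - 1) * n_char_vars   # which character column within that row
  c(row = row, char_col = col)
}

locate_bad_string(5107, n_vars = 20, n_char_vars = 10)  # -> row 509, character column 7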
Ioannina answered 9/1, 2020 at 12:9 Comment(0)
