Handling java.lang.OutOfMemoryError when writing to Excel from R
The xlsx package can be used to read and write Excel spreadsheets from R. Unfortunately, even for moderately large spreadsheets, java.lang.OutOfMemoryError can occur. In particular,

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
java.lang.OutOfMemoryError: GC overhead limit exceeded

(Other related exceptions are also possible but rarer.)

A similar question was asked regarding this error when reading spreadsheets.

Importing a big xlsx file into R?

The main advantage of using Excel spreadsheets as a data storage medium over CSV is that you can store multiple sheets in the same file, so here we consider a list of data frames to be written one data frame per worksheet. This example dataset contains 40 data frames, each with two columns of up to 200k rows. It is designed to be big enough to be problematic, but you can change the size by altering n_sheets and n_rows.

library(xlsx)
set.seed(19790801)
n_sheets <- 40
the_data <- replicate(
  n_sheets,
  {
    n_rows <- sample(2e5, 1)
    data.frame(
      x = runif(n_rows),
      y = sample(letters, n_rows, replace = TRUE)
    )
  },
  simplify = FALSE
)
names(the_data) <- paste("Sheet", seq_len(n_sheets))

The natural method of writing this to file is to create a workbook using createWorkbook, then loop over each data frame calling createSheet and addDataFrame. Finally the workbook can be written to file using saveWorkbook. I've added messages to the loop to make it easier to see where it falls over.

wb <- createWorkbook()
for(i in seq_along(the_data))
{
  # Trailing space so the message reads "Creating sheet 1", not "Creating sheet1"
  message("Creating sheet ", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame ", i)
  addDataFrame(the_data[[i]], sheet)
}
saveWorkbook(wb, "test.xlsx")

Running this in 64-bit on a machine with 8GB RAM, it throws the GC overhead limit exceeded error while running addDataFrame for the first time.

How do I write large datasets to Excel spreadsheets using xlsx?

Unopened answered 21/2, 2014 at 14:52 Comment(0)
U
90

This is a known issue: http://code.google.com/p/rexcel/issues/detail?id=33

Although the issue is unresolved, its page links to a solution by Gabor Grothendieck suggesting that the heap size should be increased by setting the java.parameters option before the rJava package is loaded. (rJava is a dependency of xlsx.)

options(java.parameters = "-Xmx1000m")

The value 1000 is the number of megabytes of RAM to allow for the Java heap; it can be replaced with any value you like. My experiments with this suggest that bigger values are better, and you can happily use your full RAM entitlement. For example, I got the best results using:

options(java.parameters = "-Xmx8000m")

on the machine with 8GB RAM.
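
To confirm that the setting actually took effect, one option is to ask the JVM for its maximum heap size. The following is a minimal sketch, assuming rJava is installed and that no JVM has been started yet in the current R session (otherwise the option is ignored); the variable names runtime and max_heap_mb are only illustrative. The reported figure is often somewhat below the requested value.

```r
# Set the heap limit first; it is only read when the JVM starts.
options(java.parameters = "-Xmx1000m")

library(rJava)
.jinit()  # starts the JVM if it is not already running, using java.parameters

# Ask the JVM for its maximum heap size in bytes and convert to megabytes.
runtime <- .jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
max_heap_mb <- .jcall(runtime, "J", "maxMemory") / 1024^2
message("JVM maximum heap: ", round(max_heap_mb), " MB")
```

If the number printed is much smaller than expected, the JVM was probably started before the option was set, and R needs to be restarted.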

A further improvement can be obtained by requesting a garbage collection in each iteration of the loop. As noted by @gjabel, R garbage collection can be performed using gc(). We can define a Java garbage collection function that calls the Java System.gc() method:

jgc <- function()
{
  .jcall("java/lang/System", method = "gc")
}

Then the loop can be updated to:

for(i in seq_along(the_data))
{
  gc()   # R garbage collection
  jgc()  # Java garbage collection
  message("Creating sheet ", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame ", i)
  addDataFrame(the_data[[i]], sheet)
}

With both these code fixes, the code ran as far as i = 29 before throwing an error.

One technique that I tried unsuccessfully was to use write.xlsx2 to write the contents to file at each iteration. This was slower than the other code, and it fell over on the 10th iteration (but at least part of the contents were written to file).

for(i in seq_along(the_data))
{
  message("Writing sheet ", i)
  write.xlsx2(
    the_data[[i]],
    "test.xlsx",
    sheetName = names(the_data)[i],
    append    = i > 1  # create the file on the first pass, append afterwards
  )
}
Unopened answered 21/2, 2014 at 14:53 Comment(6)
This whole problem can now be sidestepped by swapping the xlsx package for the openxlsx package, which depends on Rcpp rather than Java. - Unopened
readxl is another new C/C++ alternative that looks promising. - Unopened
Unfortunately I've found both of those are quite junk for detecting and reading dates; both end up in the incorrigible mess that is the Excel date format :\ - Laaland
@RichieCotton, nice alternative. However, openxlsx cannot read .xls or .xlm files! (2007 excel file format). - Swordcraft
Calling options(java.parameters = "-Xmx8000m") before loading rJava, xlsxjars, and xlsx solved this error: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException Calls: getNetwork ... <Anonymous> -> .jrcall -> .jcall -> .jcheck -> .Call Execution halted. Environment: RHEL 6.3 x86_64, Java 1.7.0_79 (Oracle), rJava_0.9-7, xlsxjars_0.6.0, xlsx_0.5.7. - Wire
@RichieCotton also openxlsx and readxl don't have options for reading password-protected .xlsx files. - Malediction
L
8

Building on @richie-cotton's answer, I found that adding gc() to the jgc function kept the CPU usage low.

jgc <- function()
{
  gc()
  .jcall("java/lang/System", method = "gc")
}

My previous for loop still struggled with the original jgc function, but with the extra gc() call I no longer run into the GC overhead limit exceeded error.

Lorrimor answered 14/1, 2016 at 11:39 Comment(0)
M
2

To avoid the above error, use the following R code:

detach(package:xlsx)
detach(package:XLConnect)
library(openxlsx)

Then try to import the file again. You should not get any error; this worked for me.
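
For the writing case discussed in the question, openxlsx can write a named list of data frames with one worksheet per list element. This is only a rough sketch with a tiny made-up dataset: the names sheets and out_file are illustrative, and the file is written to a temporary location. The calls are namespace-qualified because both xlsx and openxlsx export a function named write.xlsx.

```r
library(openxlsx)

# A small stand-in for the list of data frames from the question.
sheets <- list(
  "Sheet 1" = data.frame(x = runif(5), y = sample(letters, 5, replace = TRUE)),
  "Sheet 2" = data.frame(x = runif(3), y = sample(letters, 3, replace = TRUE))
)

# Write every list element to its own worksheet; the list names become sheet names.
out_file <- tempfile(fileext = ".xlsx")
openxlsx::write.xlsx(sheets, out_file)

# List the worksheets that were created.
sheet_names <- openxlsx::getSheetNames(out_file)
print(sheet_names)
```

No JVM is involved here, so the java.parameters setting is irrelevant to this route.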

Mown answered 11/10, 2017 at 11:23 Comment(1)
Two comments: XLConnect has the same problem. And more importantly, telling somebody to use a different library isn't a solution to the problem with the one being referenced. The goal here is to stay within the xlsx package. There are other threads devoted to XLConnect. - Dendrochronology
H
0

Restart R and, before loading the R packages, insert:

options(java.parameters = "-Xmx2048m")

or

options(java.parameters = "-Xmx8000m")
Hyacinthe answered 16/12, 2021 at 14:6 Comment(0)
T
-1

You can also call gc() inside the loop if you are writing row by row. gc() triggers R's garbage collector, which can help when R itself is running short of memory.

Trackman answered 22/8, 2017 at 1:29 Comment(0)
G
-1

I was having issues with write.xlsx() rather than with reading, but then realised that I had accidentally been running 32-bit R. Switching to 64-bit R fixed the issue.
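
A quick way to check which build of R is running, using only base R, is to look at the pointer size. This is a small sketch; pointer_bits is just an illustrative variable name.

```r
# Pointers are 8 bytes in a 64-bit R process and 4 bytes in a 32-bit one.
pointer_bits <- .Machine$sizeof.pointer * 8
message("R is running as a ", pointer_bits, "-bit process (", R.version$arch, ")")
```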

Gallivant answered 10/1, 2020 at 14:55 Comment(0)
