Does POI XSSF still have crazy bad memory issues?
Asked Answered
R

1

11

A couple years ago, I ran into issues where I was creating large excel files using jXLS and POI XSSF. If my memory is correct, I think XSSF would create something like 1GB+ temp files on the disk to create 10mb excel files. So I stopped using jXLS and instead used SXSSF to create the excel files, but today I have new reasons to use jXLS or JETT.

Both jXLS and JETT websites seem to allude that performance is much better, but POI's XSSF website still says generically that the XSSF requires a higher memory footprint. I am wondering if this higher memory footprint is something like a reasonable 10% overhead these days, or if it is still like the 10,000% overhead as it was a couple years ago.

Are the crazy bad memory issues fixed with POI 3.9 XSSF? Should I not worry about using it with jXLS or JETT? Or are there certain gotchas to avoid? I am careful about reusing cell styles.

Repent answered 31/1, 2014 at 19:46 Comment(8)
This question feels very vague, but see for yourself what went into the release: poi.apache.org/changes.html. Looks like it wasn't explicitly addressed.Adai
No matter what you do, working with .xlsx will need more memory or temp files than .xls, as the overhead of all the XML and the compression are higher than the binary formatsLibb
@James: I don't know how to be less vague. Overhead of around 10% seems reasonable, overhead at 10,000% is not. I am trying to find out if the memory overhead is still around 10,000% like it used to be, or if it closer to 10%.Repent
@Gagravarr: .xls isn't capable of large excel files (it is limited to 65,536 rows), so it is not an option. Also, the memory footprint overhead using XSSFX is less that for HSSF, so it is not necessarily true that .xlsx will need more memory or temp files than .xls.Repent
@James: Sorry, I meant to also say that I've looked through that change document (and many other places). I've found nothing explicitly saying that the memory issue was fixed, but it is possible that there was a logic bug that was causing this issue that wasn't listed to the xml or memory. It is also possible that there was a bug in jxls that caused this (although I have heard from others that they turned away from XSSF because of memory issues unrelated to jxls). I was hoping others might know if this memory issue was gone so I wouldn't have to spend a lot of time testing.Repent
@Libb I've experienced that you can manage the memory usage with .xlsx file a lot more easily than with .xls. Also, for large files, xlsx use less memory when done properly.Declarative
Typing this in October of 2014, with POI 3.10 it still uses a huge amount of memory and it's still really slow.Plausible
I'm here in 2019 and having memory issues with XSSF with a file just short of 500000 rows with 12 columns. It's really embarrassing how slow Apache POI is and I'm actually running out of java heap spaceUnexpressive
C
10

To answer your question, yes, POI will always use very large amount of memory when working on large XLSX files, which is much larger than the size of the XLSX files. I don't think this will change anytime soon, and there are pretty obvious reasons for that: XLSX is basically a bunch of zipped XML files, and XML is very well compressed (around 10x). Getting this XML just to sit in memory uncompressed would already increase the memory consumption tenfold, so if you add all the overhead of data structures, there's no way you should expect a 10% increase in memory consumption over the XLSX file size.

Now, the good news is that as mentioned in the comments, Apache POI introduced SXSSF for streaming very large amount of data in a spreadsheet with very good performance and low memory usage. XLSX files generated this way are still streamed on the hard disk where they can end up taking quite a bit of space, but at least you don't risk OOME when writing hundreds of thousands of rows.

The problem for you is that you won't be able to get JETT to directly work with SXSSF, as it needs the whole document loaded in memory for performing template filling. JETT author quickly discussed this topic here.

I had the same problem, and ended up doing a two-step XLSX creation:

  1. A standard JETT XLSX template to generate headers and formatting. The last row of the first sheet contains cells with $$tokens$$, one per cell. I don't use JETT to insert the large amount of rows.

  2. Once JETT did its work, I reopen the workbook, read then delete the $$tokens$$ on the last line of the first spreadsheet, and start streaming data with SXSSF row by row.

Of course, there are limitations to this approach: - You cannot use JETT on any of the streamed rows during rows insertion (but you can before, to dynamically pick the order of the $$tokens$$ for example) - Cells format won't be copied unless you take care of it yourself with POI API. I personally prefer to format whole columns in my XLSX file, and it will apply to the streamed data.

This also works if you want to show charts using data inserted with SXSSF: You can define a Named Range with functions OFFSET and COUNTA, then create a Pivot table & Pivot Chart that will be refreshed when the XLSX is opened in Excel.

Christmastide answered 12/5, 2015 at 9:33 Comment(3)
I disagree that POI must use large amount of memory and/or enormous temp files. With the logic of your answer, then Excel, OpenOffice and LibreOffice would all suffer from the same memory issues.Repent
I think the real issue is that POI uses a poor internal data structure design-- I think in general, an xml file is 12x larger than it's zip. During report generation POI uses a temp file that is 1000x larger than the resulting xlsx-- so it's definitely not just compression that causing the problem.Repent
I got your point on Excel/Office/LibreOffice being able to load huge XLSX files without a problem, but I believe these softwares must use optimized internal structure to be able to efficiently store the data displayed in the application. POI on the other hand is a library that started with Interfaces for the XLS format, and kept the same interface to allow modifications of XLSX format, which is XML based. It does make sense to keep an equivalent of the XML structure to make changes, as it's very straightforward to write back the XLSX format rather than go through an optimized data structure.Christmastide

© 2022 - 2024 — McMap. All rights reserved.