SXSSFWorkbook.write to FileOutputStream writes huge files
Asked Answered
U

1

7

I'm trying to use SXSSFWorkbook to write an Excel spreadsheet from scratch.

      wb = SXSSFWorkbook(500)
      wb.isCompressTempFiles = true
      sh = streamingWorkbook.createSheet(t.getMessage("template.sheet.name"))

All is fine but when I invoke the final code:

    val out = FileOutputStream(localPath)
    wb.write(out)
    out.close()
    // dispose of temporary files backing this workbook on disk
    wb.dispose()

I get a huge excel file instead of the zipped XLSX I'm expecting. I've tried to manually zip the file and from a 120MB file I can get it down to 9MB. So what am I missing?

Using: Kotlin and

    implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.2'  // For `.xlsx` files

-- Update 1

I was under the impression that xlsx are zipped files containing xml data in essence [1]. What POI outputs through both XSSFWorkbook and SXSSFWorkbook can be compressed at least by an order of magnitude of 10. I've used this simple code to demonstrate:

fun main() {
  val workbook = XSSFWorkbook()
  writeRowsAndSave(workbook, "test.xlsx")
  workbook.close()

  val streamingWorkbook = SXSSFWorkbook(IN_MEMORY_ROWS_WINDOW_SIZE)
  streamingWorkbook.isCompressTempFiles = true
  writeRowsAndSave(streamingWorkbook, "test-streaming.xlsx")
  streamingWorkbook.dispose()
}

private fun writeRowsAndSave(workbook: Workbook, fileName: String) {
  val ROWS_COUNT = 2_000
  val COLS_COUNT = 1_000

  val sheet = workbook.createSheet("Test Sheet 1")
  for (i in 1..ROWS_COUNT) {
    val row = sheet.createRow(i)
    println("Row $i")
    for(j in 1..COLS_COUNT) {
        row.createCell(j).setCellValue("Test $i")
    }
  }

  FileOutputStream("./$fileName").use {
      workbook.write(it)
  }
}

This produces 5MB-each files that when compressed have roughly 439KB (?!).

Undermanned answered 31/3, 2020 at 14:36 Comment(5)
I think you should show more of your code here... One thing to worry about is the missing call of wb.close(), but it might not be necesary when using SXSSFWorkbook while I found out it's necessary at XSSFWorkbook.Jollity
I've tried, wb.close doesn't seem to do much :(Undermanned
Cannot reproducing only having that simple code snippets. SXSSFWorkbook defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx.Rudderhead
I'm creating huge files that have only a few string types so it can have a huge impact!Undermanned
Also found this: #47851578 but it's somewhat unrelated.Undermanned
R
6

SXSSFWorkbook defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx.

SXSSFWorkbook, as well as all other Office Open XML formatted files, apache poi creates, were zipped using org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream. That uses deflate as compression algorithm and Deflater.DEFAULT_COMPRESSION as the default compression level. One could overwrite protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) of SXSSFWorkbook to set another compression level. But that also should not have that big impact to the file size of the resulting *.xlsx.

Example Java code:

import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import java.util.zip.Deflater;

class CreateSXSSFDifferentCompression {

 static SXSSFWorkbook createSXSSFWorkbook(int compressionLevel, int rowAccessWindowSize, 
                                          boolean compressTmpFiles, boolean useSharedStringsTable) {
  SXSSFWorkbook workbook = null;
  if (compressionLevel != Deflater.DEFAULT_COMPRESSION) {
   workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable) {
    protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) {
     ZipArchiveOutputStream zos = new ZipArchiveOutputStream(out);
     zos.setUseZip64(Zip64Mode.AsNeeded);  
     zos.setLevel(compressionLevel);
     return zos;
    }    
   }; 
  } else {
   workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable);
  }
  return workbook;
 }

 public static void main(String[] args) throws Exception {

  SXSSFWorkbook workbook = null;

  // uses Deflater.DEFAULT_COMPRESSION and inline strings
  //workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, false); 

  // uses Deflater.DEFAULT_COMPRESSION and shared strings table
  //workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, true); 

  // uses Deflater.BEST_COMPRESSION and inline strings
  workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, false); 

  // uses Deflater.BEST_COMPRESSION and shared strings table
  //workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, true); 

  int ROWS_COUNT = 2000;
  int COLS_COUNT = 1000;

  Sheet sheet = workbook.createSheet("Test Sheet 1");
  for (int i = 1 ; i <= ROWS_COUNT; i++) {
   Row row = sheet.createRow(i);
   //System.out.println("Row " + i);
   for(int j = 1; j <= COLS_COUNT; j++) {
    row.createCell(j).setCellValue("Test " + i);
   }
  }

  FileOutputStream out = new FileOutputStream("./Excel.xlsx");
  workbook.write(out);
  out.close();
  workbook.close();
  workbook.dispose();

  File file = new File("./Excel.xlsx");
  System.out.println(file.length());

 }
}

This results in Excel.xlsx file size of:

5,031,034 bytes when Deflater.DEFAULT_COMPRESSION and inline strings are used.

4,972,663 bytes when Deflater.DEFAULT_COMPRESSION and shared strings table are used.

4,972,915 bytes when Deflater.BEST_COMPRESSION and inline strings are used.

And 4,966,749 bytes when Deflater.BEST_COMPRESSION and shared strings table are used.

Used: Java 12, apache poi 4.1.2, Ubuntu Linux.

Neither I would call that huge for a spreadsheet of 2,000 rows x 1,000 columns nor I would call the impact of the different settings big.

And the entries are compressed very good.

If you look into the Excel.xlsx ZIP archive, you will find the uncompressed size of xl/worksheets/sheet1.xml 112,380,273 bytes when inline strings are used. The uncompressed size of xl/sharedStrings.xml is 138 bytes then and only contains the very basic XML.

If shared strings table is used, then the uncompressed size of xl/worksheets/sheet1.xml is 68,377,273 bytes and the uncompressed size of xl/sharedStrings.xml is 49,045 bytes and contains 2,000 entries.

If Excel itself saves *.xlsx files, it creates files having approximately the same file size when content is equal. So Excel itself uses the same compression level.

Of course one can compressing the *.xlsx files more when storing the Excel.xlsx into a *.zip archive again. But that would not be what Excel expects to be a *.xlsx file.

Microsoft states at What are the benefits of Open XML Formats?:

Compact files Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip compression technology to store documents, offering potential cost savings as it reduces the disk space required to store files and decreases the bandwidth needed to send files via e-mail, over networks, and across the Internet. When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again. You do not have to install any special zip utilities to open and close files in Office.

The important parts here are:

When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again.

That means, if apache poi would compress the files using other methods or compression levels than Microsoft Office itself, then Microsoft Office would be unable to do so with files apache poi had created.

So, since apache poi creates files which Excel (Microsoft Office) is able opening directly, it uses the same compression method and compression level as Excel (Microsoft Office) will do.

Rudderhead answered 1/4, 2020 at 17:35 Comment(1)
The "huge" in the title refers to sheets with 2500+ columns and 10k+ rows which go into 120+MB realm. Whatever compression POI and Excel are doing seems woefully inadequate compared to zipping the file externally which takes that filesize down to ~10MBs.Undermanned

© 2022 - 2024 — McMap. All rights reserved.