SXSSFWorkbook
defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook
writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook
and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx
.
SXSSFWorkbook
, as well as all other Office Open XML
formatted files, apache poi
creates, were zipped using org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream
. That uses deflate as compression algorithm and Deflater.DEFAULT_COMPRESSION
as the default compression level. One could overwrite protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out)
of SXSSFWorkbook
to set another compression level. But that also should not have that big impact to the file size of the resulting *.xlsx
.
Example Java
code:
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import java.util.zip.Deflater;
class CreateSXSSFDifferentCompression {
static SXSSFWorkbook createSXSSFWorkbook(int compressionLevel, int rowAccessWindowSize,
boolean compressTmpFiles, boolean useSharedStringsTable) {
SXSSFWorkbook workbook = null;
if (compressionLevel != Deflater.DEFAULT_COMPRESSION) {
workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable) {
protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) {
ZipArchiveOutputStream zos = new ZipArchiveOutputStream(out);
zos.setUseZip64(Zip64Mode.AsNeeded);
zos.setLevel(compressionLevel);
return zos;
}
};
} else {
workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable);
}
return workbook;
}
public static void main(String[] args) throws Exception {
SXSSFWorkbook workbook = null;
// uses Deflater.DEFAULT_COMPRESSION and inline strings
//workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, false);
// uses Deflater.DEFAULT_COMPRESSION and shared strings table
//workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, true);
// uses Deflater.BEST_COMPRESSION and inline strings
workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, false);
// uses Deflater.BEST_COMPRESSION and shared strings table
//workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, true);
int ROWS_COUNT = 2000;
int COLS_COUNT = 1000;
Sheet sheet = workbook.createSheet("Test Sheet 1");
for (int i = 1 ; i <= ROWS_COUNT; i++) {
Row row = sheet.createRow(i);
//System.out.println("Row " + i);
for(int j = 1; j <= COLS_COUNT; j++) {
row.createCell(j).setCellValue("Test " + i);
}
}
FileOutputStream out = new FileOutputStream("./Excel.xlsx");
workbook.write(out);
out.close();
workbook.close();
workbook.dispose();
File file = new File("./Excel.xlsx");
System.out.println(file.length());
}
}
This results in Excel.xlsx
file size of:
5,031,034 bytes
when Deflater.DEFAULT_COMPRESSION and inline strings are used.
4,972,663 bytes
when Deflater.DEFAULT_COMPRESSION and shared strings table are used.
4,972,915 bytes
when Deflater.BEST_COMPRESSION and inline strings are used.
And
4,966,749 bytes
when Deflater.BEST_COMPRESSION and shared strings table are used.
Used: Java 12
, apache poi 4.1.2
, Ubuntu Linux
.
Neither I would call that huge for a spreadsheet of 2,000 rows x 1,000 columns nor I would call the impact of the different settings big.
And the entries are compressed very good.
If you look into the Excel.xlsx
ZIP archive, you will find the uncompressed size of xl/worksheets/sheet1.xml
112,380,273 bytes when inline strings are used. The uncompressed size of xl/sharedStrings.xml
is 138 bytes then and only contains the very basic XML.
If shared strings table is used, then the uncompressed size of xl/worksheets/sheet1.xml
is 68,377,273 bytes and the uncompressed size of xl/sharedStrings.xml
is 49,045 bytes and contains 2,000 entries.
If Excel
itself saves *.xlsx
files, it creates files having approximately the same file size when content is equal. So Excel
itself uses the same compression level.
Of course one can compressing the *.xlsx
files more when storing the Excel.xlsx
into a *.zip
archive again. But that would not be what Excel
expects to be a *.xlsx
file.
Microsoft
states at What are the benefits of Open XML Formats?:
Compact files Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip
compression technology to store documents, offering potential cost
savings as it reduces the disk space required to store files and
decreases the bandwidth needed to send files via e-mail, over
networks, and across the Internet. When you open a file, it is
automatically unzipped. When you save a file, it is automatically
zipped again. You do not have to install any special zip utilities to
open and close files in Office.
The important parts here are:
When you open a file, it is automatically unzipped. When you save a
file, it is automatically zipped again.
That means, if apache poi
would compress the files using other methods or compression levels than Microsoft Office
itself, then Microsoft Office
would be unable to do so with files apache poi
had created.
So, since apache poi
creates files which Excel
(Microsoft Office
) is able opening directly, it uses the same compression method and compression level as Excel
(Microsoft Office
) will do.
wb.close()
, but it might not be necesary when usingSXSSFWorkbook
while I found out it's necessary atXSSFWorkbook
. – JollitySXSSFWorkbook
defaults to using inline strings instead of a shared strings table. That meansSXSSFWorkbook
writes text directly in the sheet even if it is multiple times the same text.XSSFWorkbook
and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting*.xlsx
. – Rudderhead