Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megs in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently.

At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way.

I have seen many source code examples for loading files into Tika / POI / PDFBox via input streams. I have seen many examples for extracting plain text via output streams. However, I've performed some basic memory profiling experiments... and I haven't yet found a way with any of these libraries (Tika, POI, or PDFBox) to avoid loading an entire document into main memory.

In between reading from a stream and writing to a stream, there is obviously a conversion step in the middle... which I have not yet found a way to perform on a streaming basis. Am I missing something, or is this a known issue with extracting text from MS Office or PDF files using Tika / POI / PDFBox? Can I have true end-to-end streaming, without a file being fully loaded into main memory at any point along the way?

Epicycle answered 30/7, 2014 at 17:57 Comment(5)
Wouldn't this with POI be what you're looking for? The event-driven one? - Hylotheism
Apache Tika will happily process Excel files (.xls and .xlsx) in a completely streaming manner, assuming you've given it a non-buffering ContentHandler to output to. Did you try with Excel? And did you make sure you gave a sensible non-buffering ContentHandler to accept text into? - Winkelman
@Gagravarr: Do you have an example of a "non-buffering ContentHandler"? I've tried writing my own implementation of the "org.xml.sax.ContentHandler" interface, which does nothing more than System.out the characters passed to its "characters()" method... but the full Excel file is still loaded into memory. I then tried using "org.apache.tika.sax.WriteOutContentHandler", which seems basically meant to do the same thing, but had the same result. - Epicycle
Make sure you use a TikaInputStream backed by a File, rather than a regular InputStream, otherwise there will still be buffering on the input side. - Winkelman
AFAIK PDFBox parses the whole PDF into memory when opening it. - Checani

The first thing to make sure of, if you care about the memory footprint, is that you're using a TikaInputStream backed by a File, e.g. change from something like:

InputStream input = new FileInputStream("foo.xls");

to something like:

InputStream input = TikaInputStream.get(new File("foo.xls"));

If you really only have an InputStream, not a file, and you want the lower-memory option if possible, force Tika to buffer it to a temp file with something like:

InputStream origInput = getAnInputStream();
TikaInputStream input = TikaInputStream.get(origInput);
// Asking for the File forces Tika to spool the stream to a temp file
input.getFile();

Many, but not all, parsers will be able to take advantage of the backing File and read only the bits they need into memory, rather than buffering the whole thing, which will help.
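To make the "buffer it to a temp file" idea concrete, here is a rough JDK-only sketch of the same trade-off: the stream is copied to disk in small chunks, so the heap never holds the whole payload. This is not Tika's implementation (TikaInputStream.getFile() does the spooling for you); the class and method names TempFileSpool and spoolToTempFile are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only; not part of Tika.
class TempFileSpool {

    /**
     * Copies the stream to a temporary file. Files.copy moves the data
     * through a small internal buffer, so memory use stays roughly constant
     * regardless of how large the input is.
     */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        tmp.toFile().deleteOnExit();
        // createTempFile already created the file, so overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        Path spooled = spoolToTempFile(new ByteArrayInputStream(data));
        System.out.println("Spooled " + Files.size(spooled) + " bytes to " + spooled);
        Files.delete(spooled);
    }
}
```

The cost is extra disk I/O and a temp file to clean up, which is usually an acceptable price when the alternative is holding hundreds of megabytes on the heap per thread.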

Next up, make sure your ContentHandler doesn't buffer the whole contents into memory before outputting. Anything which does XPath lookups on the resulting document is probably out, as is anything which has an internal StringBuffer or similar. Pick a simpler one, and make sure you're set up to write the resulting HTML / text SAX events somewhere as they come in.
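As a minimal sketch of what "non-buffering" means here, the handler below forwards each characters() event straight to a Writer and keeps no copy of the text. It uses only the JDK's SAX classes; the name StreamingTextHandler is hypothetical, and the demo drives it with the JDK's own XML parser rather than Tika. With Tika you would pass such a handler (or something like WriteOutContentHandler wrapping a Writer) as the ContentHandler argument when parsing.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only; not part of Tika.
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    // Write each chunk of text as soon as it arrives; nothing is accumulated.
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Failed to write text", e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        characters(ch, start, length);
    }

    // Flush once at the end so nothing is left in the Writer's own buffer.
    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush output", e);
        }
    }

    public static void main(String[] args) throws Exception {
        Writer stdout = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        String xml = "<doc><p>Hello</p><p>World</p></doc>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new StreamingTextHandler(stdout));
    }
}
```

Note that a handler like this only removes buffering on the output side. If the parser itself builds the whole document in memory first, as described below, the handler cannot prevent that.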

Finally, not all of the Tika parsers support streaming processing. Some only work by parsing the whole file's structure, then wandering through that finding the interesting bits to output. With those, using a File-backed TikaInputStream will probably help, but won't stop a fair bit of memory being used.

IIRC, the low memory parsers include:

  • .xls
  • .xlsx
  • All ODF-based formats
  • XML

Some of the common document parsers which load + parse most/all of the file before being able to output anything include:

  • .doc / .docx / .ppt / .pptx
  • .pdf
  • Images
  • Videos
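If you are parsing many files concurrently, one way to act on the two lists above is to classify each file up front, so the whole-file formats can be scheduled more conservatively. The sketch below encodes only the extensions named above, recalled from memory and not an authoritative Tika capability table. The class name ParserMemoryHint is hypothetical, and narrowing "ODF-based formats" to .odt / .ods / .odp is an assumption made for brevity.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only; the sets mirror the lists above.
class ParserMemoryHint {
    // Formats listed above as low-memory (ODF narrowed to three common extensions).
    static final Set<String> LOW_MEMORY = Set.of("xls", "xlsx", "odt", "ods", "odp", "xml");
    // Formats listed above as loading most or all of the file before outputting.
    static final Set<String> WHOLE_FILE = Set.of("doc", "docx", "ppt", "pptx", "pdf");

    /** Returns "low-memory", "whole-file", or "unknown" based on the file extension. */
    static String hintFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (LOW_MEMORY.contains(ext)) {
            return "low-memory";
        }
        if (WHOLE_FILE.contains(ext)) {
            return "whole-file";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"foo.xls", "report.pdf", "notes.txt"}) {
            System.out.println(name + " -> " + hintFor(name));
        }
    }
}
```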
Winkelman answered 31/7, 2014 at 12:16 Comment(0)
