How to read large files using Tika?
I'm parsing large PDF and Word documents using Tika, but I get the following error message.

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

How can I increase the limit?

Gaston answered 26/6, 2015 at 18:2 Comment(1)
Depends entirely on how you're calling Apache Tika. How are you calling Apache Tika? - Wool
31

Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the javadocs.

Your code would then look something like (inspired by the example):

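import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public String parseToPlainText() throws Exception {
    // A write limit of -1 disables the default 100000-character limit
    BodyContentHandler handler = new BodyContentHandler(-1);
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}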
Wool answered 27/6, 2015 at 16:18 Comment(1)
This solution worked even for a document of 34 million characters. - Telefilm
5

I disagree with @Gagravarr's use of a write limit of -1, as I believe the default selected in the -1 case is in fact exactly 100000.

If I am not wrong, the Tika documentation for BodyContentHandler (via WriteOutContentHandler) states that:

The internal string buffer is bounded at 100k characters.

However, the best way to achieve this is to pass a StringWriter object as the argument in place of -1.

StringWriter any = new StringWriter();

and then

BodyContentHandler handler = new BodyContentHandler(any);

Allsun answered 17/3, 2018 at 22:58 Comment(1)
According to the code, @Wool seems to be right. Specifying -1 for the write limit does not set the limit to 100k characters; it removes the limit altogether. - Horrific
1

The two solutions proposed by @Gagravarr and @Saad are equivalent. If you look at the source code under the hood, you can see that:

  1. BodyContentHandler(-1) constructs a WriteOutContentHandler(writeLimit), which in turn calls WriteOutContentHandler(new StringWriter(), writeLimit).

  2. BodyContentHandler(Writer writer) constructs a WriteOutContentHandler(writer), which in turn calls WriteOutContentHandler(writer, -1).

So, following the principle of least astonishment and the idea that less code is better, I would rather use the first option, recommended by @Gagravarr.
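To make the delegation above concrete, here is a small stand-alone sketch. It is not the Tika source: WriteOutModel, WriteLimitDemo and repeat are hypothetical names invented for illustration, and the model throws IllegalStateException where the real handler signals the limit through a SAX exception. It only assumes the two constructor chains described in the list, with -1 meaning "no limit".

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical, simplified model of the constructor delegation described above.
// It is NOT the Tika implementation; it only mirrors the two chains in the list.
class WriteOutModel {
    private final Writer writer;
    private final int writeLimit; // -1 means "no limit" in this model
    private int writeCount = 0;

    WriteOutModel(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    // Mirrors option 2: (writer) delegates to (writer, -1)
    WriteOutModel(Writer writer) {
        this(writer, -1);
    }

    // Mirrors option 1: (writeLimit) delegates to (new StringWriter(), writeLimit)
    WriteOutModel(int writeLimit) {
        this(new StringWriter(), writeLimit);
    }

    // Writes the text, stopping at the limit if one is set.
    void characters(String text) {
        try {
            if (writeLimit == -1 || writeCount + text.length() <= writeLimit) {
                writer.write(text);
                writeCount += text.length();
            } else {
                // Keep the text up to the limit, then report that the limit was hit.
                writer.write(text, 0, writeLimit - writeCount);
                writeCount = writeLimit;
                throw new IllegalStateException(
                        "write limit of " + writeLimit + " characters reached");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String toString() {
        return writer.toString();
    }
}

class WriteLimitDemo {
    // Builds a string of n copies of c without relying on String.repeat (Java 11+).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        java.util.Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String big = repeat('x', 150000);

        WriteOutModel viaLimit = new WriteOutModel(-1);
        WriteOutModel viaWriter = new WriteOutModel(new StringWriter());
        viaLimit.characters(big);
        viaWriter.characters(big);

        // Both construction paths end up with the same unlimited configuration.
        System.out.println(viaLimit.toString().length());
        System.out.println(viaWriter.toString().length());
    }
}
```

In this model both paths reach the same two-argument constructor with a fresh StringWriter-style buffer and a limit of -1, which is why the two answers behave the same way.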

Ieyasu answered 6/10, 2021 at 14:24 Comment(0)
