Apache Tika maxStringLength reached

I have thousands of PDF documents that are 11-15 MB each. When I parse one of them, Tika throws an exception saying the document contains more than 100,000 characters.

Error output:

Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.

How can I increase the limit to 10-15 MB?

I found a solution that uses the Tika facade class, but I could not find a way to integrate it with my code.

  Tika tika = new Tika(); 
  tika.setMaxStringLength(10*1024*1024);

Here is my code:

  BodyContentHandler handler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
  FileInputStream inputstream = new FileInputStream(location);
  ParseContext pcontext = new ParseContext();
  PDFParser pdfparser = new PDFParser(); 
  pdfparser.parse(inputstream, handler, metadata, pcontext);

Output:

System.out.println("Content of the PDF :" + handler.toString());
Natale answered 21/2, 2016 at 22:17 Comment(0)

Use

BodyContentHandler handler = new BodyContentHandler(-1);

to disable the limit. From the Javadoc:

The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
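
To make the quoted behaviour concrete, here is a minimal sketch of a write-limited SAX handler built only on the JDK. The class names `BoundedTextHandler` and `WriteLimitDemo` are hypothetical and this is not Tika's actual implementation; it only mirrors the semantics described in the Javadoc above (bounded buffer, `SAXException` when the limit is hit, `-1` to disable).

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; NOT Tika's implementation. */
class BoundedTextHandler extends DefaultHandler {
    private final int writeLimit; // maximum characters to keep, or -1 for no limit
    private final StringBuilder text = new StringBuilder();

    BoundedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Keep whatever still fits, then signal that the limit was reached.
        text.append(ch, start, writeLimit - text.length());
        throw new SAXException("Write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class WriteLimitDemo {
    public static void main(String[] args) throws SAXException {
        char[] chunk = "0123456789".toCharArray();

        // -1 disables the limit, so all ten characters are kept.
        BoundedTextHandler unlimited = new BoundedTextHandler(-1);
        unlimited.characters(chunk, 0, chunk.length);
        System.out.println(unlimited.toString().length()); // prints 10

        // A limit of 4 keeps the first four characters and then throws.
        BoundedTextHandler limited = new BoundedTextHandler(4);
        try {
            limited.characters(chunk, 0, chunk.length);
        } catch (SAXException e) {
            System.out.println("Stopped after: " + limited); // prints "Stopped after: 0123"
        }
    }
}
```

With Tika itself the limit goes straight into the constructor, for example `new BodyContentHandler(20 * 1024 * 1024)`. Note that the value counts characters of extracted text, not bytes of the PDF, so the file size is only a rough guide when choosing it.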

Disgorge answered 22/2, 2016 at 5:53 Comment(2)
Thank you for the answer. I will try it when I am home. Isn't disabling the limit more dangerous than raising it? If a user uploads a 10 GB junk PDF, the system will crash. - Natale
@Ali19033 Of course, you can also simply increase the limit so that it just covers the size of your PDFs. - Disgorge
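
One way to address the oversized-upload worry raised in the comments is to check the file size before handing the file to the parser at all. The following is only a sketch using the JDK; the class `UploadSizeGuard`, the method `isWithinLimit`, and the 20 MB cap are assumptions chosen for illustration, not part of Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: rejects files that are too large before any parsing starts. */
class UploadSizeGuard {
    // Assumed cap, chosen to sit a little above the 11-15 MB PDFs from the question.
    static final long MAX_BYTES = 20L * 1024 * 1024;

    /** Returns true when the file is no larger than maxBytes. */
    static boolean isWithinLimit(Path file, long maxBytes) throws IOException {
        return Files.size(file) <= maxBytes;
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary file to stand in for an uploaded PDF.
        Path sample = Files.createTempFile("upload", ".pdf");
        try {
            Files.write(sample, new byte[1024]);
            if (isWithinLimit(sample, MAX_BYTES)) {
                System.out.println("OK to parse");
            } else {
                System.out.println("Rejected: file too large");
            }
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

A check like this bounds the input size, while the handler's write limit bounds the amount of extracted text, so the two can be used together.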
