The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, e.g.:
$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"
From a number of different documents about Tika, I found these additional header parameters documented:
X-Tika-OCRLanguage: eng
X-Tika-PDFextractInlineImages: true | false
X-Tika-PDFOcrStrategy: ocr_only | ocr_and_text_extraction
X-Tika-OCRoutputType: hocr
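To make the usage concrete, here is a minimal sketch of how I combine several of these headers in one request. It assumes the server runs on localhost:9998 as in the example above; tika_put and the CURL variable are hypothetical helper names of my own, not part of Tika, and the sketch says nothing about which headers the server actually honors.

```shell
# Send a file to the Tika server with any number of "Name: value" headers.
# tika_put and CURL are hypothetical helper names, not part of Tika.
tika_put() {
  file=$1; url=$2; shift 2
  # Turn each remaining argument into a "--header <value>" pair.
  for h in "$@"; do
    shift
    set -- "$@" --header "$h"
  done
  "${CURL:-curl}" -T "$file" "$url" "$@"
}

# Dry run: CURL=echo prints the curl arguments instead of sending the request.
CURL=echo tika_put test/Dokument01.pdf http://localhost:9998/tika \
  "X-Tika-PDFOcrStrategy: ocr_only" \
  "X-Tika-OCRLanguage: eng"
```

Dropping CURL=echo sends the actual request with both headers set.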
But there seems to be no documentation about how to use the X-Tika-... header parameters, or about which parameters are supported and which are not.
For example, I wonder whether it is possible to override the ImageType mode or the DPI with something like:
X-Tika-PDFocrImageType: rgb
X-Tika-PDFocrDPI: 100
My question is: Which header parameters are supported, and which naming convention do these parameters follow?