I'm using the Apache Tika App on my Ubuntu 16.04 server as a command-line tool to extract the content of documents.
The [Apache Tika website][1] says the following:
> Build artifacts
>
> The Tika build consists of a number of components and produces the following main binaries:
>
> - `tika-core/target/tika-core-*.jar` Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.
> - `tika-parsers/target/tika-parsers-*.jar` Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
> - `tika-app/target/tika-app-*.jar` Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
So I downloaded the latest version (1.18) of `tika-app-*.jar`, which is just a single file.
Running it from the command line like `java -jar tika-app-1.18.jar -t <filename>` gives me the file content I need, but each time I also get two warnings:
July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.
I don't know whether those warnings slow things down, but it is hard to follow the other output among those repetitive warnings.
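As a stopgap, the two warnings could be filtered out of the output rather than disabled. The following is only a sketch: it assumes the warnings are written to stderr (the timestamp and class/method header looks like the default `java.util.logging` console format, which goes to stderr), and `filter_tika_warnings` is a hypothetical helper name, not part of Tika.

```shell
# Hypothetical helper: drop Tika's two startup warnings from a stream,
# passing every other line through unchanged.
# Assumption: the patterns below match the warning text quoted above.
filter_tika_warnings() {
  grep -v \
    -e 'InitializableProblemHandler' \
    -e 'WARNING: J2KImageReader not loaded' \
    -e "WARNING: org.xerial's sqlite-jdbc is not loaded" \
    || true   # grep exits 1 when every line was filtered; not an error here
}
```

In bash it could then be attached to stderr only, for example `java -jar tika-app-1.18.jar -t <filename> 2> >(filter_tika_warnings >&2)`, so that the extracted text on stdout is untouched. This hides the warnings but does not stop Tika from producing them.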
I have tried to point Tika to my own configuration file with:

    java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>

My `tika-config.xml` file is:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
          <mime-exclude>image/jpeg</mime-exclude>
          <mime-exclude>application/x-sqlite3</mime-exclude>
          <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
        </parser>
      </parsers>
    </properties>

If I use that config, I get `No protocol: filename.doc` and the warnings are still there.
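One variant that may be worth trying is to leave the parsers alone and instead tell Tika to ignore these initialization problems. The sketch below writes such a config from the shell. It assumes that the `initializableProblemHandler` attribute on `<service-loader>` is supported by this Tika version and accepts the value `ignore`, as I understand the Tika configuration documentation to describe; the file name `tika-config-quiet.xml` is made up for illustration.

```shell
# Write a hypothetical alternative config into the current directory.
# Assumption: Tika 1.18 honours initializableProblemHandler="ignore".
cat > tika-config-quiet.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader initializableProblemHandler="ignore"/>
</properties>
EOF
```

It would be passed the same way as before: `java -jar tika-app-1.18.jar --config=tika-config-quiet.xml -t <filename>`. Whether this also avoids the `No protocol` error is not something this sketch establishes.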
How can I exclude the JPEG and SQLite parsers?
pom.xml if you are compiling Tika yourself, which you don't need to do when configuring the app! – Incidentally

`java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>` and I get `No protocol: filename.doc`. And then what is the MIME type for sqlite files? – Wismar