Apache Tika App configuration file

I'm using the Apache Tika App on my Ubuntu 16.04 server as a command-line tool to extract the content of documents.

The Apache Tika website says the following:

Build artifacts

The Tika build consists of a number of components and produces the following main binaries:

tika-core/target/tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.

tika-parsers/target/tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.

tika-app/target/tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.

So I downloaded the latest version (1.18) of tika-app-*.jar, which is just a single file.

Running it from the command line with java -jar tika-app-1.18.jar -t <filename> gives me the file content I need, but each time I also get two warnings:

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.

I don't know whether those warnings slow things down, but it is hard to follow the other output among the repeated warnings.

I have tried to point Tika to my own configuration file by:

java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>

My tika-config.xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

If I use that config I get No protocol: filename.doc and the warnings are still in place.

How can I exclude the JPEG and SQLite parsers?

Wismar answered 28/7, 2018 at 15:18 Comment(6)
Did you read and follow tika.apache.org/1.18/configuring.html? - Incidentally
@Incidentally Thank you, no, I hadn't read that. Based on it, I am passing the configuration file correctly. I can probably use <mime-exclude>image/jpeg</mime-exclude> to keep images from being parsed. I would probably need a default config file; do I still use the content of pom.xml? And the sqlite parser probably gets excluded the same way as images, correct? - Wismar
You only need pom.xml if you are compiling Tika yourself, which you don't need to do when configuring the app! - Incidentally
@Incidentally OK, I get it. But when I make a config file from just the first example of how parsers can be configured and run java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>, I get No protocol: filename.doc. Also, what is the MIME type for sqlite files? - Wismar
@Incidentally I have updated my question based on the link you gave me. - Wismar
Those warnings come at initialisation time, but you're excluding things at parse time. You probably just want to follow tika.apache.org/1.18/configuring.html#Load_Error_Handling to turn off the warnings. - Incidentally
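
The suggestion in the last comment can be sketched as a config that leaves the parsers alone and only changes how initialisation problems are reported. Both warnings in the question are logged by InitializableProblemHandler, so this sketch sets only the initializableProblemHandler attribute; the file name tika-config-quiet.xml is a made-up example, and whether this alone silences every warning on a given Tika version is an assumption to verify.

```shell
# Write a minimal config that only changes initialisation-problem handling.
# The file name is hypothetical; any path works.
cat > tika-config-quiet.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <service-loader initializableProblemHandler="ignore"/>
</properties>
EOF

# Then pass it to the app (<filename> is a placeholder, as in the question):
#   java -jar tika-app-1.18.jar --config=tika-config-quiet.xml -t <filename>
```

Because no parsers section is present, the default parser set should stay in effect, which keeps the change limited to the warnings themselves.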

My solution was this tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Silence both load-time and initialisation-time problem reports -->
  <service-loader loadErrorHandler="IGNORE" initializableProblemHandler="ignore"/>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/x-sqlite3</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
    </parser>
  </parsers>
</properties>

and then set:

export TIKA_CONFIG=/path/to/tika-config.xml

in my .bashrc file.
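
A small sketch of how that export could be added safely, assuming a POSIX shell. The helper name persist_tika_config and its two arguments are hypothetical, not part of Tika; it only checks that the config file exists and avoids appending the same line twice.

```shell
# Hypothetical helper: record TIKA_CONFIG in a shell startup file exactly once.
#   $1 = path to tika-config.xml
#   $2 = startup file to edit (for example "$HOME/.bashrc")
persist_tika_config() {
    config_path=$1
    rc_file=$2
    # Quote the path so that directories containing spaces survive.
    line="export TIKA_CONFIG=\"$config_path\""

    # Refuse to point Tika at a file that does not exist.
    if [ ! -f "$config_path" ]; then
        echo "config file not found: $config_path" >&2
        return 1
    fi

    # Append the export line only if an identical line is not already there.
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi

    # Make the setting effective in the current shell as well.
    export TIKA_CONFIG="$config_path"
}

# Example call (paths are placeholders):
#   persist_tika_config /path/to/tika-config.xml "$HOME/.bashrc"
```

With the variable set, java -jar tika-app-1.18.jar -t <filename> should no longer need the --config flag.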

Deandre answered 3/6, 2019 at 19:19 Comment(1)
For some reason, on Windows the path separator leads to a wrong file path. - Seraphine
