Python: how to use Tika with an existing jar file without downloading it again
A

5

15

I'm using Tika from Python, and I noticed that the server jar file is downloaded again and placed in the Temp folder each time:

Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.

The problem is that the jar file size is around 60MB, which takes some time to download.

This is the code I'm using :

from tika import parser

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']

The only workaround I found is this :

1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx

2 - Using tika.TikaClientOnly = True

3 - Replacing parser.from_file(path) with parser.from_file(path, '/path/to/server')

But I don't want to run the jar file manually. It would be better if I could have Python start the jar file automatically and set up Tika with it, without downloading it again.
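The three manual steps above could be scripted roughly as follows. This is only a sketch: `build_server_command`, `start_server`, the default port 9998 and the `http://localhost:<port>` endpoint are illustrative assumptions, and it presumes `java` is on the PATH and that the jar has already been downloaded once.

```python
import subprocess


def build_server_command(jar_path, port=9998, java="java"):
    """Return the argv list equivalent to: java -jar <jar> --port <port>."""
    return [java, "-jar", str(jar_path), "--port", str(port)]


def start_server(jar_path, port=9998):
    """Launch the Tika server in the background (hypothetical helper).

    The caller is responsible for waiting until the server is ready
    and for terminating the returned process when done.
    """
    return subprocess.Popen(build_server_command(jar_path, port))


def get_pdf_text(path, port=9998):
    """Parse a file against a server that is already running."""
    import tika
    tika.TikaClientOnly = True  # step 2: do not let the client start a server
    from tika import parser

    # Step 3: pass the server endpoint explicitly (assumed URL form).
    parsed = parser.from_file(path, f"http://localhost:{port}")
    return parsed["content"]
```

The server usually needs a few seconds before it accepts requests, so a real script would poll the port or sleep briefly between `start_server` and the first `get_pdf_text` call.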

Abdel answered 12/6, 2019 at 10:20 Comment(0)
L
2

To resolve this problem, set the TIKA_SERVER_JAR environment variable to the path of the folder that contains the Tika server jar file.

TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'

Lubricious answered 6/3, 2020 at 10:3 Comment(0)
Q
2

Here is what worked for me:

import os

os.environ['TIKA_SERVER_JAR'] = "<path_to_jar_and_md5>/tika-server.jar"
os.environ['TIKA_PATH'] = "<path_to_jar_and_md5_again>"

These variables are read when the library is imported, so import the parser only after setting them, and re-import it if you change them.
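A minimal sketch of that ordering, assuming the jar and its .md5 file sit in one folder. The helper names `configure_tika_env` and `get_parser` are made up for illustration; only the two environment variable names come from this answer.

```python
import os


def configure_tika_env(jar_dir, jar_name="tika-server.jar"):
    """Point both variables at a local folder and return the jar path."""
    jar_dir = os.path.abspath(jar_dir)
    os.environ["TIKA_PATH"] = jar_dir
    os.environ["TIKA_SERVER_JAR"] = os.path.join(jar_dir, jar_name)
    return os.environ["TIKA_SERVER_JAR"]


def get_parser(jar_dir):
    """Set the variables first, then import the parser."""
    configure_tika_env(jar_dir)
    from tika import parser  # imported only after the variables are set
    return parser
```

Keeping the `from tika import parser` line inside the function makes it harder to import the library by accident before the environment is configured.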

Quilting answered 18/2, 2022 at 11:48 Comment(0)
R
1

If you don't want to add an environment variable, you can change the directory where tika looks for the tika-server.jar file with the code below.

from tika import tika
tika.TikaJarPath = r'TIKA_SERVER_PATH'

In that TIKA_SERVER_PATH folder, the jar file must be named tika-server.jar (the name must not include the version), and the .md5 file must be there as well. If the .md5 file does not match the version of tika-server.jar, this method doesn't work: tika will delete your file and download the default version.
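Before relying on this, it may help to check that the two files agree. The sketch below assumes the .md5 file holds the hex MD5 digest of the jar and is named tika-server.jar.md5. The helper names are illustrative, and the library's own comparison may be stricter, for example about trailing whitespace.

```python
import hashlib
from pathlib import Path


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_matches_md5(folder):
    """Return True if tika-server.jar and tika-server.jar.md5 agree."""
    folder = Path(folder)
    jar = folder / "tika-server.jar"
    md5_file = folder / "tika-server.jar.md5"  # assumed file name
    if not (jar.is_file() and md5_file.is_file()):
        return False
    # strip() tolerates a trailing newline; the library may not.
    return md5_file.read_text().strip() == md5_hex(jar)
```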

Railroader answered 8/11, 2021 at 8:26 Comment(0)
S
0

After trying almost everything and debugging the tika.py library code, I found that you must set both of these variables for this hack to work.

TIKA_SERVER_JAR="/path_to_tika_server/tika-server.jar"
TIKA_PATH="/path_to_tika_server"

You also need to provide a .md5 signature file, because since Tika version 1.18 no .md5 file is provided (a sha512 signature is provided instead, see https://archive.apache.org/dist/tika/). So you need to trick the library into accepting your downloaded file.
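One way to produce such a file yourself is to hash the jar locally. This is a sketch under the assumption that the library compares the file's contents with the jar's MD5 hex digest; `write_md5_sidecar` is a hypothetical helper, not part of the tika package.

```python
import hashlib
from pathlib import Path


def write_md5_sidecar(jar_path):
    """Write <jar>.md5 next to the jar and return the new file's path."""
    jar_path = Path(jar_path)
    digest = hashlib.md5(jar_path.read_bytes()).hexdigest()
    md5_path = jar_path.with_name(jar_path.name + ".md5")
    # Written without a trailing newline, assuming an exact string comparison.
    md5_path.write_text(digest)
    return md5_path
```

Calling `write_md5_sidecar("/path_to_tika_server/tika-server.jar")` would create tika-server.jar.md5 in the same folder.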

Or someone could just patch the Python library :)

Sneer answered 5/7, 2021 at 13:54 Comment(0)
M
0

I am wondering how to get the .md5 file for tika-server.jar, since no .md5 file is provided and a sha512 signature is provided instead.

Morgen answered 8/2, 2022 at 8:33 Comment(2)
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review - Dunnage
You can find them here, inside the folder for the required version number: repo1.maven.org/maven2/org/apache/tika/tika-server - Quilting
