I'm using Tika and I realized that each time the jar file is downloaded and placed in Temp folder
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.
The problem is that the jar file size is around 60MB, which takes some time to download.
This is the code I'm using :
from tika import parser
def get_pdf_text(path):
parsed = parser.from_file(path):
return parsed['content']
The only workaround I found is this :
1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx
2 - Using tika.TikaClientOnly = True
3 - Replacing parser.from_file(path)
with parser.from_file(path, '/path/to/server')
But I don't want to run the jar file manually. It would be better if I can use Python to automatically run the jar file and setup tika with it without redownloading.