Use tika with python, runtimeerror: unable to start tika server

I am trying to use the tika package to parse files. Tika is installed successfully, and I started tika-server-1.18.jar from cmd with: java -jar tika-server-1.18.jar

My code in Jupyter is:

import tika 
from tika import parser
parsed = parser.from_file('')

However, I receive the error below:

2018-07-25 10:20:13,325 [MainThread ] [WARNI] Failed to see startup log message; retrying...
2018-07-25 10:20:18,329 [MainThread ] [WARNI] Failed to see startup log message; retrying...
2018-07-25 10:20:23,332 [MainThread ] [WARNI] Failed to see startup log message; retrying...
2018-07-25 10:20:28,340 [MainThread ] [ERROR] Tika startup log message not received after 3 tries.
2018-07-25 10:20:28,340 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.

RuntimeError: Unable to start Tika Server.

Cockatrice answered 25/7, 2018 at 8:28 Comment(4)
Any update to this question? I get the same error message. - Fsh
I gave up on the Tika server and used TikaApp instead: tika_client = TikaApp(file_jar='') (with the path where I stored the Tika app jar). It works. For parser I haven't found a solution, unfortunately. - Cockatrice
Using TikaApp, tika_client.extract_all_content(path_to_file) returns an empty string. - Nitid
This answer has solved my problem. https://mcmap.net/q/88972/-how-can-i-use-tika-package-https-github-com-chrismattmann-tika-python-in-python-2-7-to-parse-pdf-files - Sidelong
16

According to Apache Tika's site, all new versions of the tika-server.jar will require Java 8.

24 April 2018: Apache Tika Release Apache Tika 1.18 has been released! This release includes bug fixes (e.g. extraction from grouped shapes in PPT), security fixes and upgrades to dependencies. PLEASE NOTE: The next versions will require Java 8. Please see the CHANGES.txt file for the full list of changes in the release and have a look at the download page for more information on how to obtain Apache Tika 1.18.

The tika Python library's docs are outdated: they claim that Java 7 is needed, but Java 8 must now be installed. This is because the library automatically downloads the current version of tika-server.jar at runtime if it is not found in your temp directory.

After installing Java 8, my basic test code launched the server and worked without error.
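
To confirm which Java the Python process actually sees, you can run java -version from Python. The following is only a minimal sketch and not part of the tika library: the helper names parse_java_major and installed_java_major are made up for illustration, and it assumes java is reachable through PATH.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_181" -> 8) and the newer one
    ("17.0.1" -> 17). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    if first == 1 and second is not None:
        return int(second)  # legacy "1.x" numbering
    return first


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    if shutil.which("java") is None:
        return None
    # `java -version` normally prints to stderr, so both streams are checked.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("java not found on PATH")
    elif major < 8:
        # Java 8 as the minimum is taken from the release note quoted above.
        print(f"Java {major} found; newer tika-server jars need Java 8 or later")
    else:
        print(f"Java {major} found")
```

If this reports that java is missing while java -version works in your terminal, the Jupyter kernel is probably running with a different PATH than your shell.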

Parhe answered 6/11, 2018 at 15:26 Comment(2)
It's not necessary to install Java when working with Apache Tika. - Lumberjack
I'm stuck on the same issue and have posted a question too; could you please check it? It seems the Tika server is not starting; it returns Not Found 404. - Hendecasyllable
10

After you import tika, you need to initialize the Java server:

import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('')  # file name should be here
Haematic answered 25/6, 2020 at 21:44 Comment(1)
I tried the same way and initialized it after the import. It returns Not Found 404. Please check this link. - Hendecasyllable
3

Download Java. If you already have a version of Java installed, try updating it to the latest version. The version that works for me is 1.18.

Trichromatic answered 5/7, 2019 at 5:47 Comment(0)
1

You have not passed an argument (specified a file) in your line:

parsed = parser.from_file('')

Give it a file to chew on, e.g.:

parsed = parser.from_file('myfile.txt')

The server didn't start, and presumably that is what triggers the "no log message" warning; see line 644 in the source on GitHub.

Then another error message tells you it ain't going to play...
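
Independently of the server problem, a small guard catches an empty or wrong path before the library is called at all. This is only a sketch; require_file is a hypothetical helper name, not something the tika package provides.

```python
import os


def require_file(path):
    """Return `path` unchanged if it names an existing file, else raise."""
    if not path:
        raise ValueError("no file name given")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    return path


# Usage (needs the tika package and Java installed):
# from tika import parser
# parsed = parser.from_file(require_file("myfile.txt"))
```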

Gav answered 9/8, 2018 at 12:3 Comment(0)
1

I faced a similar issue. I tried all the steps mentioned here and nothing helped. How I solved it:

  1. Checked the log files of tika and tika-server. On Windows, you can find them in C:/Users/your_user_name/AppData/Local/Temp/
  2. Found that the tika-server log reported a "port already in use" error.

See the log snippet below:

INFO: Setting the server's publish address to be http://localhost:9998/
WARNING: FAILED SelectChannelConnector@localhost:9998: java.net.BindException: Address already in use: bind
java.net.BindException: Address already in use: bind
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Unknown Source)
        at sun.nio.ch.Net.bind(Unknown Source)
        at sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
        at sun.nio.ch.ServerSocketAdaptor.bind(Unknown Source)
        at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
        at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
        at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.server.Server.doStart(Server.java:293)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.apache.cxf.transport.http_jetty.JettyHTTPServerEngine.addServant(JettyHTTPServerEngine.java:417)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.activate(JettyHTTPDestination.java:179)
        at org.apache.cxf.transport.AbstractObservable.setMessageObserver(AbstractObservable.java:49)
        at org.apache.cxf.binding.AbstractBindingFactory.addListener(AbstractBindingFactory.java:95)
        at org.apache.cxf.jaxrs.JAXRSBindingFactory.addListener(JAXRSBindingFactory.java:88)
        at org.apache.cxf.endpoint.ServerImpl.start(ServerImpl.java:123)
        at org.apache.cxf.jaxrs.JAXRSServerFactoryBean.create(JAXRSServerFactoryBean.java:206)
        at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:213)
  3. This clearly indicated that another process was already listening on the same port, so I just needed to kill the Java process on port 9998 (which I assumed was a defunct leftover).
  4. Once I killed the process in Task Manager, I reran the Python script and it worked correctly.
  5. To cross-check, you can also run the tika-server.jar file in the same path, C:/Users/your_user_name/AppData/Local/Temp/, with the command below and see whether it fails or runs correctly: java -jar tika-server.jar
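
Before reaching for Task Manager, you can check from Python whether anything is listening on the port. The sketch below uses only the standard library; port_in_use is an illustrative helper name, and 9998 is the port shown in the log above.

```python
import socket


def port_in_use(host="localhost", port=9998, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there.
        return False


if __name__ == "__main__":
    if port_in_use():
        print("Port 9998 is taken; a stale tika-server may still be running.")
    else:
        print("Port 9998 is free.")
```

A True result only says that some process holds the port; it does not tell you whether that process is a working Tika server or a stuck one.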

Hope this will be helpful to someone in future.

Curt answered 12/4, 2021 at 10:52 Comment(0)
1

I got the same error and solved it using the steps below:

  1. Check the Tika server log file (usually it's located at C:/Users/your_user_name/AppData/Local/Temp/):

    2023-04-02 08:06:47,277 [Thread-1 (pr] [ERROR] Unable to run java; is it installed?
    2023-04-02 08:06:47,278 [Thread-1 (pr] [ERROR] Failed to receive startup confirmation from startServer.

  2. This error suggests Java is not installed, so check whether it is installed using:

    java -version

  3. If it's not installed, you may download it here: https://www.java.com/en/download/.

  4. If the error persists, try starting the Tika server manually using:

    java -jar tika-server.jar

  • Remember to run it from the directory where your jar file is located. Now it should work.
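
Once the server has been started by hand, the Python client can be told to use it instead of launching its own. The sketch below is an assumption-laden example rather than the library's prescribed usage: it assumes tika-python reads the TIKA_CLIENT_ONLY environment variable when first imported (as its README describes), that the server answers GET requests on /tika, and that it listens on the default http://localhost:9998. The helper names are made up for illustration.

```python
import http.client
import os
import urllib.error
import urllib.request


def tika_server_is_up(endpoint="http://localhost:9998", timeout=3.0):
    """Return True if an HTTP GET on <endpoint>/tika answers with status 200."""
    try:
        with urllib.request.urlopen(endpoint + "/tika", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def parse_with_running_server(path, endpoint="http://localhost:9998"):
    """Parse `path` through a server that is already running at `endpoint`."""
    if not tika_server_is_up(endpoint):
        raise RuntimeError(f"no Tika server answering at {endpoint}")
    # Assumption: the variable must be set before tika is imported for the
    # first time in this process, otherwise it has no effect.
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    from tika import parser
    return parser.from_file(path, serverEndpoint=endpoint)


# Usage, with the server started via `java -jar tika-server.jar`:
# parsed = parse_with_running_server("myfile.pdf")
```

In a Jupyter session where tika was already imported, restart the kernel first so the environment variable is picked up.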
Leopold answered 2/4, 2023 at 0:35 Comment(0)
0

If you are using Ubuntu 20.04 (or 18.04) like me, the solution is to install Oracle JDK 17. Do the following:

sudo add-apt-repository ppa:linuxuprising/java
sudo apt update
sudo apt install oracle-java17-installer

Type java -version on the terminal. You should see the following print-out:

java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)

tika should then be able to extract text from your PDF in Python.

parser.from_file(<your pdf file>)
Pickwickian answered 31/12, 2021 at 11:29 Comment(0)
