Tika server returned status: 404
I'm trying to set up Tika for text extraction using Python. I've installed the Java runtime (JRE 1.8.0), installed tika with pip install tika==1.23, downloaded the Tika server jar file from this link, and, as mentioned on this page, added the variable TIKA_SERVER_JAR="..tika-server-1.9.jar" to the system environment variables. I started the Tika server with the command java -jar "..tika-server-1.9.jar" and got the following output:

C:\Users\Administrator>java -jar "C:\Program Files\Java\tika-server-1.9.jar"
Mar 02, 2021 4:29:07 PM org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.9 server
Mar 02, 2021 4:29:08 PM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
Mar 02, 2021 4:29:08 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
Mar 02, 2021 4:29:08 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
Mar 02, 2021 4:29:08 PM org.apache.tika.server.TikaServerCli main
INFO: Started

When I open http://localhost:9998/ in the browser it shows me the Tika API Documentation.

But it fails when I attempt to extract text with Python as shown below.

import tika
from tika import parser
tika.initVM()

text = parser.from_file(r"..somefile.doc")
print(text)

Tika doesn't work as intended. It logs the warning below, and this is all I see on the console.

2021-03-02 16:31:03,037 [MainThread  ] [WARNI]  Tika server returned status: 404

I once used tika with python successfully a few months back and I'm clueless about what I'm missing now.

EDITED: When I run the Python snippet above, I see verbose output like the following in the console.

Mar 03, 2021 9:37:08 AM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
WARNING: No operation matching request path "/rmeta/text" is found, Relative Path: /text, HTTP Method: PUT, ContentType: */*, Accept: application/json,. Please enable FINE/TRACE log level for more details.
Mar 03, 2021 9:37:08 AM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
WARNING: javax.ws.rs.ClientErrorException: HTTP 404 Not Found
    at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
    at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:166)
    at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:526)
    at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
    at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)

This is what I see on the console every time I run the python script to extract text.
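The server log above shows a PUT to "/rmeta/text" being rejected with 404. One way to narrow this down is to send the same kind of request by hand and see which paths the running server accepts. The following is a minimal sketch using only the standard library; the helper name probe_endpoint and the sample request body are made up for illustration, and the paths to try (for example "/tika", "/rmeta", "/rmeta/text") are assumptions based on the log and on the resources the Tika server documents.

```python
import urllib.error
import urllib.request


def probe_endpoint(base_url, path, method="PUT", body=b"hello"):
    """Return the HTTP status code the server gives for `method` on `path`.

    Hypothetical helper: it sends a tiny body so the server has
    something to parse, and reports 4xx/5xx codes instead of raising.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        method=method,
        headers={"Accept": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is what we want here.
        return error.code
```

Calling probe_endpoint("http://localhost:9998", "/rmeta/text") against the running server should reproduce the 404 if that path is not registered, while a path the server does register should return a different status. If only the newer path fails, that points at a client/server version mismatch rather than a broken installation.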

Pessa answered 2/3, 2021 at 18:36 Comment(9)
Why are you starting such an old version of the Apache Tika Server jar? What happens when you fix your TIKA_SERVER_JAR variable to refer to a recent one? - Molton
@Molton I'm facing the same issue with updated versions too, I tried with tika-server-1.9 also. - Pessa
Apache Tika 1.9 was released in 2015! Try something a little bit more modern... - Molton
I went straight to the official page of Apache Tika, it shows the latest stable version is tika-1.25. - Pessa
So use that then! Stop using 7+ year old versions of the software and being surprised there are issues... - Molton
Three months ago I used the same jar file tika-server-1.9.jar and it worked for me. And I tried with multiple versions of the jar file, but still, I get the same problem. Please go through the question once again, I've edited the question a bit just in case you get any idea what I'm missing. - Pessa
You need to use matching versions of the Tika Server jar and the Tika Python wrapper. The latest version is very much recommended! You are seemingly trying to use a very recent version of the Python wrapper to talk to a 7 year old version of the Server, which is unlikely to work as 7 years ago the server hadn't had many of the endpoints added... - Molton
Okay I noted that point, and please check Tika's official page, it shows the latest stable version as tika-server-1.25 which is older than tika-server-1.9. - Pessa
9 < 25, Apache Tika 1.9 = 1.09 was released in 2015, Apache Tika 1.25 (25th subrelease of 1) was released very recently - Molton
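The point made in the comments, that 1.9 is older than 1.25 and that the server jar should match the Python wrapper, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and fetch_server_banner assumes the running server answers GET /version with a banner such as "Apache Tika 1.9", which may not hold for every release.

```python
import re
import urllib.request


def parse_version(text):
    """Return the first dotted version found in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # Compare components as integers, so 1.9 sorts before 1.25.
    return tuple(int(part) for part in match.group().split("."))


def same_release(server_banner, client_version):
    """True if the major.minor parts of both version strings agree."""
    return parse_version(server_banner)[:2] == parse_version(client_version)[:2]


def fetch_server_banner(base_url="http://localhost:9998"):
    """Read the version banner; assumes the server exposes GET /version."""
    with urllib.request.urlopen(base_url + "/version", timeout=5) as response:
        return response.read().decode("utf-8").strip()
```

With the values from the question, parse_version("tika-server-1.9.jar") gives (1, 9), which sorts below (1, 25), and same_release("Apache Tika 1.9", "1.23") is False, which is the mismatch described in the comments. Comparing tuples of integers avoids the trap of reading "1.9" and "1.25" as decimal numbers.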
