How to use Tika in server mode

Tika's website says that tika-app-1.2.jar can be used in server mode. Does anyone know how to send documents to this server and receive the parsed text once it is running?

Herzberg answered 1/9, 2012 at 21:39 Comment(0)

Tika supports two "server" modes. The simpler, original one is the --server flag of Tika-App. The more functional but more recent one is the JAX-RS (JSR-311) server component, which is a separate jar.

The Tika-App Network Server is very simple to use. Start Tika-App with the --server flag and a --port ### flag telling it which port to listen on. Then connect to that port and send it a single file; you'll get back the HTML version. Netcat works well for this: run java -jar tika-app.jar --server --port 12345, then nc 127.0.0.1 12345 < MyFileToExtract, and you'll get the HTML back.
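The netcat step above can be wrapped in a small shell function. This is only a sketch: the function name and the TIKA_APP_HOST / TIKA_APP_PORT variables are made up for illustration, and it assumes a tika-app version that still accepts --server (a comment in this thread reports it was dropped in Tika 2.0). Depending on your netcat variant, you may need an extra flag to control what happens at end of input.

```shell
# Sketch: send one file to a tika-app network server and print the HTML it returns.
# Assumes the server was started with: java -jar tika-app.jar --server --port 12345
tika_app_extract() {
    file=${1:-}
    host=${TIKA_APP_HOST:-127.0.0.1}   # hypothetical override variable
    port=${TIKA_APP_PORT:-12345}       # hypothetical override variable
    if [ ! -r "$file" ]; then
        echo "cannot read: $file" >&2
        return 1
    fi
    # The server parses whatever arrives on the socket and writes the result back.
    nc "$host" "$port" < "$file"
}
```

Usage would look like tika_app_extract MyFileToExtract > out.html.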

The JAX-RS (JSR-311) server component supports several different URLs, for things like metadata, plain text, etc. You start the server with java -jar tika-server.jar, then make HTTP PUT calls to the appropriate URL with your input document, and you get the extracted resource back. There are plenty of details and examples (including using curl for testing) on the wiki page.

The Tika App Network Server is fairly simple, only supports one mode (extract to HTML), and is generally used for testing / demos / prototyping / etc. The Tika JAXRS Server is a fully RESTful service which talks HTTP, and exposes a wide range of Tika's modes. It's the generally recommended way these days to interface with Tika over the network, and/or from non-Java stacks.
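As a rough sketch of what a PUT call against the JAX-RS server looks like from the shell, assuming the server runs locally on its default port and exposes the /tika endpoint (the function names and the TIKA_HOST / TIKA_PORT variables are illustrative, not part of Tika):

```shell
# Sketch: PUT a document to a running tika-server and print the extracted plain text.
tika_url() {
    # Build an endpoint URL; host and port fall back to localhost:9998.
    printf 'http://%s:%s/%s\n' "${TIKA_HOST:-localhost}" "${TIKA_PORT:-9998}" "${1:-}"
}

tika_text() {
    # curl -T uploads the file with an HTTP PUT; the Accept header asks for plain text.
    curl -s -T "$1" -H 'Accept: text/plain' "$(tika_url tika)"
}
```

Calling tika_text report.doc would then print the text of a hypothetical report.doc; swapping the endpoint name in tika_url selects a different resource.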

Intercrop answered 2/9, 2012 at 6:9 Comment(3)
This answer helped me a lot. In fact, the server doesn't only return HTML: with other options such as "-j", it returns JSON metadata instead. – Tunnell
You may want to use curl instead: curl -s http://localhost:9998/tika --header "Accept: text/plain" -T filename.xls – Romo
Exception in thread "main" java.lang.IllegalArgumentException: As of Tika 2.0, the server option is no longer supported in tika-app. As this error states, the tika-app server option has been removed, so the only way is to run the dedicated Tika server. – Along

Just adding to @Gagravarr's great answer.

When talking about Tika in server mode, it is important to distinguish between two variants, which are otherwise easy to confuse:

  • tika-app.jar has the --server --port 9998 options to start a simple server
  • tika-server.jar is a separate component using JAX-RS

The first option only provides text extraction and returns the content as HTML. Most likely, what you really want is the second option, which is a RESTful service exposing many more of Tika's features.

You can download tika-server.jar from the Tika project site. Start the server using:

java -jar tika-server-x.x.jar -h 0.0.0.0

The -h 0.0.0.0 (host) option makes the server listen for requests from any address; without it, the server only accepts requests from localhost. You can also add the -p option to change the port, which defaults to 9998.

Then, once the server has started you can simply access it using your browser. It will list all available endpoints.
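If you start the server from a script, you may want to wait until it answers before sending documents. A minimal sketch, assuming the root URL responds once the server is up (as the browser check above suggests); the function name, retry count, and TIKA_HOST / TIKA_PORT variables are illustrative only:

```shell
# Sketch: poll the server's root URL until it responds or the retries run out.
tika_wait() {
    tries=${1:-30}
    url="http://${TIKA_HOST:-localhost}:${TIKA_PORT:-9998}/"
    while [ "$tries" -gt 0 ]; do
        # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
        if curl -s -f -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, tika_wait 10 && echo "server is up" gives the server roughly ten seconds to come up.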

Finally, to extract metadata from a file, you can use cURL like this:

curl -T testWORD.doc http://example.com:9998/meta

This returns the metadata as key/value pairs, one per line. You can also have Tika return the results as JSON by adding the proper Accept header:

curl -H "Accept: application/json" -T testWORD.doc http://example.com:9998/meta
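The two calls above differ only in the Accept header, so they can be folded into one helper. This is a sketch under the assumption that the server is reachable on localhost:9998; the function names, the "json" keyword, and the TIKA_HOST / TIKA_PORT variables are invented for the example:

```shell
# Sketch: fetch metadata from /meta, optionally as JSON.
tika_accept() {
    # Map a short format name to an Accept header value; empty means server default.
    case ${1:-} in
        json) echo 'application/json' ;;
        *)    echo '' ;;
    esac
}

tika_meta() {
    file=$1
    accept=$(tika_accept "${2:-}")
    url="http://${TIKA_HOST:-localhost}:${TIKA_PORT:-9998}/meta"
    if [ -n "$accept" ]; then
        curl -s -H "Accept: $accept" -T "$file" "$url"
    else
        curl -s -T "$file" "$url"
    fi
}
```

tika_meta testWORD.doc prints the default key/value output, while tika_meta testWORD.doc json asks for JSON.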

[Update 2015-01-19] Previously this answer said that tika-server.jar was not available as a download. Fixed, since it does exist as a binary download.

Chronograph answered 18/1, 2015 at 4:27 Comment(5)
The Tika Server has been built and distributed as standard for some time now! You can find it on your nearest Apache mirror, or follow the link from the download page. – Intercrop
I'd suggest you edit your answer to direct people to download the tika-app and tika-server jars from the mirrors, rather than tika-src, as it'll be much quicker and easier for them! – Intercrop
I prefer this answer; it's more in-depth. – Coessential
Is it possible to extract content from a URL with tika-server? It gives HTML back when I tried with this curl: curl #12232130 localhost:8000/tika/main – Luci
Thanks for mentioning the -h option for running the server. – Along

To add to Gagravarr's excellent answer:

  • If your document comes from a web server => curl "http://myserver-domain/*path-to-doc*/doc-name.extension" | nc 127.0.0.1 12345
  • If the document is password-protected, pass the credentials with -u => curl -u login:*password* "http://myserver-domain/*path-to-doc*/doc-name.extension" | nc 127.0.0.1 12345
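
The same piping idea should also work against the standalone tika-server instead of nc, since curl can upload from standard input with -T -. The sketch below swaps the nc target for the /tika endpoint on localhost:9998; that swap, the function name, and the TIKA_HOST / TIKA_PORT variables are assumptions for illustration, not something from the answer above.

```shell
# Sketch: download a document and stream it straight into tika-server.
tika_from_url() {
    doc_url=${1:-}
    creds=${2:-}    # optional login:password for the source web server
    if [ -z "$doc_url" ]; then
        echo "usage: tika_from_url URL [login:password]" >&2
        return 2
    fi
    tika="http://${TIKA_HOST:-localhost}:${TIKA_PORT:-9998}/tika"
    if [ -n "$creds" ]; then
        curl -s -u "$creds" "$doc_url"
    else
        curl -s "$doc_url"
    fi | curl -s -T - -H 'Accept: text/plain' "$tika"   # -T - reads the upload from stdin
}
```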
Amabel answered 18/6, 2013 at 16:15 Comment(0)
