Apache Tika vs. Apache Lucene

Asked 10/10, 2017 at 9:26 Answered 11/8, 2018 at 1:15

I have a question about analyzing documents. With Apache Tika, it is possible to extract the content and metadata of files of many different types.

Is it also possible to get keywords from files (e.g. via stemming) with Tika, or do I still need Lucene for that?

Sign answered 10/10, 2017 at 9:26

I don't know if it's possible, but I would recommend doing all the keyword analysis in Lucene. My personal reasons:

Tika's main goal is to extract information from files.
Lucene defines how data is analyzed and indexed. How the data is analyzed has a big impact on how your Lucene index performs in searches (i.e. whether you find what you expect to find).
It's a separation of concerns: Tika only extracts, and Lucene takes care of everything relevant to search.
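
To illustrate why the analysis step matters for search, here is a minimal sketch in plain Java. It is not Lucene's API: the class `ToyAnalyzer` and its methods are hypothetical, and the suffix stripping is a deliberately naive stand-in for the real stemmers Lucene ships. It only shows the idea that different word forms are normalized to one term before indexing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy analyzer: lowercases, splits on non-alphanumerics,
// then strips a few common English suffixes. Not a real stemmer.
final class ToyAnalyzer {

    // Naive suffix stripping (assumption for illustration only).
    // The length check keeps very short words such as "is" intact.
    static String stem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Turns raw text into the list of normalized terms that would be indexed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty()) {
                terms.add(stem(raw));
            }
        }
        return terms;
    }
}
```

With this sketch, `ToyAnalyzer.analyze("Walking, walked; WALKS")` yields the same term three times, so a query for "walk" would match all three forms. If the query text is not run through the same analysis as the indexed text, such matches are lost, which is why it makes sense to keep this step on the Lucene side.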

Piperidine answered 13/10, 2017 at 13:41

Tika and Lucene do different things.

Tika exists to grab data out of files. For example, you can use Tika to extract the text out of a PDF.

Lucene is an indexer. When you provide Lucene with the text of Doc1.txt, Doc2.txt and Doc3.txt, it indexes them so that you can later search for a word or phrase like 'hello', and Lucene responds with a list of the documents that contain it, along with the number of times it occurs in each document.
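
Conceptually, the lookup described above can be sketched as a tiny inverted index in plain Java. This is only an illustration of the idea, not how Lucene is implemented or used: `ToyInvertedIndex`, `add` and `search` are hypothetical names, and the tokenization is a simple lowercase-and-split.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy inverted index: maps each word to the documents
// containing it, together with the number of occurrences per document.
final class ToyInvertedIndex {

    // term -> (document name -> occurrences of the term in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Splits the text into lowercase words and counts each one for this document.
    void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docName, 1, Integer::sum);
        }
    }

    // Returns the matching documents and their counts, or an empty map.
    Map<String, Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Map.of());
    }
}
```

After adding "Hello world, hello again" as Doc1.txt, "Goodbye world" as Doc2.txt and "hello" as Doc3.txt, `search("hello")` returns Doc1.txt with a count of 2 and Doc3.txt with a count of 1. The text passed to `add` is plain text, which is exactly what Tika would produce from a PDF or Office file.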

If you're going to index arbitrary content, you might use Tika to first extract the text, and then Lucene to index it.

Gunning answered 11/8, 2018 at 1:15
