Difference between the Apache POI API and the Apache Tika API?
I

2

7

I have a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task.

While going through Tika, I came across the POI API and found it friendlier to use.

We may also need to parse PDF files in the future.

I am new to this technology. I would like to know the difference between the two and which one is more suitable for my requirement.

Thanks, Krishna

Inception answered 19/9, 2013 at 6:47 Comment(1)
Did you check the tag info for the tags you added to your question? Litigate
P
21

Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent metadata and textual content.
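To make the "call Tika and get text back" idea concrete, here is a minimal sketch using the `Tika` facade class. It assumes `tika-core` plus the Tika parser modules are on the classpath; the class name `TikaSketch` and its helper methods are made up for illustration.

```java
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Tika itself.
class TikaSketch {

    // Guess the media type (for example "application/pdf") from the leading bytes.
    static String detectType(byte[] content) {
        return new Tika().detect(content);
    }

    // Extract plain text from whatever format Tika recognises.
    // Note: the facade truncates very long output by default
    // (see Tika.setMaxStringLength).
    static String extractText(byte[] content, Metadata metadata) throws Exception {
        Tika tika = new Tika();
        try (InputStream in = new ByteArrayInputStream(content)) {
            return tika.parseToString(in, metadata);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to any document: .xls, .pdf, .docx, ...
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        Metadata metadata = new Metadata();

        System.out.println("Detected type: " + detectType(bytes));
        System.out.println(extractText(bytes, metadata));

        // Metadata keys are normalised by Tika across formats.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

The same code path handles a spreadsheet, a PDF or a Word file, which is the main attraction. The trade-off is that you get a flat block of text, not addressable rows and columns.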

Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
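For the "specific columns/rows" requirement in the question, a rough sketch with POI's usermodel API could look like the following. It assumes a reasonably recent POI (4.x or later) with `poi` and `poi-ooxml` on the classpath; `PoiColumnSketch`, `readColumn` and `sampleWorkbook` are illustrative names, not POI API.

```java
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of POI itself.
class PoiColumnSketch {

    // Read one column (0-based) from one sheet as display strings.
    // WorkbookFactory picks HSSF (.xls) or XSSF (.xlsx) based on the content.
    static List<String> readColumn(InputStream in, int sheetIndex, int columnIndex) throws Exception {
        DataFormatter formatter = new DataFormatter();
        List<String> values = new ArrayList<>();
        try (Workbook workbook = WorkbookFactory.create(in)) {
            Sheet sheet = workbook.getSheetAt(sheetIndex);
            // Iterating a Sheet visits only rows that physically exist in the file.
            for (Row row : sheet) {
                // getCell returns null for a missing cell; formatCellValue maps null to "".
                values.add(formatter.formatCellValue(row.getCell(columnIndex)));
            }
        }
        return values;
    }

    // Build a tiny .xlsx in memory so the sketch can be tried without a real file.
    static byte[] sampleWorkbook() throws Exception {
        try (Workbook workbook = new XSSFWorkbook();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            Sheet sheet = workbook.createSheet("data");
            Row header = sheet.createRow(0);
            header.createCell(0).setCellValue("name");
            header.createCell(1).setCellValue("qty");
            Row first = sheet.createRow(1);
            first.createCell(0).setCellValue("apple");
            first.createCell(1).setCellValue(3);
            workbook.write(out);
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleWorkbook();
        System.out.println(readColumn(new ByteArrayInputStream(bytes), 0, 0));
    }
}
```

Note that POI does not read CSV or PDF; for those you would need a CSV reader and something like Tika (which delegates PDF parsing to another library).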

If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.

Pullman answered 19/9, 2013 at 14:28 Comment(1)
@ha9u63ar 20 seconds on Google or clicking the link in the answer would've found you the Apache Tika supported formats page, which tells you exactly what is supported, including Word formats... Pullman
L
1

Apache POI is a full-blown parser/writer for most Microsoft Office document formats. For Excel, it supports both the XML-based format introduced with Office 2007 (XSSF, .xlsx) and the older binary format used up to Office 2003 (HSSF, .xls). Apache POI provides two levels of API for parsing and generating Microsoft files. The higher-level API is somewhat memory intensive: it reads the whole file and keeps it in memory, similar to DOM parsing in XML. The lower-level API is meant for memory-efficient use and is similar to SAX/StAX parsing.

Apache Tika, on the other hand, is a content analysis tool. It extracts text and metadata from Microsoft Excel files and from many other formats by delegating to other extraction components. Tika has no support for writing new files or generating content; that is not its use case at all.

So you have to choose depending on your needs.

Live answered 19/9, 2013 at 7:1 Comment(1)
I want to parse PDF, Word and Excel (2003 - 2007), PPT, CSV, and .txt files. I know that PDF, TXT, and JPG files are okay without any dependencies, but I am constantly getting errors when using .docx and .doc files. Bicentennial
