C/C++ alternative to Apache Tika

2

10

I am looking for a C/C++ alternative to the Apache Tika framework, which is Java-based. Specifically, I am looking for file metadata and structured text extraction under a single framework. After some online searching, the closest things I have found are GNU libextractor and a collection of individual file filters that parse documents to extract text (pdftotext, xls2csv, etc.).

Can anyone recommend a good library comparable to Apache Tika?

Thanks

Dividend answered 3/6, 2011 at 22:11 Comment(0)
2

KDE provides a library called KFileMetaData which they internally use for their file indexer.

It is written in C++ with Qt5 and supports most of the basic formats, such as MS Office 2007, ODF, PDF, images, video, audio, and ebooks.

Leflore answered 27/4, 2015 at 13:48 Comment(1)
Note: This library just shells out to other programs, like catdoc, to get text. – Cleft
1

Tika has a network server mode, so you could always start Tika using that and then send it requests from your C++ code?

Alternatively, Tika has a CLI mode, so you could start a new Tika process for each file and read its output from a pipe.
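As a rough illustration of the pipe approach, here is a minimal C++ sketch for a POSIX system. It assumes `java` is on the PATH and that you pass in the path to your own copy of the Tika application jar; the helper names (`shell_quote`, `tika_text_command`, `run_and_capture`) are made up for this example. It uses the Tika app's `--text` option to request plain text, and I have not tuned it for large documents or startup cost (each call launches a new JVM).

```cpp
#include <array>
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Wrap an argument in single quotes for a POSIX shell, escaping embedded quotes.
std::string shell_quote(const std::string& arg) {
    std::string quoted = "'";
    for (char c : arg) {
        if (c == '\'') {
            quoted += "'\\''";  // close the quote, emit an escaped quote, reopen
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Build the command line that asks the Tika app for plain text.
// The jar path is whatever your local installation uses (an assumption here).
std::string tika_text_command(const std::string& jar, const std::string& file) {
    return "java -jar " + shell_quote(jar) + " --text " + shell_quote(file);
}

// Run a command through popen() and return everything it wrote to stdout.
std::string run_and_capture(const std::string& command) {
    assert(!command.empty());
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n = 0;
    while ((n = fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    int status = pclose(pipe);  // wait status of the child process
    if (status != 0) {
        throw std::runtime_error("command failed with wait status " +
                                 std::to_string(status));
    }
    return output;
}
```

Calling `run_and_capture(tika_text_command(jar_path, "report.pdf"))` should then hand back the extracted text as a string, where `jar_path` and `report.pdf` are placeholders for your own paths. Quoting the arguments matters because `popen()` goes through the shell, so file names with spaces or quotes would otherwise break the command.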

Lamonica answered 4/6, 2011 at 6:12 Comment(4)
This is a nice idea in theory, but has it ever been documented? Understanding the server mode may require some digging through code and discussion groups. Documentation seems to be a bit of a problem on the Tika project, which is unfortunate, because it looks to be a comprehensive tool. – Lienlienhard
Probably only documented in code for now, as it's under active development. If you're interested, the best bet is to ask on the mailing list; that might prod one of the committers who look after it to write up some docs :) – Lamonica
For anyone coming to this in future, the question has now been asked on the Tika users list. Long term, that thread will hopefully contain the right answer! – Lamonica
That was me. I'll follow it through, and if I need to write up some docs, I will link them back to here as well. Thanks for linking. It makes sense that questions asked in lots of places ultimately lead to the answer somewhere. – Lienlienhard
