Apache Tika and File access instead of Java Input Stream

I want to be able to create a new Tika parser to extract metadata from a file. We're already using Tika and the metadata extraction will be done consistently.

I think that I've run into this problem/enhancement request for Tika:

Allow passing of files or memory buffers to parsers

I have a console C++ executable that accepts the path to a file as input and then outputs the metadata that it finds, one name/value pair per line.
The C++ code relies on libraries that expect a file path when accessing the data, and rewriting this executable in Java is not an option. I thought that it would be fairly easy to plug this into Tika. But a Tika parser has to be written in Java, and the parser method that needs to be overridden takes an open input stream:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

So I guess that my only solution is to write the input stream to a temporary file, process that file, and finally clean it up. I hate messing with temporary files, and I'd have to worry about cleanup if something goes wrong and the file doesn't get deleted.
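For reference, the manual temp-file approach described above can be kept fairly contained with a try/finally. This is only a minimal sketch using the standard library: the class name `TempFileSpool`, the `PathAction` callback, and the `tika-spool-` file prefix are hypothetical names chosen for illustration, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika.
final class TempFileSpool {

    /** Callback that works on a file path and may throw IOException. */
    interface PathAction<T> {
        T apply(Path file) throws IOException;
    }

    /**
     * Copies the stream to a temporary file, runs the action on that file,
     * and deletes the file afterwards, even if the action throws.
     */
    static <T> T withTempFile(InputStream stream, PathAction<T> action) throws IOException {
        // createTempFile creates an empty file, so the copy must replace it.
        Path tmp = Files.createTempFile("tika-spool-", ".tmp");
        try {
            Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The finally block covers exceptions thrown by the action, but not a JVM that is killed mid-run, so a stale file can still be left behind in that case.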

Does anyone have a clever idea about how to cleanly deal with something like this?

Inexplicit answered 17/5, 2011 at 21:32 Comment(0)

There's TikaInputStream which should help. It handles wrapping a File or an InputStream, and converting between them internally as parsers require. It does all the temp file bits as needed for you.

Several Java parsers already make use of it because they need a File rather than an InputStream. What's more, users who have a file can pass it to the Parser wrapped as an InputStream, and the parser can read it as either a File or an InputStream, whichever suits its needs.

So, I'd suggest you just turn the InputStream into a TikaInputStream (which is just a cast if it's already one), then get the file and pass that to your C++ executable.
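A rough sketch of the "pass the file to your C++ executable" step, using only the standard library. In a parser, the File would come from something like TikaInputStream.get(stream).getFile(); that call is left as a comment here so the example runs without Tika on the classpath. The class name `ExternalMetadataExtractor`, the executable path, and the `=` separator are assumptions, since the question does not say how the name/value pairs are delimited.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika.
final class ExternalMetadataExtractor {

    /** Parses "name<separator>value" lines; lines without a usable name are skipped. */
    static Map<String, String> parsePairs(Reader output, char separator) throws IOException {
        Map<String, String> pairs = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(output);
        String line;
        while ((line = reader.readLine()) != null) {
            int idx = line.indexOf(separator);
            if (idx <= 0) {
                continue; // no separator, or an empty name
            }
            pairs.put(line.substring(0, idx).trim(), line.substring(idx + 1).trim());
        }
        return pairs;
    }

    /** Runs the executable on the file and collects the name/value pairs it prints. */
    static Map<String, String> extract(String executable, File file)
            throws IOException, InterruptedException {
        // In a Tika parser, 'file' would be obtained roughly like this:
        //   File file = TikaInputStream.get(stream).getFile();
        ProcessBuilder pb = new ProcessBuilder(executable, file.getAbsolutePath());
        // Send stderr to the parent's stderr so it cannot fill up and block the child.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        Map<String, String> pairs;
        try (Reader out = new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8)) {
            pairs = parsePairs(out, '='); // '=' is an assumed separator
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("extractor exited with code " + exit);
        }
        return pairs;
    }
}
```

Each resulting pair could then be copied into the Metadata object passed to parse, for example with metadata.set(name, value).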

Houseroom answered 17/5, 2011 at 22:36 Comment(1)
Thanks. I was looking for something like that. - Inexplicit

If I understand correctly, and assuming you're launching the C++ program using Runtime.exec, you could pass the Process's standard output stream as the InputStream that Tika wants. Would that work?

Ildaile answered 17/5, 2011 at 21:46 Comment(2)
I don't think that will work, but I don't know Tika well enough to say for sure. - Inexplicit
You and me both, but I know you can do a Process proc = Runtime.getRuntime().exec(cmd); InputStream is = proc.getInputStream(); and you'll have the output of the process available. You can wrap it in a BufferedInputStream and see what it does for you. Not sure how Tika would be able to tell the difference between that and any other InputStream. - Ildaile
