How to detect image in a document
B

2

3

How can I detect images in a document, say a doc, xls, ppt or pdf file?

I came across Apache Tika and am trying its command-line option: http://tika.apache.org/1.2/gettingstarted.html

But I am not quite sure how it will detect images.

Any help is appreciated.

Thanks

Brindisi answered 13/8, 2012 at 10:45 Comment(2)
Do you want a purely command-line solution, or are you happy to write some Java? - Discontinue
@Gagravarr I would like a command-line solution, as I want to use Tika with Python. - Brindisi
D
3

You've said you want a command-line solution and don't want to write any Java code, so this isn't going to be the prettiest way to do it. If you are happy to write a little bit of Java and create a new program to call from Python, then you can do it much more cleanly!

The first thing to do is to have the Tika App extract any embedded resources from your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.

$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)

Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical MIME type, like the application/x-emf above!). You might need to run a second --detect step on a few of the files. I'm not sure, so test how the parsers get on with the extraction.
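Since you want to drive this from Python, here is a minimal sketch of the parsing step. It assumes the output lines look exactly like the two shown above. The function name find_images and the set of extra image types are my own, not anything Tika provides, and the set is only a starting guess that you should extend after testing with your own files.

```python
import re

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
EXTRACT_LINE = re.compile(r"^Extracting '(?P<name>.*)' \((?P<mime>[^()]+)\)\s*$")

# Assumption: image formats whose MIME type does not start with "image/".
# This list is a guess; extend it after testing against your own documents.
NON_IMAGE_PREFIX_IMAGE_TYPES = {"application/x-emf", "application/x-msmetafile"}


def find_images(tika_output):
    """Return (filename, mime_type) pairs for entries that look like images."""
    images = []
    for line in tika_output.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if not match:
            continue  # not an "Extracting ..." line
        name = match.group("name")
        mime = match.group("mime").lower()
        if mime.startswith("image/") or mime in NON_IMAGE_PREFIX_IMAGE_TYPES:
            images.append((name, mime))
    return images


sample = """Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)"""
print(find_images(sample))  # prints [('image1.emf', 'application/x-emf')]
```

An empty result means no embedded resource was reported with an image type, which is the "detect" answer you are after.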

Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
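The extract-then-clean-up cycle could be wrapped roughly like this in Python. Treat it as a sketch under a few assumptions: java is on your PATH, the jar is called tika.jar as in the example above, --extract writes into the current working directory, and the "Extracting ..." messages arrive on standard output (check stderr if you get nothing). The function name extract_embedded and its run parameter are hypothetical; run is only there so the command runner can be swapped out.

```python
import os
import subprocess
import tempfile


def extract_embedded(doc_path, tika_jar="tika.jar", run=subprocess.run):
    """Extract embedded resources into a throwaway directory.

    Returns (tika_stdout, extracted_file_names). The directory and the
    extracted files are deleted before the function returns.
    """
    # Absolute paths, because the command runs from inside the temp dir.
    command = ["java", "-jar", os.path.abspath(tika_jar),
               "--extract", os.path.abspath(doc_path)]
    with tempfile.TemporaryDirectory(prefix="tika-extract-") as work_dir:
        # Assumption: Tika App writes extracted files to the working directory.
        result = run(command, cwd=work_dir, capture_output=True,
                     text=True, check=True)
        extracted = sorted(os.listdir(work_dir))
        # Process the files here, before the directory is removed.
    return result.stdout, extracted
```

Calling extract_embedded("testWORD_embedded_pdf.doc") would give you the captured output plus the names of whatever was extracted, and the temp dir is gone by the time it returns, so nothing is left lying around.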

Discontinue answered 14/8, 2012 at 15:25 Comment(4)
Thanks dude, that is good. Can I get just the info about the images? I do not want to extract them to a directory. Is that possible? - Brindisi
Yes, you can do that, but only if you write some Java code! If you want to do it using only the Tika App command-line tool, then extracting and cleaning up later is the only way. - Discontinue
I'm wondering if you could post a link or code for detecting images in a file using the Tika library. - Lentamente
@MohamadGhafourian That's an entirely different query, so you'll need to ask it as a brand new question. - Discontinue
J
0

Having used Tika in the past, I originally answered that I couldn't see how it could help with images embedded in Office documents or PDFs, and that you would have to resort to native APIs like Apache POI and Apache PDFBox, which Tika itself uses to parse text and metadata. I was wrong about the lack of embedded image support (see the update below), though you may still choose to use those native APIs directly.

Using Tika makes these APIs available automatically, as a side effect of pulling in Tika.

UPDATE: Since Tika 0.8, look for EmbeddedResourceHandler and its examples (thanks to Gagravarr).

Jospeh answered 13/8, 2012 at 18:49 Comment(3)
Can you suggest any other tools besides Apache Tika? - Brindisi
This is incorrect, Tika handles embedded resources just fine! - Discontinue
@Discontinue - thank you for correcting me - added link to appropriate Tika interface. - Jospeh
