How to detect images in a document

Asked 13/8, 2012 at 10:45 Answered 14/8, 2012 at 15:25

Solved apache apache-tika

How can I detect images in a document, say a doc, xls, ppt or pdf file?

I came across Apache Tika and am trying its command line option: http://tika.apache.org/1.2/gettingstarted.html

But I am not quite sure how it would detect images.

Any help is appreciated.

Thanks

Brindisi answered 13/8, 2012 at 10:45 Comment(2)

Do you want a purely command line solution, or are you happy to write some Java? – Discontinue 13/8, 2012 at 11:45

@Gagravarr I would like a command line solution, as I want to use Tika with Python. – Brindisi 14/8, 2012 at 9:21

You've said you want a command line solution and don't want to write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much more nicely!

The first thing to do is to have the Tika App extract any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.

$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)

Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype, such as the application/x-emf above!). You might need to run a second --detect step on a few of the files. I'm not sure, so test how the parsers get on with the extraction.
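As a rough sketch of the parsing step, the Python below picks the image entries out of captured --extract output. It assumes the lines look like the sample above (the format may differ between Tika versions), and the names parse_extract_output, is_image and APPLICATION_IMAGE_TYPES are made up for illustration. The set of application/ image types is only a guess at a starting point, not a complete list.

```python
import re

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
# Assumption: the output format shown in the answer; other versions may differ.
EXTRACT_LINE = re.compile(r"^Extracting '(?P<name>.*?)' \((?P<mime>[^()]+)\)")

# Hypothetical, incomplete set of image types reported with an application/ prefix.
APPLICATION_IMAGE_TYPES = {"application/x-emf", "application/x-msmetafile"}


def parse_extract_output(text):
    """Return (filename, mimetype) pairs found in Tika App --extract output."""
    pairs = []
    for line in text.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if match:
            pairs.append((match.group("name"), match.group("mime")))
    return pairs


def is_image(mime):
    """Treat image/* plus a few known application/* types as images."""
    return mime.startswith("image/") or mime in APPLICATION_IMAGE_TYPES


# The two lines printed in the answer's example run.
sample = """Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)"""

images = [name for name, mime in parse_extract_output(sample) if is_image(mime)]
print(images)  # prints ['image1.emf']
```

Anything that is_image rejects but that you still suspect is an image would be a candidate for the second --detect pass.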

Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
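From Python, the whole extract, process, clean up cycle might look something like this sketch. It assumes the Tika App writes the extracted files into its current working directory, as in the example run above, and that the "Extracting ..." lines go to standard output. The function name extract_in_temp_dir and its process callback are invented for illustration.

```python
import os
import shutil
import subprocess
import tempfile


def extract_in_temp_dir(document, command, process):
    """Run an extraction command inside a private temp dir, then delete the dir.

    command: argument list that launches the Tika App with --extract, without
             the document path, e.g. ["java", "-jar", "/abs/path/tika.jar",
             "--extract"]. Use absolute paths, since it runs from the temp dir.
    process: callback called as process(workdir, stdout) while the extracted
             files still exist; its return value is passed back to the caller.
    """
    workdir = tempfile.mkdtemp(prefix="tika-extract-")
    try:
        result = subprocess.run(
            list(command) + [os.path.abspath(document)],
            cwd=workdir,          # assumption: files are written to the cwd
            capture_output=True,
            text=True,
            check=True,           # raise if the command exits with an error
        )
        return process(workdir, result.stdout)
    finally:
        # Zap the temp dir whether or not the extraction succeeded.
        shutil.rmtree(workdir, ignore_errors=True)


def list_extracted(workdir, stdout):
    """Example callback: return the extracted file names and the raw output."""
    return sorted(os.listdir(workdir)), stdout


# Example call (needs Java and a tika.jar at the given, hypothetical path):
# names, output = extract_in_temp_dir(
#     "testWORD_embedded_pdf.doc",
#     ["java", "-jar", "/opt/tika/tika.jar", "--extract"],
#     list_extracted,
# )
```

Doing the processing in a callback keeps the clean up in one place, so the temp dir is removed even if the processing raises.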

Discontinue answered 14/8, 2012 at 15:25 Comment(4)

Thanks dude, that is good. Can I get just the info about the images? I do not want to extract them to a directory. Is that possible? – Brindisi 16/8, 2012 at 10:57

Yes, you can do that, but only if you write some Java code! If you want to do it only using the Tika-App command line tool, then extracting and cleaning up later is the only way – Discontinue 16/8, 2012 at 11:11

I'm wondering if you could post a link or code for detecting images in a file using the Tika library. – Lentamente 24/11, 2013 at 12:57

@MohamadGhafourian That's an entirely different query, so you'll need to ask it as a brand new question – Discontinue 24/11, 2013 at 18:23

Having used Tika in the past ~~I can't see how Tika can help with images embedded within Office documents or PDFs~~ I was wrong to answer No. You ~~will have~~ may still try resorting to native APIs like Apache POI and Apache PDFBox. Tika does use both libraries to parse text and metadata, but I believed it had no embedded image support.

Using Tika makes these APIs automatically available (side effect of using Tika).

UPDATE: Since Tika 0.8: look for EmbeddedResourceHandler and examples - thanks to Gagravarr.

Jospeh answered 13/8, 2012 at 18:49 Comment(3)

Can you suggest any other tools besides Apache Tika? – Brindisi 14/8, 2012 at 9:21

This is incorrect, Tika handles embedded resources just fine! – Discontinue 14/8, 2012 at 10:49

@Discontinue - thank you for correcting me - added link to appropriate Tika interface. – Jospeh 14/8, 2012 at 23:6
