How to know if a PDF contains only images or has been OCR scanned for searching?
B

8

32

I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?

I'm planning on doing this in C# or VB.NET but I don't think being able to tell the two kinds of files apart is language dependent.

Backset answered 28/9, 2009 at 22:45 Comment(0)
S
28

Scanned images that were converted to PDF and then OCR'ed afterwards to make the text searchable normally contain the text parts rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when a search succeeds, the highlighted hits are on the invisible text.

I'd recommend looking at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html

Example usage of pdffonts:

C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]

This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).

C:\downloads\> pdffonts examle1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no   14    0
Arial                                TrueType          no  no  no   15    0

This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.

C:\downloads\> pdffonts examle2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

This PDF does not use a single font, and hence does not have any text embedded (so no OCR either).
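
If you want to automate the check above, a minimal sketch in Python could count the rows that pdffonts prints below the dashed separator line. The function names count_fonts and pdf_has_fonts are hypothetical, and the sketch assumes pdffonts is on the PATH and prints the table layout shown in the examples.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows that follow the dashed separator line."""
    rows = 0
    seen_separator = False
    for line in pdffonts_output.splitlines():
        if seen_separator and line.strip():
            rows += 1  # every non-empty line after the separator is one font
        elif line.startswith("---"):
            seen_separator = True
    return rows


def pdf_has_fonts(pdf_path: str) -> bool:
    """Run pdffonts (assumed to be installed) and report whether any font is listed."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return count_fonts(result.stdout) > 0
```

A count of zero suggests an image-only PDF. A non-zero count only says that a font object exists, not that it is actually used for text on a page.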

Example usage of pdftotext:

C:\downloads\> pdftotext ^
                   -layout ^
                   cisco-ip-phone-7911-guide6.1.pdf ^
                   cisco-ip-phone-7911-guide6.1.txt

This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If there is no text in the PDF, you'd know there was no OCR...
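
The same test can be scripted. The following is only a sketch, assuming pdftotext is on the PATH; it passes "-" as the output file so the text goes to stdout instead of a .txt file. The names contains_text and pdf_has_text are made up for illustration.

```python
import subprocess


def contains_text(extracted: bytes) -> bool:
    """True if the output holds anything besides whitespace and page breaks."""
    # bytes.strip() removes ASCII whitespace, including the form feed (\x0c)
    # that pdftotext writes between pages.
    return bool(extracted.strip())


def pdf_has_text(pdf_path: str) -> bool:
    """Run pdftotext (assumed to be installed) and check for extractable text."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return contains_text(result.stdout)
```

Working on raw bytes avoids decoding problems when the output encoding is not known in advance.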

Sikorsky answered 24/6, 2010 at 9:8 Comment(9)
I tried your approach, but for some scanned PDF files the "pdffonts" command still returns a Helvetica font. Can you explain or guide me on how I can do this more accurately? Thanks - Anagoge
@DanglingPiyush: Without a sample of such a Scan-PDF file I'm not able to tell you where the Helvetica comes from. Can you provide a sample page that shows this behavior? - Sikorsky
fileconvoy.com/… This is the link to a sample PDF. It contains only scanned images, but pdffonts shows a Helvetica font. Please have a look at it. - Anagoge
Have you looked at it? - Anagoge
@DanglingPiyush: This file contains a /Font object that is not really used anywhere in the file. (My theory about the cause of this is that the PDF creating software {the file's metadata calls it "Canon"} was set up to apply OCR, and this software uses Helvetica as its default OCR font, but it didn't identify any OCR-able text...) - Sikorsky
Thanks a lot for pointing this out. Can you guide me on how to deal with this kind of file? And what tool did you use to extract the information mentioned above? It would be helpful for me in the future. Thanks - Anagoge
@DanglingPiyush: I basically used two items: (1) A command: qpdf --qdf --object-streams=disable in order to de-compress (most) binary PDF objects and make the resulting file's PDF source code easily viewable/editable in a text editor. (2) The official PDF specification in order to understand the PDF source code. - Sikorsky
@DanglingPiyush: You should check your scanner and its software if it provides a setting for you to disable automatic OCR of scanned pages. - Sikorsky
I know this is a very old post, but I now have the same question. I was wondering if you can give some pointers on how to use the command tools at the link you gave? I'm afraid I don't normally use command tools (but am keen to learn), and I can't understand the documentation at the website. I think it assumes the user already knows how to work with these tools. I work on a Mac and have used Terminal and some very basic shell commands... so any other pointers would be very helpful! Thanks - Mekong
I
0

Various PDF tools can tell you if there's text. Some are available as COM controls, and maybe even native .NET ones.

Infrequent answered 28/9, 2009 at 23:0 Comment(1)
Can you recommend one that you know works, or that I should try? - Backset
P
0

Open the document in Acrobat. Go to File -> Properties. Look in the "Advanced" section and find the PDF Producer. If it reads something like "Paper Capture..." then it has been OCR'd.

Hope this helps.

Plasmasol answered 22/4, 2010 at 18:10 Comment(1)
Right, in my sample sets, the image based PDFs have a blank PDF Producer, but the ones that were OCR'd show, "Adobe Acrobat 8.16 Paper Capture Plug-in." But I found another one that has selectable text and the producer is, "Acrobat Distiller 5.0.5 (Windows)." And another with text, "createpdf.adobe.com v5.1." Others with text "Microsoft Office Word 2007" and "GPL Ghostscript 8.54." It seems like the producer is blank for image based PDFs but some other value for PDFs that contain text. - Backset
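
For a batch of files, the Producer field can also be read without opening Acrobat. The sketch below assumes the pdfinfo tool from the Xpdf set mentioned in the other answer, which prints one "Key: value" line per metadata field; get_producer and pdf_producer are hypothetical names. As the comment above shows, the Producer value is only a hint and many non-OCR producers also create text.

```python
import subprocess


def get_producer(pdfinfo_output: str) -> str:
    """Return the Producer value from pdfinfo output, or "" if it is absent."""
    for line in pdfinfo_output.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip() == "Producer":
            return value.strip()
    return ""


def pdf_producer(pdf_path: str) -> str:
    """Run pdfinfo (assumed to be installed) and extract the Producer field."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return get_producer(result.stdout)
```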
M
0

I use Everything by VoidTools to do a regex content search on the PDFs. Any PDF with absolutely no text is a good candidate.

e.g. .pdf regex:content:^$ This searches for all files with .pdf in the name that have empty content (^$ means: a start of a line and an end of a line with nothing in between). Alternatively: regex:content:^(?![\s\S])

Manageable answered 10/4, 2022 at 9:23 Comment(0)
S
-1

Apago's pdfspy extracts information from PDF into an XML file. It includes information about the document including images and text. For your project, the useful information includes image count & size and where there is OCR (hidden) text.

http://www.apagoinc.com/pdfspy

Squish answered 28/12, 2009 at 12:3 Comment(0)
D
-1

Sorry to dig up old thread, but if you found this have a look at my thread:

Batch OCR Program for PDFs

You can get extra information about the PDF by catting it in Unix/Linux/OS X or opening it in "rb" mode in Python. (Of course that's Python and you didn't want to use that, but maybe your language has something equivalent.)
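
As a rough sketch of the "rb" idea, the snippet below reads the raw bytes and looks for the /Font and /Image names. The names raw_pdf_hints and inspect_pdf are made up for illustration. This is only a heuristic: objects stored in compressed object streams will not show up in a plain byte search, and a /Font object can exist without being used.

```python
def raw_pdf_hints(data: bytes) -> dict:
    """Report whether the raw PDF bytes mention font or image objects."""
    return {
        "mentions_font": b"/Font" in data,
        "mentions_image": b"/Image" in data,
    }


def inspect_pdf(pdf_path: str) -> dict:
    """Read the file in binary mode and apply the byte-level heuristic."""
    with open(pdf_path, "rb") as handle:
        return raw_pdf_hints(handle.read())
```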

Drag answered 1/7, 2011 at 20:45 Comment(0)
L
-3

Use "dtsearch" to create an index for all the pdf files... then "view the log file" of the indexing process to check the list of pdf files that were not indexed.

Ladon answered 25/4, 2016 at 1:49 Comment(0)
R
-4

A very low tech solution: any file that has OCR'd text will undoubtedly contain the letter "a", so search for all files whose contents don't contain the letter a, i.e. "NOT a". Any file that shows up won't have been OCR'd.

Redmund answered 22/1, 2014 at 11:40 Comment(0)
