How do I determine programmatically if a PDF is searchable?

I have a CSV with a list of URLs with PDFs:

  • Some of these PDFs are searchable.
  • Some of these PDFs aren't searchable.

I want to determine which PDFs are searchable from my list of PDFs. Is there an easy way to do this?

Schroeder answered 5/8, 2012 at 21:32 Comment(6)
Zamindar: What do you mean by searchable? That they contain text and not images?
Ss: I haven't tried this, but the first hit on Bing suggests that searching the PDF file contents for "FontName" will identify the searchable ones.
Schroeder: That the PDF has OCRed text. I'll look into FontName.
Schroeder: Yeah, strings foo.pdf | grep FontName
Pulvinus: Unfortunately, grepping for "FontName" is not sufficient. I've seen many searchable PDF files, apparently created from (or by) PowerPoint, that have "/Font" and "/BaseFont" but not "FontName". I am currently grepping for both FontName and BaseFont.
Venusian: To return only the files without the string "Font", you can use the -L switch in grep.
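
The grep heuristic from the comments above can be sketched in Python as follows. This is only an illustration: the names FONT_MARKERS, has_font_marker and file_has_font_marker are made up here, and the check shares the weakness of the grep approach, since it only sees font dictionaries stored uncompressed. A PDF that keeps its objects in compressed streams may contain fonts and still be reported as having none.

```python
from pathlib import Path

# Byte strings the commenters grep for. Both are standard PDF names
# that appear in font-related dictionaries.
FONT_MARKERS = (b"/FontName", b"/BaseFont")


def has_font_marker(data: bytes) -> bool:
    """Return True if any font marker occurs in the raw PDF bytes."""
    return any(marker in data for marker in FONT_MARKERS)


def file_has_font_marker(path: str) -> bool:
    """Read the whole file into memory and apply the byte-level check."""
    return has_font_marker(Path(path).read_bytes())
```

Calling file_has_font_marker("foo.pdf") mirrors strings foo.pdf | grep for either marker. A False result is the equivalent of a file listed by grep -L.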

On the command line, I'd use pdffonts to list the fonts the file uses. It also runs quite fast.

Example 1: PDF containing text

pdffonts bash-manpage.pdf 
  
  name                            type          encoding        emb sub uni object ID
  ------------------------------- ------------- --------------- --- --- --- ---------
  Times-Roman                     Type 1        Custom          no  no  no       8  0
  Times-Bold                      Type 1        Standard        no  no  no       9  0
  Helvetica                       Type 1        Custom          no  no  no      11  0
  Helvetica-Bold                  Type 1        Standard        no  no  no      30  0

Example 2: PDF containing only images

pdffonts scanned-book.pdf
  
  name                            type           encoding       emb sub uni object ID
  ------------------------------- -------------- -------------- --- --- --- ---------

  1. Example 1 shows a table listing font names. This means the file contains text that can be searched.

  2. Example 2 shows an empty table: no fonts, so no text to search. (You could run OCR on the file first to embed any recognized text, but then you've created a different file.)
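
A minimal Python sketch of the same check, assuming pdffonts (it ships with the Poppler and Xpdf command-line utilities) is installed and on the PATH. The function names count_font_rows and looks_searchable are made up for illustration, and the parser assumes the output layout shown in the two examples: a header line, a dashed separator line, then one row per font.

```python
import subprocess


def count_font_rows(pdffonts_output: str) -> int:
    """Count the non-empty rows that follow the dashed separator line."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("---"):
            return sum(1 for row in lines[i + 1:] if row.strip())
    # No separator line found: treat the output as listing no fonts.
    return 0


def looks_searchable(pdf_path: str) -> bool:
    """Run pdffonts on the file and report whether it lists any font.

    Raises subprocess.CalledProcessError if pdffonts exits with an error
    and FileNotFoundError if pdffonts is not installed.
    """
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_font_rows(result.stdout) > 0
```

For a list of URLs, you would download each PDF to a local file first and then call looks_searchable on that path. A True result only says that fonts are listed, which makes searchable text likely rather than certain.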

Note: actually extracting the embedded text, and hence being able to search it, is an entirely different problem. In many cases you'll find it extremely difficult, especially if the font table lists types like CID Type with 'Custom' encoding. You may first want to search Stack Overflow for other questions about text extraction from PDF.

Playacting answered 5/8, 2012 at 22:14 Comment(4)
Devonian: "use pdffonts to determine which fonts the file uses" - does that tool really check whether the fonts are used? Or does it only check whether they are defined as resources? If the latter is the case, the presence of fonts is not a 100% sure indication of searchable text.
Playacting: @mkl: If you want 100% sure indications about PDFs analysed programmatically and automatically, go to a different universe. You can't have that here. Here we only handle up to 99% sure indications. I would be able to hand-craft a PDF that shows you "You're in Heaven" text on the page, but extracts as "You're in Hell" if you handle it programmatically. More than 99.99% of real-world PDFs in this universe are created programmatically by tools which do not output this type of nonsense, and which do not embed fonts that are never used.
Devonian: Correct. I merely wanted to point out that it is only likely that a provided font is used; it is not a sure thing. Being sure of anything in PDFs is not trivial.
Ryannryazan: How do I install pdffonts? Also, is there any way I can check this by writing a Python script?
