How to check if PDF is scanned image or contains text
I have a large number of files; some of them are images scanned into PDF and some are full/partial text PDFs.

Is there a way to check these files to ensure that we only process files which are scanned images, and not those that are full/partial text PDF files?

Environment: Python 3.6

Decoupage answered 16/4, 2019 at 8:54 Comment(5)
What is your goal with these PDFs? Do you want to extract text, or extract text from images?Macrobiotic
I want to extract data from both image data and text dataDecoupage
For a similar, but slightly different question: How can I distinguish a digitally-created PDF from a searchable PDF?Fonsie
The wording of this question is not logical to me at all. Many PDFs have both scanned images and text. These can either be text layers on scanned images (like what ocrmypdf generates) or they can be documents with independent elements of text and images (like if someone prints a Word document with images to PDF). I think the question is meant to ask about separating those PDFs with text from those without text, and the answers all do different things that may or may not be related. Clarifying what the actual question is would be helpful.Aronarondel
I have come across this question as it's exactly what I'm searching for but seems many people don't understand what's the goal with this question. For my goal to differentiate the scanned and non-scanned is because I want to use a pdf-extract package for non-scanned as it's fast and accurate, but to use ocr on scanned versions. This needs to be automated without user input, so differentiating between the two is important.Spiros
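As the last comment explains, the practical goal is routing: send text PDFs to a fast extractor and scanned PDFs to OCR. A minimal sketch of just the dispatch decision (the threshold and function name are hypothetical; the text is assumed to come from whatever extractor you use, e.g. PyMuPDF or pdfplumber):

```python
def route_pdf(extracted_text: str, min_chars: int = 50) -> str:
    """Decide how to process a PDF based on text already extracted from it.

    Returns "pdf-extract" when enough real text was found, otherwise "ocr".
    The min_chars threshold guards against PDFs that contain only a few
    stray characters (page numbers, watermarks) but no real text layer.
    """
    meaningful = extracted_text.strip()
    return "pdf-extract" if len(meaningful) >= min_chars else "ocr"
```

The threshold is a judgment call per corpus; a scanned page with an OCR text layer will still route to "pdf-extract", which may or may not be what you want.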
The below code will extract text data from both searchable and non-searchable PDFs.

import fitz

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    text += page.get_text()

You can refer to this link for more information.

If you don't have the fitz module, you need to do this:

pip install --upgrade pymupdf

Macrobiotic answered 16/4, 2019 at 11:41 Comment(3)
Thanks for the reply, but my question was: if a user uploads a PDF document, how will I check whether it is a scanned document or a text document? @Rahul AgarwalDecoupage
That is why I asked what the intent is. Why do you want to check? If it is to extract text, whatever it is, the above answer works!! :)Macrobiotic
But the accuracy is very poor! For most of the scanned PDFs I have, it is unable to extract anything, so I had to convert the PDFs to images first and then process those images one at a time!Mateusz
Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

You need to install the fitz and PyMuPDF modules; you can do it with pip.

The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. Moreover, it is important to install fitz BEFORE PyMuPDF, otherwise it provides some weird error about a missing frontend module (no idea why). So here is how I install the modules:

pip3 install fitz
pip3 install PyMuPDF==1.16.14

And here is the Python 3 implementation:

import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.

    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page_num, page in enumerate(doc):
        total_page_area = total_page_area + abs(page.rect)
        text_area = 0.0
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area = total_text_area + text_area
    doc.close()
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")

Although this answers your question (i.e. distinguishes between fully scanned and full/partial textual PDFs), this solution is not able to distinguish between full-textual PDFs and scanned PDFs that also have text within them (e.g. this is the case for scanned PDFs processed by OCR software - such as pdfsandwich or Adobe Acrobat - which adds "invisible" text blocks on top of the image, so that you can select the text).
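The area-ratio heuristic above can also be separated from the PDF library itself, which makes the threshold easy to unit-test. A sketch operating on pre-extracted geometry (plain numbers standing in for what get_text_percentage derives from fitz rectangles; the function name is hypothetical):

```python
def text_coverage(pages: list) -> float:
    """Fraction of total page area covered by text blocks.

    `pages` is a list of (page_area, block_areas) tuples, where
    block_areas is a list of the areas of that page's text blocks.
    Returns 0.0 for an empty document to avoid division by zero.
    """
    total_page_area = sum(page_area for page_area, _ in pages)
    total_text_area = sum(sum(blocks) for _, blocks in pages)
    if total_page_area == 0:
        return 0.0
    return total_text_area / total_page_area
```

Keeping the ratio logic pure also makes the >1.0 results reported in the comments easier to investigate: they can only come from the geometry fed in (text boxes larger than the page rectangle), not from the arithmetic.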

Jackfish answered 29/1, 2020 at 11:32 Comment(6)
with some really scanned documents it returns 1.0Tiloine
Thanks for this, seems like an interesting approach. However, I think it might no longer work as intended. Firstly you have to change getTextBlocks() to get_text_blocks() and secondly, after testing with a PDF that only contains an image, it still says text is present and returns a value around ~0.606 which doesn't seem right.Noncombatant
Just noticed get_text_blocks() is deprecated as well. Maybe it would be better to just use get_text("words") which gets you a list of words or an empty list if there are no words. This might be helpful to answer the question with a simple if clause.Noncombatant
I've just tested and it seems to work. I've added some notes about the versions of Python and PyMuPDF that I used, as well as some notes about installation. Hope this clarifies why the above code might not work with different versions, as well as with scanned PDFs that also have been processed by OCR softwareJackfish
Running this on a document returns a value greater than 1.0Ectoblast
@FrancescoPettini that might be due to text boxes being wider or larger than the page's MediaBox (i.e. page.rect)Jackfish
def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
Laurentian answered 20/12, 2019 at 6:58 Comment(0)
You can use pdfplumber. If the following code prints None, it's a scanned PDF; otherwise it's searchable.

    pip install pdfplumber

    import pdfplumber

    with pdfplumber.open(file_name) as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        print(text)

To extract text from a scanned PDF, you can use OCRmyPDF. It is a very easy package with a one-line solution. You can find more on the package here and a video explaining an example here.

Liebfraumilch answered 5/1, 2021 at 23:59 Comment(1)
Actually thats the problem in some case when the PDF is scanned, text can be recognized, selected, and pdfplumber can extract text from it, which is usually crap. I think other stuffs (like pdfminer) works similar.Bibb
Try OCRmyPDF. You can use this command to convert a scanned PDF to a digital PDF:

ocrmypdf input_scanned.pdf output_digital.pdf

If the input PDF is digital, the command will throw the error "PriorOcrFoundError: page already has text!".

import subprocess as sp
import re

output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!", output):
    print("Uploaded scanned pdf")
else:
    print("Uploaded digital pdf")
Gerta answered 29/11, 2019 at 4:12 Comment(0)
I created a script to detect whether a PDF was OCRd. The main idea: in OCRd PDFs, the text is invisible.

Algorithm to test whether a given PDF (f1) was OCRd:

  1. create a copy of f1 noted as f2
  2. delete all text on f2
  3. create images (PNG) for all (or just a few) pages for f1 and f2
  4. f1 was OCRd if all the images of f1 and f2 are identical.

https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh

#!/usr/bin/env bash
set -e
set -x

################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
#   bash is_scanned_pdf.sh [-p] file
#
#   Exit 0: Yes, file is a scanned PDF
#   Exit 99: No, file was created digitally
#
# Arguments:
#   -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################

# parse arguments
# h/t https://mcmap.net/q/40490/-how-do-i-parse-command-line-arguments-in-bash
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case $1 in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done

# increment to make it easier with page numbering
max_pages=$((max_pages + 1))

command_exists() {
  if ! command -v "$1" &>/dev/null; then
    echo "error: $1 is not installed." >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

orig=$PWD
num_pages=$(pdfinfo $1 | grep Pages | awk '{print $2}')

echo $num_pages

echo $max_pages

if ((($max_pages > 1) && ($max_pages < $num_pages))); then
  num_pages=$max_pages
fi

cd $(mktemp -d)

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p output/$i && echo $i
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage=$num_pages $1 &>/dev/null
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null
mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null

for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # difference in pixels, if 0 there are the same pictures
  # discard diff image
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done
Doubleminded answered 10/4, 2020 at 22:2 Comment(1)
The script returns 1 without any other messagesEctoblast
How about a PDF metadata check on '/Resources'?

I believe that wherever there is text in a PDF (an electronic document), there is a good chance a font is defined - especially in PDF, whose objective is to be a portable file format, so it maintains the font definitions.

If you are a pypdf user, try

from pypdf import PdfReader

reader = PdfReader(input_file_location)
page = reader.pages[page_num]

page_resources = page["/Resources"]

if "/Font" in page_resources:
    print(
        "[Info]: Looks like there is text in the PDF, contains:",
        page_resources.keys(),
    )
elif len(page_resources.get("/XObject", {})) != 1:
    print("[Info]: PDF Contains:", page_resources.keys())

    x_object = page_resources.get("/XObject", {})

    for obj in x_object:
        obj_ = x_object[obj]
        if obj_["/Subtype"] == "/Image":
            print("[Info]: PDF is image only")
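The same font-presence idea can be approximated without a full parser by scanning the raw bytes for a /Font key. This is a crude sketch (the function name is hypothetical) that assumes the page dictionaries are stored uncompressed; PDFs that keep their dictionaries in compressed object streams will need a real parser such as pypdf or pdfminer instead:

```python
import re


def raw_pdf_mentions_font(path: str) -> bool:
    """Heuristic: True if the raw PDF bytes contain a /Font key.

    Only reliable when object dictionaries are uncompressed; compressed
    object streams can hide /Font from a plain byte scan, and (as the
    comments note) a font's presence does not guarantee visible text.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    return re.search(rb"/Font\b", data) is not None
```

Like the /Resources check above, this only signals that text is *likely*, so it is best used as a cheap pre-filter before a slower, more accurate check.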
Coworker answered 11/11, 2019 at 16:14 Comment(2)
This is not a good solution in general. If a PDF contains text, it must contain a font, but if a PDF doesn't contain text, it can still contain fonts. The presence of fonts is an indicator of there being text, but not a guarantee.Autocratic
I found a pdf that has text but doesn't have /Font. It has full page of selectable text but under resources there is only generation, idnum and pdf.Pannikin
None of the posted answers worked for me. Unfortunately, the solutions often detect scanned PDFs as textual PDFs, most often because of the media boxes present in the documents.

As funny as it may look, the following code proved to be more accurate for my use-case:

extracted_text = ''.join([page.getText() for page in fitz.open(path)])
doc_type = "text" if extracted_text else "scan"

Make sure to install fitz and PyMuPDF beforehand, though:

pip install fitz PyMuPDF

Brazilin answered 15/9, 2021 at 16:1 Comment(2)
Please explain where the keyword "text" comes from and what else can be used? What are the results from testing on scanned in text+image vs norm PDF text with image?Abscind
If the PDF contains any actual text at all (as opposed to text in a scanned image), doc_type will be "text". It doesn't signify anything more than that. If you test this code on a scanned PDF, extracted_text will be empty, so doc_type = "scan"Brazilin
I re-modified the code from @Vikas Goel, but in a very few cases it does not give a decent result:

def get_pdf_searchable_pages(fname):
    """Identify a digitally created PDF vs. a scanned PDF"""
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num == 0:
        return "Not a valid document"
    elif page_num == len(searchable_pages):
        return "searchable_pages"
    else:
        return "non_searchable_pages"
Glide answered 5/8, 2021 at 9:57 Comment(0)
You can use ocrmypdf; it has a skip_text parameter to skip pages that already contain text.

more info here: https://ocrmypdf.readthedocs.io/en/latest/advanced.html

ocrmypdf.ocr(file_path, save_path, rotate_pages=True, remove_background=False, language=language, deskew=False, force_ocr=False, skip_text=True)
Zettazeugma answered 29/4, 2021 at 10:5 Comment(0)
If you only need to know whether a PDF is all images or not, here is another version to do this with PyMuPDF:

import fitz

my_pdf = r"C:\Users\Test\FileName.pdf"
doc = fitz.open(my_pdf)

def pdftype(doc):
    i = 0
    for page in doc:
        if len(page.getText()) > 0:  # for a scanned page it will be 0
            i += 1
    if i > 0:
        print('full/partial text PDF file')
    else:
        print('only scanned images in PDF file')

pdftype(doc)
Imperator answered 13/1, 2022 at 4:24 Comment(1)
Could you explain your answer? What is the problem and how did you fix it?Bobolink
If your digital PDFs have a table of contents, you can use doc.get_toc() from PyMuPDF. As far as I'm aware, the scanned PDFs will never have a table of contents. There's no guarantee the digital ones will though, so it really depends on the context.

Carrico answered 31/3, 2023 at 8:17 Comment(0)
You could write some code which checks, per document, whether any page has extractable text with pypdf, like this:

import pypdf

file_paths = ['sample-2.pdf', 'image-based-pdf-sample.pdf']

for doc in file_paths:
    reader = pypdf.PdfReader(doc)
    total_words = 0
    for page in reader.pages:
        text = page.extract_text()
        total_words += len(text.split())
    if total_words > 0:
        print(f"This document has text: {doc}")
    else:
        print(f"This document has images: {doc}")


Output:

This document has text: sample-2.pdf
This document has images: image-based-pdf-sample.pdf

Jaddo answered 19/9, 2023 at 12:42 Comment(0)