Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.
You need to install the fitz and PyMuPDF modules. You can do so with pip.
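The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. Moreover, it is important to install fitz BEFORE PyMuPDF, otherwise you get a weird error about a missing frontend module (most likely because the fitz package on PyPI is an unrelated project that shares its import name with PyMuPDF, so whichever is installed last is the one that gets imported). So here is how I install the modules: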
    pip3 install fitz
    pip3 install PyMuPDF==1.16.14
And here is the Python 3 implementation:
    import fitz


    def get_text_percentage(file_name: str) -> float:
        """
        Calculate the percentage of document that is covered by (searchable) text.

        If the returned percentage of text is very low, the document is
        most likely a scanned PDF
        """
        total_page_area = 0.0
        total_text_area = 0.0

        doc = fitz.open(file_name)

        for page_num, page in enumerate(doc):
            total_page_area = total_page_area + abs(page.rect)
            text_area = 0.0
            for b in page.getTextBlocks():
                r = fitz.Rect(b[:4])  # rectangle where block text appears
                text_area = text_area + abs(r)
            total_text_area = total_text_area + text_area
        doc.close()
        return total_text_area / total_page_area


    if __name__ == "__main__":
        text_perc = get_text_percentage("my.pdf")
        print(text_perc)
        if text_perc < 0.01:
            print("fully scanned PDF - no relevant text")
        else:
            print("not fully scanned PDF - text is present")
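To make the ratio concrete, here is a minimal sketch of the same arithmetic in plain Python, without PyMuPDF. The names box_area and text_fraction and the sample page and block coordinates are made up for illustration; each box is an (x0, y0, x1, y1) tuple, in the same order as the first four items of a text block above.

```python
from typing import Iterable, Tuple

# (x0, y0, x1, y1) in points, same order as the first four items of a text block
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned rectangle; degenerate boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def text_fraction(page_box: Box, block_boxes: Iterable[Box]) -> float:
    """Summed block area divided by page area (overlapping blocks count twice)."""
    page_area = box_area(page_box)
    if page_area == 0.0:
        return 0.0  # avoid dividing by zero on an empty page
    return sum(box_area(b) for b in block_boxes) / page_area


# Hypothetical A4 page (595 x 842 pt) carrying two text blocks.
page = (0.0, 0.0, 595.0, 842.0)
blocks = [(50.0, 50.0, 545.0, 150.0), (50.0, 200.0, 545.0, 400.0)]
print(round(text_fraction(page, blocks), 3))
```

Note that, like the function above, this simply adds up block areas, so overlapping blocks would be counted more than once. The zero-area guard is an addition of this sketch; the original function would raise a ZeroDivisionError on a document without pages.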
Although this answers your question (i.e. it distinguishes fully scanned PDFs from fully or partially textual ones), this solution cannot distinguish fully textual PDFs from scanned PDFs that also contain text. That is the case for scanned PDFs processed by OCR software (such as pdfsandwich or Adobe Acrobat), which adds "invisible" text blocks on top of the image so that you can select the text.
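One possible way around that limitation, which I have not tested on real files, is to also look at how much of each page is covered by images: an OCR'd scan typically has an image over nearly the whole page plus a text layer. The sketch below only shows the decision logic. It assumes you already have, for each page, the fraction covered by text and the fraction covered by images (how to get the latter from your PDF library is not shown here). The function name classify_page and the 0.9 image threshold are hypothetical; the 0.01 text threshold is the one used above.

```python
def classify_page(text_frac: float, image_frac: float,
                  min_text: float = 0.01, min_image: float = 0.9) -> str:
    """Rough page label from text and image coverage (both in the range 0..1).

    min_text is the threshold used in the answer above; min_image is an
    assumed value that would need tuning on real documents.
    """
    has_text = text_frac >= min_text
    mostly_image = image_frac >= min_image
    if mostly_image and has_text:
        return "scanned with OCR text layer"
    if mostly_image:
        return "scanned"
    if has_text:
        return "text"
    return "empty"


print(classify_page(text_frac=0.30, image_frac=0.00))
print(classify_page(text_frac=0.30, image_frac=0.98))
print(classify_page(text_frac=0.00, image_frac=0.98))
```

A textual PDF with a large figure or a full-page background image would still fool this heuristic, so treat it as a starting point only.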