Issues with PyMuPDF extracting plain text

I want to read in a PDF file using PyMuPDF. All I need is plain text (no need to extract info on color, fonts, tables etc.).

I have tried the following

import fitz
from fitz import TextPage
ifile = "C:\\user\\docs\\aPDFfile.pdf"
doc = TextPage(ifile)
>>> TypeError: in method 'new_TextPage', argument 1 of type 'struct fz_rect_s *'

That doesn't work, so then I tried:

doc = fitz.Document(ifile)
t = TextPage.extractText(doc)
>>> AttributeError: 'Document' object has no attribute '_extractText'

which again doesn't work.

Then I found a great blog from one of the authors of PyMuPDF which has detailed code on extracting text in the order it is read from the file. But every time I run this code with a different PDF I get KeyError: 'lines' (line 81 in the code) or KeyError: "bbox" (line 60 in the code).

I can't post the PDFs here because they are confidential, though I appreciate they would be useful to have. But is there any way I can just do the simplest task which PyMuPDF is meant to do: extract plain text from a PDF, un-ordered or otherwise (I don't mind much)?

Ceilidh answered 4/6, 2018 at 14:5 Comment(1)
Check that the key you are looking for (e.g., lines or bbox) is in the dictionary (e.g., a block) before accessing that key. - Chinatown
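
To make that suggestion concrete, here is a minimal sketch. It assumes the blog's code walks a dictionary shaped like the one returned by page.get_text("dict"), where image blocks carry no "lines" key. The helper name iter_line_texts and the hand-built sample dictionary are hypothetical, made up purely for illustration.

```python
def iter_line_texts(page_dict):
    """Yield the text of each line, tolerating blocks that lack "lines"."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list
        # instead of indexing block["lines"] and risking a KeyError.
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            yield "".join(span.get("text", "") for span in spans)


# Hand-built stand-in for a page dictionary (not taken from a real PDF).
sample = {
    "blocks": [
        {"type": 0, "bbox": (0, 0, 100, 20),
         "lines": [{"spans": [{"text": "Hello "}, {"text": "world"}]}]},
        {"type": 1, "bbox": (0, 30, 100, 90)},  # image block: no "lines"
    ]
}

print(list(iter_line_texts(sample)))
```

Using dict.get with a default keeps the loop going when a block is missing a key, which is probably what trips up the blog code on PDFs that contain images.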

Following your example, plain text can be extracted with PyMuPDF like this:

import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        # getText() is the older camelCase name; newer releases call it get_text()
        text += page.getText()
print(text)

The blog you followed is great but a little outdated; some of the methods it uses are deprecated.
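
If the same script has to run against both older and newer PyMuPDF releases, one possible workaround is to look the method up by name. This is only a sketch: page_text is a hypothetical helper, and OldPage and NewPage are dummy stand-ins for a real page object so the snippet runs without a PDF.

```python
def page_text(page):
    """Return plain text from a page, whichever method name it exposes."""
    # Prefer the snake_case name, then fall back to the older camelCase one.
    extract = getattr(page, "get_text", None) or getattr(page, "getText", None)
    if extract is None:
        raise AttributeError("page object has neither get_text nor getText")
    return extract()


class OldPage:
    """Dummy stand-in for a page from an older release."""
    def getText(self):
        return "old"


class NewPage:
    """Dummy stand-in for a page from a newer release."""
    def get_text(self):
        return "new"


print(page_text(OldPage()), page_text(NewPage()))
```

With a real document you would call page_text(page) inside the for page in doc loop instead of calling the method directly.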

Fed answered 14/1, 2019 at 10:17 Comment(0)

Message from the repo maintainer:

The easiest way to extract plain text but still do at least basic ordering is

blocks = page.get_text("blocks")
blocks.sort(key=lambda block: block[1])  # sort vertically ascending

for b in blocks:
    print(b[4])  # the text part of each block

In newer versions (1.19.x and later), the above is even simpler: just do text = page.get_text(sort=True). It returns the full page's text as a string in basic reading order, from top-left to bottom-right.
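
For anyone on an older version who wants the same ordering plus a left-to-right tie-break, here is a rough sketch of the block-sorting idea. It assumes each block is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 for text and 1 for images. The function name blocks_to_text and the sample tuples are hypothetical and hand-built, not taken from a real PDF.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, then left-to-right order."""
    # Keep only text blocks (block_type == 0); image blocks have no real text.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Sort by the top edge (y0) first, then by the left edge (x0).
    ordered = sorted(text_blocks, key=lambda b: (b[1], b[0]))
    return "\n".join(b[4].strip() for b in ordered)


# Hand-built stand-ins for the tuples returned by page.get_text("blocks").
sample_blocks = [
    (300.0, 100.0, 500.0, 120.0, "Right\n", 0, 0),
    (50.0, 100.0, 250.0, 120.0, "Left\n", 1, 0),
    (50.0, 20.0, 500.0, 40.0, "Title\n", 2, 0),
    (50.0, 200.0, 500.0, 400.0, "<image>", 3, 1),
]

print(blocks_to_text(sample_blocks))
```

Sorting on (y0, x0) only approximates reading order; multi-column layouts would likely need something smarter.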

Highbinder answered 11/6, 2020 at 13:44 Comment(0)
import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()

print(text)
Lowerclassman answered 11/12, 2021 at 3:13 Comment(1)
Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, can you edit your answer to include an explanation of what you're doing and why you believe it is the best approach? - Shulins

Use the snake_case method name get_text() (all lowercase, with an underscore) instead of getText():

import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()
print(text)

This should work for you.

Kingery answered 11/10, 2022 at 9:11 Comment(0)
