How does one obtain the location of text in a PDF with PDFMiner? [duplicate]

PDFMiner's documentation says:

PDFMiner allows one to obtain the exact location of text in a page

However, I have not been able to work out how to do this, and PDFMiner's documentation is rather sparse on the subject.

Christoper answered 11/8, 2014 at 16:35 Comment(0)

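You are looking for the bbox property found on the layout objects. It is a tuple (x0, y0, x1, y1) in PDF coordinates, where the origin is the bottom-left corner of the page. There is a little information on how to parse the layout hierarchy in the PDFMiner documentation, but it doesn't cover everything.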

Here's an example:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure


def parse_layout(layout):
    """Function to recursively parse the layout tree."""
    for lt_obj in layout:
        print(lt_obj.__class__.__name__)
        print(lt_obj.bbox)
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            print(lt_obj.get_text())
        elif isinstance(lt_obj, LTFigure):
            parse_layout(lt_obj)  # Recursive


# Keep the file open while pages are read, and close it afterwards.
with open('example.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()  # layout tree of the page just processed
        parse_layout(layout)

If you are interested in the location of individual LTChar objects, you can recursively parse into the child layout objects of LTTextBox and LTTextLine just like what is done with LTFigure in the above example.
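
As a rough sketch of that recursion, the generator below walks the whole tree in EAFP style and yields the text and bbox of each character. It is duck-typed rather than tied to the PDFMiner classes, and it rests on assumptions not stated above: that containers are iterable while leaf objects are not, and that LTChar is the only leaf with both get_text() and bbox (LTAnno, the spacing that the layout analyser inserts, has text but no bbox). The name iter_chars is made up for this illustration.

```python
def iter_chars(layout_obj):
    """Yield (text, bbox) for every leaf object that has both text and a bbox.

    Assumption: containers (pages, figures, text boxes, text lines) are
    iterable, while leaves (characters, annotations, lines, rectangles) are not.
    """
    try:
        children = list(layout_obj)
    except TypeError:
        children = None  # not iterable, so treat it as a leaf

    if children is None:
        get_text = getattr(layout_obj, "get_text", None)
        bbox = getattr(layout_obj, "bbox", None)
        # Skip leaves without text (e.g. rectangles) or without a position
        # (e.g. the spacing annotations added by layout analysis).
        if get_text is not None and bbox is not None:
            yield get_text(), bbox
        return

    for child in children:
        yield from iter_chars(child)
```

With the loop from the example, iterating over iter_chars(layout) in place of calling parse_layout(layout) would give one (character, bbox) pair per glyph on the page.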

Pressure answered 12/8, 2014 at 10:53 Comment(9)
1) Could you explain what LAParams() does, please? 2) Isn't it more Pythonic to try to get the text and then try to recurse, rather than using isinstance? - Christoper
Aren't there other types of containers besides LTFigure? - Christoper
LAParams contains the parameters used for the layout analysis that merges characters into words and lines based on their locations. You can pass initialization parameters like line_overlap, char_margin, line_margin, word_margin, boxes_flow, detect_vertical. See the PDFMiner docs for explanations and default values. - Pressure
Other than LTFigure there's also LTTextBox, which contains LTTextLine, which in turn contains LTChar and LTAnno. The PDFMiner docs have a diagram of the hierarchy. - Pressure
Things seem to work without passing LAParams, so why is it needed? Isn't it more Pythonic to use EAFP rather than isinstance? - Christoper
LAParams is really just a way to modify the parameters used by the layout analyser. It's good practice to pass it to PDFPageAggregator even if you just use the default parameters, because otherwise some of the layout analysis may not be performed. You probably can make my parse_layout function more Pythonic. Every LT* object should be iterable even if it doesn't have any children, so the LTFigure isinstance check is probably unnecessary. Similarly, you could just attempt get_text() for all and catch the failure if it's not implemented on that LT* object. - Pressure
Is there any way to parse just the first LTTextBox of each page? (Actually, I want the box header.) - Lemmie
What's your basis for thinking that recursing into LTFigures like this works? Over at https://mcmap.net/q/322126/-how-to-extract-text-and-text-coordinates-from-a-pdf-file, I claim it's broken because an LTFigure cannot contain an LTTextBox... but if I'm wrong, I'd appreciate you proving me so. - Freedman
Rather than using LTTextBox, is there another parameter that will just find the coordinates of individual words? - Envenom
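
On the last question, one possible approach (a sketch, not a built-in PDFMiner feature) is to group the children of each LTTextLine into words yourself. The code below assumes that every child of a text line has get_text(), that real glyphs also carry a bbox, and that word breaks show up either as whitespace glyphs or as bbox-less LTAnno objects inserted by the layout analysis. The names union_bbox and iter_words are made up for this illustration.

```python
def union_bbox(boxes):
    """Return the smallest (x0, y0, x1, y1) box enclosing all given boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def iter_words(text_line):
    """Yield (word, bbox) for each run of non-whitespace glyphs in a text line."""
    letters, boxes = [], []
    for item in text_line:
        text = item.get_text()
        bbox = getattr(item, "bbox", None)  # annotations have no bbox
        if bbox is not None and not text.isspace():
            # A positioned, visible glyph: extend the current word.
            letters.append(text)
            boxes.append(bbox)
        elif letters:
            # A separator ends the current word, if there is one.
            yield "".join(letters), union_bbox(boxes)
            letters, boxes = [], []
    if letters:
        # Flush the final word when the line does not end with a separator.
        yield "".join(letters), union_bbox(boxes)
```

Each LTTextLine found while walking the layout tree could be passed to iter_words. Where the spaces fall depends on the layout analysis, so the word_margin value given to LAParams would influence the result.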
