How does one obtain the location of text in a PDF with PDFMiner? [duplicate] - McMap

Asked 11/8, 2014 at 16:35 Answered 12/8, 2014 at 10:53

python pdf position pdfminer

PDFMiner's documentation says:

PDFMiner allows one to obtain the exact location of text in a page

However, I have not been able to find out how to do this. PDFMiner's documentation is rather sparse and does not seem to explain it.

Christoper answered 11/8, 2014 at 16:35 Comment(0)

You are looking for the bbox property on every layout object. There is a little bit of information on how to parse the layout hierarchy in the PDFMiner documentation, but it doesn't cover everything.

Here's an example:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure


def parse_layout(layout):
    """Function to recursively parse the layout tree."""
    for lt_obj in layout:
        print(lt_obj.__class__.__name__)
        print(lt_obj.bbox)
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            print(lt_obj.get_text())
        elif isinstance(lt_obj, LTFigure):
            parse_layout(lt_obj)  # Recursive


with open('example.pdf', 'rb') as fp:  # Close the file even if parsing fails
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Pages are read lazily, so the loop must stay inside the with block.
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()  # LTPage for the page just processed
        parse_layout(layout)
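
A note on reading the printed bbox values: each one is a tuple (x0, y0, x1, y1) in PDF points, measured from the bottom-left corner of the page, so y grows upward. Image tools usually put the origin at the top-left instead. Here is a minimal sketch of the conversion; the helper name flip_bbox and the sample numbers are made up for illustration.

```python
def flip_bbox(bbox, page_height):
    """Convert an (x0, y0, x1, y1) box from a bottom-left to a top-left origin."""
    x0, y0, x1, y1 = bbox
    # After flipping, the old top edge (y1) becomes the smaller y value.
    return (x0, page_height - y1, x1, page_height - y0)


# Made-up box on a US Letter page (612 x 792 points).
print(flip_bbox((72.0, 700.0, 300.0, 712.0), page_height=792.0))
```

Inside the page loop, the page height should be available as layout.height, since the LTPage returned by get_result() carries its own bbox.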

If you are interested in the location of individual LTChar objects, recurse into the children of LTTextBox and LTTextLine the same way the example above recurses into LTFigure.
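
A rough sketch of that per-character walk follows. It is duck-typed rather than tied to the PDFMiner classes: anything iterable is treated as a container, and any leaf with both bbox and get_text() is treated as a character. The function name iter_chars is hypothetical, and I am assuming that LTChar is not iterable and that LTAnno has no bbox.

```python
def iter_chars(layout_obj):
    """Yield (text, bbox) for every character-like leaf under layout_obj."""
    for child in layout_obj:
        if hasattr(child, "__iter__"):
            # Containers (pages, figures, text boxes, text lines): descend.
            yield from iter_chars(child)
        elif hasattr(child, "bbox") and hasattr(child, "get_text"):
            # Leaves with text and a position. Objects that carry text but
            # no bbox (assumed: LTAnno spaces and newlines) are skipped.
            yield child.get_text(), child.bbox
```

In the page loop above you would call it as `for text, bbox in iter_chars(layout): print(text, bbox)`. Because it descends into every container, it also picks up characters that sit directly inside a figure.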

Pressure answered 12/8, 2014 at 10:53 Comment(9)

1) Could you explain what LAParams() does, please? 2) Isn't it more pythonic to try to get text and then try to recurse rather than using isinstance? – Christoper 12/8, 2014 at 16:27

Aren't there container types other than LTFigure? – Christoper 12/8, 2014 at 16:28

LAParams contains the parameters used for the layout analysis that merges characters into words and lines based on their locations. You can pass initialization parameters like line_overlap, char_margin, line_margin, word_margin, boxes_flow, detect_vertical. See PDFMiner docs for explanation and default values. – Pressure 12/8, 2014 at 16:38
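
For illustration, passing those parameters explicitly might look like the sketch below. The names LAYOUT_SETTINGS and make_aggregator are hypothetical, and the numbers are what I understand the defaults to be, so check them against the documentation for your version before relying on them.

```python
# Assumed starting values (believed to match the library defaults).
LAYOUT_SETTINGS = {
    "line_overlap": 0.5,       # vertical overlap needed to join chars into a line
    "char_margin": 2.0,        # max horizontal gap between chars of one line
    "line_margin": 0.5,        # max vertical gap between lines of one text box
    "word_margin": 0.1,        # gap above which a space is inserted
    "boxes_flow": 0.5,         # weighting of horizontal vs vertical box order
    "detect_vertical": False,  # whether to look for vertical text
}


def make_aggregator(**overrides):
    """Build a PDFPageAggregator using LAYOUT_SETTINGS plus any overrides."""
    # Imported here so the settings above can be inspected on their own.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager

    laparams = LAParams(**{**LAYOUT_SETTINGS, **overrides})
    rsrcmgr = PDFResourceManager()
    return rsrcmgr, PDFPageAggregator(rsrcmgr, laparams=laparams)
```

Usage would be along the lines of `rsrcmgr, device = make_aggregator(detect_vertical=True)`, followed by `PDFPageInterpreter(rsrcmgr, device)` as in the answer.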

Other than LTFigure there's also LTTextBox that contains LTTextLine which in turn contains LTChar and LTAnno. The PDFMiner docs have a diagram of the hierarchy. – Pressure 12/8, 2014 at 16:39

Things seem to work without passing LAParams, so why is it needed? Isn't it more Pythonic to use EAFP rather than isinstance? – Christoper 12/8, 2014 at 17:1

LAParams is really just a way to modify the parameters used by the layout analyser. It's good practice to pass it to PDFPageAggregator even if you just use the default parameters, because otherwise some of the layout analysis may not be performed. You can probably make my parse_layout function more Pythonic. Every LT* object should be iterable even if it doesn't have any children, so the LTFigure isinstance check is probably unnecessary. Similarly, you could just attempt get_text() for all and catch the failure if it's not implemented on that LT* object. – Pressure 13/8, 2014 at 12:6
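
A possible EAFP ("easier to ask forgiveness than permission") version of parse_layout, as a sketch only. The names describe and walk_layout are hypothetical. It does not rely on every object being iterable: a leaf that cannot be iterated raises TypeError and is simply not descended into.

```python
def describe(lt_obj):
    """Return (class name, bbox or None, text or None) for one layout object."""
    try:
        text = lt_obj.get_text()
    except AttributeError:
        # Object has no get_text(), e.g. a line or rectangle.
        text = None
    return (type(lt_obj).__name__, getattr(lt_obj, "bbox", None), text)


def walk_layout(layout):
    """Yield a description of every object in the tree, parents before children."""
    for lt_obj in layout:
        yield describe(lt_obj)
        try:
            children = iter(lt_obj)
        except TypeError:
            continue  # Leaf object: nothing to descend into.
        yield from walk_layout(children)
```

Note that this descends into text boxes and lines as well, so the same text shows up once per level (box, line, character). Catching AttributeError around get_text() would also hide an AttributeError raised inside it, which is the usual trade-off with this style.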

Is there any way to parse just the first LTTextBox of each page? (Actually, I want the box header.) – Lemmie 24/1, 2018 at 21:38

What's your basis for thinking that recursing into LTFigures like this works? Over at https://mcmap.net/q/322126/-how-to-extract-text-and-text-coordinates-from-a-pdf-file, I claim it's broken because an LTFigure cannot contain an LTTextBox... but if I'm wrong, I'd appreciate being shown why. – Freedman 18/11, 2018 at 11:44

Rather than using LTTextBox, is there another parameter that will just find coordinates for individual words? – Envenom 4/12, 2019 at 20:17
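
I am not aware of a word-level layout object, but one workaround is to group the characters of each text line yourself. The sketch below assumes the children of an LTTextLine are LTChar objects (text plus bbox) and LTAnno objects (whitespace without bbox), as described in the comments above. The names merge_chars and iter_words are hypothetical.

```python
def merge_chars(chars):
    """Join characters into one word and take the union of their boxes."""
    word = "".join(c.get_text() for c in chars)
    x0 = min(c.bbox[0] for c in chars)
    y0 = min(c.bbox[1] for c in chars)
    x1 = max(c.bbox[2] for c in chars)
    y1 = max(c.bbox[3] for c in chars)
    return word, (x0, y0, x1, y1)


def iter_words(text_line):
    """Yield (word, bbox) pairs from the children of one text line."""
    chars = []
    for child in text_line:
        if child.get_text().strip() and hasattr(child, "bbox"):
            chars.append(child)  # Visible character: extend the current word.
            continue
        if chars:  # Whitespace or an object without a position ends the word.
            yield merge_chars(chars)
            chars = []
    if chars:
        yield merge_chars(chars)
```

You would call iter_words on each LTTextLine found while walking the layout. How well the words come out depends on where the layout analysis decided to insert spaces, which the word_margin parameter influences.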
