PDFMiner's documentation says:
PDFMiner allows one to obtain the exact location of text in a page
However, I have not been able to work out how to do this, and PDFMiner's documentation is rather sparse on the subject.
You are looking for the bbox property on every layout object. There is a little bit of information on how to parse the layout hierarchy in the PDFMiner documentation, but it doesn't cover everything.
Here's an example:
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure


def parse_layout(layout):
    """Function to recursively parse the layout tree."""
    for lt_obj in layout:
        print(lt_obj.__class__.__name__)
        print(lt_obj.bbox)
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            print(lt_obj.get_text())
        elif isinstance(lt_obj, LTFigure):
            parse_layout(lt_obj)  # Recursive


# Keep the file open while pages are processed; it is closed on exit.
with open('example.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        parse_layout(layout)
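A note on reading the printed values: as far as I know, each bbox is a tuple (x0, y0, x1, y1) in PDF points with the origin at the bottom-left corner of the page, so y grows upward. If you need a top-left origin (as most image libraries use), here is a minimal sketch. flip_bbox is a hypothetical helper name of mine, not part of PDFMiner, and I am assuming page_height is taken from the page layout object (layout.height in the example above).

```python
def flip_bbox(bbox, page_height):
    """Convert (x0, y0, x1, y1) from a bottom-left origin to a top-left origin."""
    x0, y0, x1, y1 = bbox
    # The old top edge (y1) becomes the new, smaller y coordinate.
    return (x0, page_height - y1, x1, page_height - y0)


# Example values are made up: a text line near the top of a 792 pt tall page.
print(flip_bbox((72.0, 700.0, 200.0, 712.0), page_height=792.0))
```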
If you are interested in the location of individual LTChar objects, you can recursively parse into the child layout objects of LTTextBox and LTTextLine just like what is done with LTFigure in the above example.
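A rough sketch of that recursion, written with duck typing rather than isinstance checks so it can be tried without a PDF at hand. iter_chars is a hypothetical name of mine. It relies on two assumptions about PDFMiner that I believe hold but have not checked against every version: container objects are iterable while leaves such as LTChar are not, and LTAnno (the spaces and newlines inserted by the layout analyser) has get_text() but no bbox.

```python
def iter_chars(lt_obj):
    """Yield (text, bbox) for every character-like leaf under lt_obj."""
    try:
        children = iter(lt_obj)
    except TypeError:
        # Leaf object: keep it only if it has both text and a position.
        # Objects with text but no bbox (assumed: LTAnno) are skipped, as are
        # objects with a bbox but no text (lines, rectangles, images).
        if hasattr(lt_obj, "get_text") and hasattr(lt_obj, "bbox"):
            yield lt_obj.get_text(), lt_obj.bbox
        return
    # Container object: descend into each child in order.
    for child in children:
        yield from iter_chars(child)
```

With the example above, you would loop over iter_chars(layout) where parse_layout(layout) is called and print each (text, bbox) pair.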
... LTFigure there's also LTTextBox, which contains LTTextLine, which in turn contains LTChar and LTAnno. The PDFMiner docs have a diagram of the hierarchy. – Pressure
LAParams is really just a way to modify the parameters used by the layout analyser. It's good practice to pass it to PDFPageAggregator even if you just use the default parameters, because otherwise some of the layout analysis may not be performed. You can probably make my parse_layout function more Pythonic. Every LT* object should be iterable even if it doesn't have any children, so the LTFigure isinstance check is probably unnecessary. Similarly, you could just attempt get_text() for all and catch the failure if it's not implemented on that LT* object. – Pressure
... LTFigures like this works? Over at https://mcmap.net/q/322126/-how-to-extract-text-and-text-coordinates-from-a-pdf-file, I claim it's broken because an LTFigure cannot contain an LTTextBox ... but if I'm wrong, I'd appreciate you proving me so. – Freedman
... LTTextBox, is there another parameter that will just find coordinates for individual words? – Envenom
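On the word-coordinates question: I don't know of a layout parameter that returns word-level boxes directly, so one workaround is to merge character boxes yourself. The following is only a sketch under stated assumptions. group_words is a hypothetical helper, and it expects (text, bbox) pairs collected from the children of a single LTTextLine, with None as the bbox for anything that has no position (which I assume is the case for LTAnno).

```python
def group_words(chars):
    """Merge (text, bbox) character pairs into (word, bbox) pairs."""
    words = []
    text, box = "", None
    for ch, bbox in chars:
        if bbox is None or ch.isspace():
            # Whitespace or a position-less item ends the current word.
            if text:
                words.append((text, box))
            text, box = "", None
            continue
        x0, y0, x1, y1 = bbox
        text += ch
        if box is None:
            box = (x0, y0, x1, y1)
        else:
            # Grow the word box so it also covers the new character.
            box = (min(box[0], x0), min(box[1], y0),
                   max(box[2], x1), max(box[3], y1))
    if text:
        # Flush the last word if the line did not end in whitespace.
        words.append((text, box))
    return words
```

Splitting on whitespace is the simplest possible rule. It will not separate words that the PDF draws without an intervening space, so treat the result as approximate.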