I am trying to extract bold text elements from PDFs using PyMUPDF 1.18.14. I was hoping that this would work as I understand from the docs that flags=4
targets bold font.
page = doc[1]
text = page.get_text(flags=4)
print(text)
But it prints out all text on the page and not just bold text.
When using the TextPage.extractDICT() (or Page.get_text(“dict”))
like this:-
page.get_text("dict", flags=11)["blocks"]
The flag works but I am having trouble understanding what it is doing. Maybe switching between image and text blocks.
Span
So it seems you have to get to the span
to be able to access flags.
<page>
<text block>
<line>
<span>
<char>
<image block>
<img>
So you can then do something like this, I used flags=20
on the span tag to get the bold font.
page = doc[1]
blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks: # iterate through the text blocks
for l in b["lines"]: # iterate through the text lines
for s in l["spans"]: # iterate through the text spans
if s["flags"] == 20: # 20 targets bold
print(s)
But this seems like a long away around.
So my question is this the best way of finding bold elements or is there something I am missing?
Would be great to be able to search for bold elements using page.search_for()