extracting text using flags to focus on bold / italic font using PyMUPDF

I am trying to extract bold text elements from PDFs using PyMUPDF 1.18.14. I was hoping that this would work as I understand from the docs that flags=4 targets bold font.

page = doc[1]
text = page.get_text(flags=4)
print(text)

But it prints out all text on the page and not just bold text.

When using the TextPage.extractDICT() (or Page.get_text(“dict”)) like this:-

page.get_text("dict", flags=11)["blocks"]

The flag works but I am having trouble understanding what it is doing. Maybe switching between image and text blocks.

Span

So it seems you have to get to the span to be able to access flags.

<page>
    <text block>
        <line>
            <span>
                <char>
    <image block>
        <img>

So you can then do something like this, I used flags=20 on the span tag to get the bold font.

page = doc[1]
blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            if s["flags"] == 20:  # 20 targets bold
                print(s)

But this seems like a long away around.

So my question is this the best way of finding bold elements or is there something I am missing?

Would be great to be able to search for bold elements using page.search_for()

Recommended topics

Hot tags