extracting text using flags to focus on bold / italic font using PyMUPDF
Asked Answered
L

1

6

I am trying to extract bold text elements from PDFs using PyMUPDF 1.18.14. I was hoping that this would work as I understand from the docs that flags=4 targets bold font.

page = doc[1]
text = page.get_text(flags=4)
print(text)

But it prints out all text on the page and not just bold text.

When using the TextPage.extractDICT() (or Page.get_text(“dict”)) like this:-

page.get_text("dict", flags=11)["blocks"]

The flag works but I am having trouble understanding what it is doing. Maybe switching between image and text blocks.

Span

So it seems you have to get to the span to be able to access flags.

<page>
    <text block>
        <line>
            <span>
                <char>
    <image block>
        <img>

enter image description here

So you can then do something like this, I used flags=20 on the span tag to get the bold font.

page = doc[1]
blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            if s["flags"] == 20:  # 20 targets bold
                print(s)

But this seems like a long away around.

So my question is this the best way of finding bold elements or is there something I am missing?

Would be great to be able to search for bold elements using page.search_for()

Lupita answered 14/7, 2021 at 17:42 Comment(1)
I am looking to do something similar. Your Q was surprisingly useful in helping me navigate the documentation. So thanks. From what I can see there is no simpler way.Blume
H
0

I am also trying to extract bold text from pdfs and I tried your code and I found out that in some pdfs the bold text may not always have a flag value of 20, instead the flag value I found was 16. I might be confusing something as pymupdf is new to me.

Hocker answered 29/3 at 8:24 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Solifluction

© 2022 - 2024 — McMap. All rights reserved.