How to get the background color of text in PyMuPDF

I am trying to identify possible table headers in a table inside a PDF using the background and foreground colors of the text. With PyMuPDF text extraction I was able to get the foreground color. Is there a way to get the background color too?

I am using PyMuPDF 1.16.2 with Python 3.7. I have checked the documentation but could find only one color field, which holds the text color, not the background color.

If anyone knows how to get the background color using PyMuPDF, or maybe some other package, please let me know.

International answered 26/9, 2019 at 6:30 Comment(0)

I needed a similar function but couldn't find one in PyMuPDF, so I wrote a function that renders the page and reads the color of the pixel at the top-left corner of each text block's bounding box.

import io

import fitz  # PyMuPDF
from PIL import Image


def getText2(page: fitz.Page, zoom_f=3) -> dict:
    """
    Similar to fitz.Page.getText("dict"), but each text block in "blocks"
    also gets a "bg_color" key holding an RGB tuple of floats in 0..1.

    Uses the PyMuPDF 1.16 method names; later releases renamed them to
    get_text, get_pixmap and Pixmap.tobytes.
    """
    # Retrieve the content of the page
    all_words = page.getText("dict")

    # Render the page at zoom_f and load it as a PIL.Image
    mat = fitz.Matrix(zoom_f, zoom_f)
    pixmap = page.getPixmap(mat)
    img = Image.open(io.BytesIO(pixmap.getPNGData())).convert("RGB")
    img_border = fitz.Rect(0, 0, img.width, img.height)
    for block in all_words['blocks']:
        # Only text blocks (type 0) are annotated
        if block['type'] == 0:
            # Scale the bbox from page coordinates to image pixels
            rect = fitz.Rect(*tuple(xy * zoom_f for xy in block['bbox']))
            if img_border.contains(rect):
                # getpixel needs integer coordinates
                color = img.getpixel((int(rect.x0), int(rect.y0)))
                block['bg_color'] = tuple(c / 255 for c in color)
    return all_words
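
A single corner pixel can land on a cell border or an anti-aliased edge instead of the fill. One possible variation, not part of the function above, is to look at several pixels from the block's region and keep the most frequent color. The sketch below is library-free: `dominant_color` and `sample` are hypothetical names, and the pixel list stands in for the pixels of a cropped region of the rendered image (for example from Pillow's `Image.crop`).

```python
from collections import Counter


def dominant_color(pixels):
    """Return the most frequent color among RGB(A) tuples, scaled to 0..1.

    Returns None when no pixels are given.
    """
    # Drop any alpha channel so RGB and RGBA pixels are counted alike
    counts = Counter(tuple(p[:3]) for p in pixels)
    if not counts:
        return None
    rgb, _ = counts.most_common(1)[0]
    return tuple(c / 255 for c in rgb)


# Hypothetical 3x2 crop: dark border pixels at two corners, light gray fill elsewhere
sample = [(0, 0, 0), (204, 204, 204), (204, 204, 204),
          (204, 204, 204), (204, 204, 204), (51, 51, 51)]
print(dominant_color(sample))
```

Here the two dark pixels are outvoted by the gray fill, so the border no longer decides the result.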
Resound answered 15/1, 2020 at 13:43 Comment(0)

In my case, the background colors came from filled rectangles behind the text. You can get all "drawings", including rectangles, using this page method:

paths = page.get_drawings()

The PyMuPDF documentation on extracting drawings describes how to get the rectangles out of the returned paths.

You can use the bounding box coordinates of the rectangles to determine which rectangle is behind the text you are interested in.

One slight complication is that the bounding box coordinates of the rectangles can be off by a few picas. If you require the text bounding box to lie entirely inside the rectangle bounding box, a lot of text that visibly has a background color will not match any rectangle.

In my case, I just required that the center of the text bounding box be inside the rectangle.
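
A minimal sketch of that center test, under a few assumptions: `fill_behind` and `center` are hypothetical helper names, each entry of `drawings` is a dict with a `"rect"` of four numbers and a `"fill"` color (or `None`), which is the shape I would expect from `page.get_drawings()`, and the list is in paint order so that a later match overrides an earlier one. The sample data is made up for illustration.

```python
def center(bbox):
    """Return the (x, y) midpoint of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def fill_behind(text_bbox, drawings):
    """Return the fill color of the rectangle behind text_bbox, or None.

    Only the center of the text box has to fall inside a filled
    rectangle, which tolerates slightly misaligned edges.
    """
    cx, cy = center(text_bbox)
    found = None
    for path in drawings:
        fill = path.get("fill")
        if fill is None:
            continue  # stroke-only path, no background color
        x0, y0, x1, y1 = path["rect"]
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            found = fill  # assumption: later paths are painted on top
    return found


# Made-up data: one filled header rectangle and one unfilled rectangle
drawings = [
    {"rect": (50, 100, 300, 120), "fill": (0.8, 0.8, 0.8)},
    {"rect": (50, 120, 300, 140), "fill": None},
]
header_bbox = (52.0, 98.5, 120.0, 119.0)  # top edge pokes out of the rectangle
print(fill_behind(header_bbox, drawings))
```

The header box is not fully contained in the gray rectangle, yet its center is, so the fill is still found. With a real page, the same function could be given `page.get_drawings()` and the `"bbox"` of a block, line or span from the text dictionary.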

Magyar answered 6/1 at 16:29 Comment(0)
