read, highlight, save PDF programmatically
C

3

13

I'd like to write a small script (which will run on a headless Linux server) that reads a PDF, highlights any text matching an array of strings that I pass in, then saves the modified PDF. I imagine I'll end up using something like the Python bindings to poppler, but unfortunately there's next to no documentation and I have next to no experience with Python.

If anyone could point me to a tutorial, example, or some helpful documentation to get me started it would be greatly appreciated!

Camise answered 30/9, 2011 at 3:10 Comment(5)
This is generally not 100% foolproof, as any PDF compiler, even an old and trusty one like pdftex, might draw PDF inlines all over the place. Are you sure that your PDFs can be read in such a way? Spud
The way I see it, the 'find' function in Evince (or most other PDF readers, for that matter) does basically what I want: it highlights matched text in basically any PDF. If it can render such highlighting to the screen, why not render it out to a file? Camise
It's just a little tricky, because PDF doesn't generally provide text flow. It's more like an image: text can appear anywhere. Often it looks good to the reader but is internally a mess. For example, text justification is often achieved by breaking up the text and placing the inlines individually so that it appears justified. So when Evince highlights something, either it is being clever, your PDF is well behaved, or you just got lucky because that particular string resides as a continuous entity in the PDF. Anyway, have a look at itextpdf.com; it's the best free library out there. Spud
Did you ever find an answer to this question? If so, I would like to hear it :) Rollin
For people coming here via Google: How to extract Highlighted Parts from PDF files Graduation
G
4

Yes, it is possible with a combination of pdfminer (pip install pdfminer.six) and PyPDF2.

First, find the coordinates (e.g. like this). Then highlight it:

#!/usr/bin/env python

"""Create sample highlight in a PDF file."""

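# Written against the PyPDF2 1.x API; PyPDF2 3.x renamed these classes
# to PdfWriter and PdfReader.
from PyPDF2 import PdfFileWriter, PdfFileReader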

from PyPDF2.generic import (
    DictionaryObject,
    NumberObject,
    FloatObject,
    NameObject,
    TextStringObject,
    ArrayObject
)


def create_highlight(x1, y1, x2, y2, meta, color=(0, 1, 0)):
    """
    Create a highlight for a PDF.

    Parameters
    ----------
    x1, y1 : float
        bottom left corner
    x2, y2 : float
        top right corner
    meta : dict
        keys are "author" and "contents"
    color : iterable
        Three elements, (r,g,b)
    """
    new_highlight = DictionaryObject()

    new_highlight.update({
        NameObject("/F"): NumberObject(4),
        NameObject("/Type"): NameObject("/Annot"),
        NameObject("/Subtype"): NameObject("/Highlight"),

        NameObject("/T"): TextStringObject(meta["author"]),
        NameObject("/Contents"): TextStringObject(meta["contents"]),

        NameObject("/C"): ArrayObject([FloatObject(c) for c in color]),
        NameObject("/Rect"): ArrayObject([
            FloatObject(x1),
            FloatObject(y1),
            FloatObject(x2),
            FloatObject(y2)
        ]),
        NameObject("/QuadPoints"): ArrayObject([
            FloatObject(x1),
            FloatObject(y2),
            FloatObject(x2),
            FloatObject(y2),
            FloatObject(x1),
            FloatObject(y1),
            FloatObject(x2),
            FloatObject(y1)
        ]),
    })

    return new_highlight


def add_highlight_to_page(highlight, page, output):
    """
    Add a highlight to a PDF page.

    Parameters
    ----------
    highlight : Highlight object
    page : PDF page object
    output : PdfFileWriter object
    """
    highlight_ref = output._addObject(highlight)

    if "/Annots" in page:
        page[NameObject("/Annots")].append(highlight_ref)
    else:
        page[NameObject("/Annots")] = ArrayObject([highlight_ref])


def main():
    pdf_output = PdfFileWriter()

    # Keep the input file open until the output has been written,
    # because PyPDF2 reads page content lazily.
    with open("samples/test3.pdf", "rb") as input_stream:
        pdf_input = PdfFileReader(input_stream)
        page1 = pdf_input.getPage(0)

        highlight = create_highlight(89.9206, 573.1283, 376.849, 591.3563, {
            "author": "John Doe",
            "contents": "Lorem ipsum"
        })

        add_highlight_to_page(highlight, page1, pdf_output)

        pdf_output.addPage(page1)

        with open("output.pdf", "wb") as output_stream:
            pdf_output.write(output_stream)


if __name__ == '__main__':
    main()
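
For the "find the coordinates" step, here is a rough sketch of how it might look with pdfminer.six. The helper names (find_term_boxes, iter_line_chars) are made up for illustration and are not part of either library. It only matches terms that sit within a single text line as pdfminer reconstructs it, so strings that are split across lines or scattered around the page will be missed. It also assumes the page origin is at the bottom-left corner, so the boxes can be passed straight to create_highlight.

```python
def find_term_boxes(line_chars, terms):
    """Return one (x0, y0, x1, y1) box per occurrence of each term in a line.

    line_chars: list of (text, bbox) pairs in reading order; bbox is
    (x0, y0, x1, y1), or None for characters that have no position.
    """
    # Expand to one entry per character so string offsets line up with boxes
    # even when a glyph (e.g. a ligature) carries more than one character.
    flat = [(ch, bbox) for text, bbox in line_chars for ch in text]
    line_text = "".join(ch for ch, _ in flat)

    boxes = []
    for term in terms:
        if not term:
            continue  # an empty search string would match everywhere
        start = line_text.find(term)
        while start != -1:
            span = [b for _, b in flat[start:start + len(term)] if b is not None]
            if span:
                # Union of the character boxes covered by this match.
                boxes.append((
                    min(b[0] for b in span),
                    min(b[1] for b in span),
                    max(b[2] for b in span),
                    max(b[3] for b in span),
                ))
            start = line_text.find(term, start + 1)
    return boxes


def iter_line_chars(pdf_path):
    """Yield (page_index, line_chars) for every text line pdfminer finds."""
    # Imported here so find_term_boxes can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer, LTTextLine

    for page_index, page in enumerate(extract_pages(pdf_path)):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                line_chars = []
                for obj in line:
                    if isinstance(obj, LTChar):
                        line_chars.append((obj.get_text(), obj.bbox))
                    elif isinstance(obj, LTAnno):
                        # Spaces inserted by layout analysis have no box.
                        line_chars.append((obj.get_text(), None))
                yield page_index, line_chars
```

Looping over iter_line_chars("samples/test3.pdf") and calling find_term_boxes(line_chars, ["Lorem ipsum"]) on each line gives boxes that can be unpacked into create_highlight(x0, y0, x1, y1, meta) for the matching page.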
Graduation answered 13/7, 2017 at 13:28 Comment(2)
Hi, are we able to highlight a whole line using only the Y coordinates? For example, give only y1 and mark everything from the left side to the right side? Thanks! Hinrichs
Not sure why this answer isn't upvoted more :) Life-saving stuff. Big thanks to the author. Cowbell
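
On the question above about marking a whole line from its Y coordinates: a single Y value is not enough, because the highlight still needs a height, but the X range can be taken from the page's media box. A minimal sketch, assuming the media box is available as four numbers (llx, lly, urx, ury); full_line_rect is a hypothetical helper, not a PyPDF2 function:

```python
def full_line_rect(media_box, y_bottom, y_top, margin=0.0):
    """Build a rectangle spanning the page width between two Y values.

    media_box: (llx, lly, urx, ury) of the page, in PDF points.
    margin: optional inset from the left and right page edges.
    """
    llx, _, urx, _ = (float(v) for v in media_box)
    return (llx + margin, float(y_bottom), urx - margin, float(y_top))
```

With the answer's code this could be used roughly as x1, y1, x2, y2 = full_line_rect(page1.mediaBox, 573.1283, 591.3563), followed by create_highlight(x1, y1, x2, y2, meta). The attribute name mediaBox is the PyPDF2 1.x spelling; newer releases call it mediabox.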
R
3

Have you tried looking at PDFMiner? It sounds like it does what you want.

Redhot answered 30/9, 2011 at 3:19 Comment(1)
From what I gather, PDFMiner is aimed toward the PDF->text extraction end of things; it doesn't look like it can highlight and render the altered PDF to a file. Camise
A
1

PDFlib has Python bindings and supports these operations. To open an existing PDF you will want PDFlib+PDI (http://www.pdflib.com/products/pdflib-family/pdflib-pdi/), along with TET for the text extraction side.

Unfortunately, it is a commercial product. I have used this library in production in the past and it works great. The bindings are functional in style and not very Pythonic. I have seen some attempts to make them more Pythonic: https://github.com/alexhayes/pythonic-pdflib. The call you will want to start with is open_pdi_document().

It sounds like you will want to do search highlighting of some sort:

http://www.pdflib.com/tet-cookbook/tet-and-pdflib/highlight-search-terms/

Acetabulum answered 19/3, 2015 at 17:56 Comment(0)
