Delete text from PDF using PyMuPDF

Asked 27/4, 2022 at 18:28 Answered 9/5, 2023 at 14:52

I need to remove the text "DRAFT" from a PDF document using Python. I can find the text box containing the text, but I can't find an example of how to edit a PDF text element with PyMuPDF.

In the example below, draft holds the rectangles (coordinates) that page.search_for() returns for the DRAFT text element.

import fitz

fname = r"original.pdf"
doc = fitz.open(fname)
page = doc.load_page(0)

draft = page.search_for("DRAFT")

# insert code here to delete the DRAFT text or replace it with an empty string

out_fname = r"final.pdf"
doc.save(out_fname)

Added 4/28/2022: I found a way to delete the text, but unfortunately it also deletes any overlapping text underneath the box around DRAFT. I really just want to delete the DRAFT letters without modifying the underlying layers.

# insert code here to delete the DRAFT text or replace it with an empty string
rl = page.search_for("DRAFT", quads=True)
page.add_redact_annot(rl[0])

page.apply_redactions()

Vladikavkaz answered 27/4, 2022 at 18:28 Comment(1)

In this case, a map exported from ArcGIS Pro, the DRAFT is just a horizontal text element overlaid on other text. I'm not sure what analyser is – Vladikavkaz 28/4, 2022 at 19:5

You can try this.

import fitz

doc = fitz.open("xxxx")

for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref).replace(b'The string to delete', b'')
        doc.update_stream(xref, stream)
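
For context, a rough sketch of what the loop above relies on: page.get_contents() returns the xrefs of the page's content streams, doc.xref_stream() returns each stream as decompressed bytes, and the replace only matches when the text sits in the stream as one contiguous literal string. The two streams below are hand-written for illustration and are not taken from a real PDF; real producers often split or encode text differently, in which case the replace silently changes nothing.

```python
# Hypothetical content stream: "DRAFT" stored as one literal string operand.
sample = (
    b"BT /F1 48 Tf 100 700 Td (DRAFT) Tj ET\n"
    b"BT /F1 12 Tf 72 650 Td (Body text) Tj ET"
)

# Same idea as the answer: delete the bytes of the word.
# The Tj operator stays in the stream, but its operand becomes an empty string.
cleaned = sample.replace(b"DRAFT", b"")

# Hypothetical stream where the same word is stored as kerned fragments
# inside a TJ array. The byte sequence b"DRAFT" never appears, so the
# replace finds nothing and the stream comes back unchanged.
kerned = b"BT /F1 48 Tf 100 700 Td [(DR) -20 (AFT)] TJ ET"
kerned_cleaned = kerned.replace(b"DRAFT", b"")
```

If the loop appears to do nothing on a given file, printing doc.xref_stream(xref) for one page and searching it by eye is a quick way to see how the text is actually stored.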

Chalcanthite answered 26/9, 2022 at 8:25 Comment(3)

It will be better if you can explain in a few words what your code is doing. – Bubal 29/9, 2022 at 22:11

For anyone else who gets here: this didn't work for my use case. I have a diagonal "draft" text overlaid on the document that I need to remove. The above solution works for deleting horizontal text. – Alveraalverez 28/10, 2022 at 3:1

THANK YOU. Way better than the "annotation solution" that is recommended. – Organon 2/7 at 13:27

This is an example of how to manipulate the text strings on a PDF page by modifying the text-showing commands (the Tj operator).

This example simply removes every text-showing command from the page. In some cases a replacement can be done with a simple bytes.replace(), but in others it is a non-trivial task, because each character may be drawn by a separate command and the commands may not even appear in reading order.

import fitz

# more about text operators:
# https://www.syncfusion.com/succinctly-free-ebooks/pdf/text-operators
def remove_tj(page: fitz.Page):
    doc: fitz.Document = page.parent
    
    xref_page = page.xref
    if xref_page == 0:
      raise RuntimeError("page xref is zero")
    
    props = doc.xref_get_keys(xref_page)
    if 'Contents' not in props:
      raise RuntimeError("no 'Contents' key in page dict")
    
    content = doc.xref_get_key(xref_page, 'Contents')
    
    if content[0] == 'xref':
      if content[1].endswith(' 0 R'):
        contents_xref = int(content[1][:-4]) # 'Contents' is a reference to another xref
      else:
        raise RuntimeError('PDF struct issue #2')
    else:
      raise RuntimeError('PDF struct issue #1')
    
    if not doc.xref_is_stream(contents_xref):
      raise RuntimeError('PDF struct issue #3')
    
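    # page content commands stream (this assumes the commands are separated by ASCII '\r';
    # other PDFs may use '\n' or spaces, in which case this split will not isolate them):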
    cmds: 'list[bytes]' = doc.xref_stream(contents_xref).split(b'\r')
    
    i = 0
    while i < len(cmds):
      if cmds[i].endswith(b' Tj'): # draw string operator
        print(cmds[i][1:-4]) # the operand is usually a parenthesized string, "(text) Tj"; its bytes may be escape-encoded
        # here you can manipulate the text bytes
        # a word may be split across several Tj operators
        cmds.pop(i) # for example this will remove any text operator from the page
      else:
        i += 1
    
    doc.update_stream(contents_xref, b'\r'.join(cmds), new=0, compress=1)
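
The function above drops every Tj command and depends on the stream using '\r' as a separator. As a hedged variation (a regex-based sketch, not the method used above), the helper below removes only the Tj operators whose literal operand equals a given target, regardless of which whitespace separates the commands. The names TJ_LITERAL and drop_tj are made up for this illustration. It deliberately ignores hex strings (<...>), TJ arrays, and unescaped nested parentheses, so it will miss text stored in those forms.

```python
import re

# Matches "(literal) Tj". Backslash escapes inside the string are allowed;
# hex strings, TJ arrays and unescaped nested parentheses are NOT handled.
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj\b", re.DOTALL)


def drop_tj(stream: bytes, target: bytes) -> bytes:
    """Remove only the Tj operators whose literal operand equals `target`."""

    def repl(match: "re.Match[bytes]") -> bytes:
        # Keep every other text-showing command exactly as it was.
        return b"" if match.group(1) == target else match.group(0)

    return TJ_LITERAL.sub(repl, stream)


# Hand-written stream for illustration only (not from a real PDF).
demo = b"BT (DRAFT) Tj ET\rBT (Body \\(text\\)) Tj ET"
result = drop_tj(demo, b"DRAFT")
```

To use it with the function above, pass doc.xref_stream(contents_xref) through drop_tj and hand the result to doc.update_stream(contents_xref, ...), instead of splitting and popping commands.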

Bate answered 9/5, 2023 at 14:52 Comment(0)
