How to find and replace text in an existing PDF file with PDFTK (or another command-line application) [closed]
I have on each page of my PDF document a line with this string:

%REPLACE%

Which I'd like to find and replace with another string.

Does anyone know how to do this with some command line application such as PDFTK?

This person gave me an important clue; however, I'd like something more direct.

Thanks.

Apologetic answered 26/3, 2012 at 11:52 Comment(2)
Does this answer your question? How to program a text search and replace in PDF files - Abbey
I added an answer to the above question with a custom program I wrote for this purpose: https://mcmap.net/q/376836/-how-to-program-a-text-search-and-replace-in-pdf-files - Abbey
O
53

You can try to modify the content of your PDF as follows:

  1. Uncompress the text streams of PDF

    pdftk file.pdf output uncompressed.pdf uncompress
    
  2. Use sed to replace your text with another

    sed -e "s/ORIGINALSTRING/NEWSTRING/g" <uncompressed.pdf >modified.pdf
    
  3. If this attempt was successful, re-compress the PDF with pdftk

    pdftk modified.pdf output recompressed.pdf compress
    

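Note: This does not always work, mainly because of font subsetting. A subset font embeds only the glyphs the document already uses, and the text is often stored as glyph codes rather than readable characters, so the search string may never appear literally in the file.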
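
The three steps can be wrapped in one small shell function. This is only a sketch under a few assumptions: `replace_pdf_text` is a made-up helper name, the `grep` pre-check and `LC_ALL=C` are additions to the steps above (the latter forces byte-wise matching, which tends to avoid sed's encoding complaints on binary data), and the search and replacement strings must not contain characters that are special to sed (`/`, `&`, `\`, regex metacharacters).

```shell
#!/bin/sh
# Sketch: uncompress, replace, re-compress in one call.
# replace_pdf_text is a hypothetical helper, not part of pdftk.
replace_pdf_text() {
    src=$1; search=$2; replace=$3; dst=$4
    tmpdir=$(mktemp -d) || return 1

    # 1. Uncompress the page streams so text operators become readable.
    pdftk "$src" output "$tmpdir/uncompressed.pdf" uncompress ||
        { rm -rf "$tmpdir"; return 1; }

    # 2. Stop early if the string is not stored literally
    #    (typical for subset fonts or hex-encoded text).
    if ! LC_ALL=C grep -a -q -F -- "$search" "$tmpdir/uncompressed.pdf"; then
        echo "not found in uncompressed streams: $search" >&2
        rm -rf "$tmpdir"
        return 2
    fi

    # 3. Replace byte-wise; assumes no sed-special characters in the strings.
    LC_ALL=C sed -e "s/$search/$replace/g" \
        "$tmpdir/uncompressed.pdf" > "$tmpdir/modified.pdf"

    # 4. Re-compress the streams into the final file.
    status=0
    pdftk "$tmpdir/modified.pdf" output "$dst" compress || status=$?
    rm -rf "$tmpdir"
    return $status
}
```

Usage would look like `replace_pdf_text file.pdf '%REPLACE%' 'New text' recompressed.pdf`. A return code of 2 means the string is not present as plain text, in which case this approach cannot work for that file.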

Oliphant answered 26/3, 2012 at 12:54 Comment(8)
I can't make this work with the PDF file exported from Google Docs (even when I choose Arial as the only font). I am afraid that I'd have to use some other application just to write the page and then try the very simple and wonderful code you wrote... - Apologetic
With pdfedit you have a better chance of editing the text content (if the fonts are fully embedded): pdfedit.cz/en/index.html - Oliphant
pdfedit can also be used from the command line without the GUI (see its site for the command-line utilities) - Oliphant
Note that this will only work when the text uses the Tj command in the PDF along with plain ASCII chars. As soon as octal, hex or glyph references are used, you are lost. - Priebe
For anyone with a Mac M1, this might be useful: #60860027 - Pterodactyl
Because of encoding issues, I had to replace sed with perl -pi.bak -e 's/findthis/replacewiththis/g' uncompressed.pdf (from https://mcmap.net/q/234453/-how-to-replace-a-string-in-an-existing-file-in-perl) - Almemar
Can sed use a regex here? Without a regex, it works. But with a regex it says: "Error: Unable to find file. Error: Failed to open PDF file: modified.pdf Errors encountered. No output created. Done. Input errors, so no output created." - Lussi
I suspect pdfbox has something available that would help with the font subsetting. I have an example to start working with: Forked from gist.github.com/DavidYKay/82f20ba67c50c499ebb3 from * jackson-brain.com/… - Dailey
S
1

For making a small change on just a few pages, Inkscape can do a good job. It can also fix some issues in diagrams and with table borders. You have to process each page separately, though, and stitch the pages back together using pdfunite. (Unchanged page ranges can be extracted with pdfseparate.)

Inspiration: https://tatica.org/2015/07/13/edit-pdf-inkscape/
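
The split and re-join around the Inkscape edit can be scripted with the two poppler tools named above. A rough sketch, assuming `pdfseparate` and `pdfunite` are installed; `split_pages`, `join_pages` and the `page-%04d.pdf` naming are illustrative choices, and the editing in Inkscape is still done by hand.

```shell
#!/bin/sh
# Split every page into its own file. Zero-padded numbers keep the
# shell glob in page order (page-0002 before page-0010).
split_pages() {
    # $1 = input PDF, $2 = directory for the per-page files
    mkdir -p "$2" && pdfseparate "$1" "$2/page-%04d.pdf"
}

# Join the per-page files back into one document, in page order.
# Run this after saving the edited pages over the originals,
# for example exporting from Inkscape to page-0003.pdf.
join_pages() {
    # $1 = directory with the per-page files, $2 = output PDF
    pdfunite "$1"/page-*.pdf "$2"
}
```

For example: `split_pages file.pdf pages`, edit `pages/page-0003.pdf` in Inkscape and save it as PDF under the same name, then `join_pages pages edited.pdf`.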

Signe answered 10/6, 2021 at 14:55 Comment(2)
For simple changes, this works with Inkscape. Inkscape 1.2 (released on 2022-02-05) supports multi-page PDF documents for both import and export, so pdfunite is no longer needed. To be able to edit the text, one first needs to do an Ungroup on the object that consists of a full PDF page. - Burgin
"Inkscape encountered an internal error and will close now" - Necrolatry
W
-1

changepagestring will do this in a single step, as easy as:

changepagestring -o -v infile.pdf search-regex replace-str outfile.pdf

However, like the currently accepted answer, this is hit or miss and doesn't work as expected with all files.

Whoremaster answered 5/6, 2021 at 1:26 Comment(4)
Yes, sadly this didn't work with my file. It could find 2 letters but not the whole word I wanted to find. - Explicative
I've been finding that when this fails, it's often just a matter of finding the right regex. I haven't figured out if there's a way to see the text exactly as needed to understand how it works, but a regex like 'word1*word2' may work where 'word1 word2' fails. - Whoremaster
In Debian/unstable, changepagestring does not work at all (I've tried on a single word, so this is simpler than a regexp), even on a simple PDF file obtained with pdflatex, for which pdftotext can find the word. Debian bug 1019979. - Burgin
Didn't work for my PDF: couldn't find even a 2-letter word, and when using one letter, the output lost most of the formatting, which should not be touched by a search & replace. - Necrolatry
