Can I bulk-remove links from a pdf from the command line?

I'm downloading some newspapers as PDFs (for posterity). One title is a pain: it includes URI links in the PDF itself, and if you accidentally click one it opens a browser tab to a page that returns a 500 error. It's not so bad on a desktop computer, but it's a pain in the butt if someone is reading on a tablet. Each issue has approximately 200 of these links.

For a different title, it was as simple as using QPDF, like so:

qpdf --qdf --object-streams=disable file temp-file

This puts the temp version into postscript mode or something, and I was able to nuke the links with something like this:

s/obj\n<<\n(  \/A <<\n    \/S \/URI.+?)>>\nendobj/"obj\n<<\n" . " " x length($1). ">>\nendobj"/sge

This still works. However, a 15 MB original PDF now becomes a 108 MB "fixed" PDF. I can accept some bloat, but ending up at 720% of the original size is a bit absurd (I think the increase was more like 10% on the other title). Whenever I google for how to do this, I get results for Acrobat Reader and how you can click around in 20 menus to do it... does no one who uses Adobe products ever want to automate this stuff? There are between 180 and 300 links in a typical issue, spread across 45-150 pages (Sunday editions).

Are there any tools that can do this? Are there any clever arguments to qpdf that will make this more reasonable?

PS Yes I know it's hacky as hell to just overwrite the URIs with spaces, but I've never managed to figure out how to remove the objects entirely since their references also have to be removed.

Wabble answered 4/9, 2022 at 3:22

You can do this with the community edition of cpdf: https://community.coherentpdf.com/

To remove all links in a PDF (well, to replace them with an empty link):

cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '""' -o out.pdf

This does not remove the annotations: it leaves each one in place with an empty link, so clicking it won't go anywhere. You could replace it with a working URL too, of course:

cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '"https://www.google.com/"' -o out.pdf

(You can also use -replace-dict-entry-search to replace only certain URLs - see the manual.)

Or, if you just want rid of all the annotations (link and non-link):

cpdf -remove-annotations in.pdf -o out.pdf

Synaesthesia answered 4/9, 2022 at 13:16

Comment (Synaesthesia): The difficulty with round-tripping annotations is deciding which objects to output to JSON, or you might end up getting the whole file. For the future, I noticed the following in the ISO standard: "In addition, beginning in PDF 1.3, FDF can be used to define a container for annotations that are separate from the PDF document to which they apply." So FDF files might be a solution. Of course, you can round-trip the whole file through JSON with -output-json and -j at the moment, which I forgot to mention in my answer; that would be an alternative method.

You can use HexaPDF (you need to have Ruby installed and then use gem install hexapdf to install HexaPDF) and the following small script to remove the links:

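require 'hexapdf'

# Usage: ruby <this script> input.pdf
# Writes the result next to the input as input.pdf_processed.pdf.
HexaPDF::Document.open(ARGV[0]) do |doc|
  doc.pages.each do |page|
    # Collect the link annotations first, then drop each one from the page's /Annots array.
    page.each_annotation.select {|annot| annot[:Subtype] == :Link}.each do |annot|
      page[:Annots].delete(annot)
    end
  end
  # optimize: true asks HexaPDF to reduce the size of the written file.
  doc.write(ARGV[0] + '_processed.pdf', optimize: true)
end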

Then batch execute the script for all the files you want the links removed.

Note that this will remove all link annotations, including links that point inside the document, not only the URI links.
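
The batch step can be scripted in several ways. As a minimal sketch in Python, assuming the Ruby script above has been saved as remove_links.rb (a made-up file name) and that ruby is on the PATH, something like this would run it over every PDF in a directory. The function names are illustrative only.

```python
import subprocess
from pathlib import Path

SCRIPT = "remove_links.rb"  # hypothetical file name for the Ruby script above


def build_commands(directory):
    """Return one `ruby SCRIPT file.pdf` command per unprocessed PDF in directory."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    # The Ruby script writes <input>_processed.pdf, so skip those on re-runs.
    return [["ruby", SCRIPT, str(p)] for p in pdfs
            if not p.name.endswith("_processed.pdf")]


def run_all(directory):
    """Run the script on each PDF, stopping at the first failure."""
    for cmd in build_commands(directory):
        subprocess.run(cmd, check=True)
```

Calling run_all("issues") would process every PDF in a directory named issues. Building the command list separately makes it easy to print the commands for a dry run first.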

Erato answered 4/9, 2022 at 10:35

Just to round off the options: the best choice is probably a dedicated PDF command-line tool, such as cpdf (see the answer by johnwhitington), or a dedicated library like iText.

There are several alternative methods touted for batch text editing, like the qpdf one you're using.

"temp version into postscript mode or something,"

That is the PDF converted into QDF, a plain, decompressed text/PDF hybrid, so that you can run sed or a similar string editor on it. The primary difference is that the upper file, out.pdf, still shows as an editable QDF-1.0 file after editing, so it needs converting to a conventional PDF, as seen in the lower part, where the stream is binary again because it has been recompressed.

[Screenshot: the edited QDF-1.0 file (top) and the recompressed PDF with binary streams (bottom)]

1) qpdf
At the end of a bloating edit, the idea is to convert back to an ordinary application/pdf file. First run

fix-qdf file-temp.pdf > out.pdf

to repair the stream lengths and xref table, and then

qpdf --compress-streams=y out.pdf outfixed.pdf

to recompress the streams into outfixed.pdf.

Other cross-platform options are:

2) pdftk

$ pdftk infile.pdf output outfile.pdf uncompress

Edit with vim, or with whatever sed scripting method you prefer, then

$ pdftk outfile.pdf output fixedfile.pdf compress

3) mutool

mutool clean -d [options] input.pdf [output.pdf] [pages]

-d Decompress streams. This will make the output file larger, but provides easy access for reading and editing the contents with a text editor.
-i Toggle decompression of image streams. Use in conjunction with -d to leave images compressed.
-f Toggle decompression of font streams. Use in conjunction with -d to leave fonts compressed.
-a ASCII Hex encode binary streams. Use in conjunction with -d and -i or -f to ensure that although the images and/or fonts are compressed, the resulting file can still be viewed and edited with a text editor.

Whichever options you use need to be reversed when recompressing.
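
For option 1, the whole qpdf round trip can be scripted rather than typed by hand. The following is a minimal Python sketch, assuming qpdf and fix-qdf are on the PATH. The function names, the temporary file names, and the edit callback (whatever byte-level substitution you apply to the QDF text) are illustrative and not part of qpdf.

```python
import subprocess
from pathlib import Path


def to_qdf_cmd(src, qdf):
    """qpdf call from the question: expand the PDF into editable QDF form."""
    return ["qpdf", "--qdf", "--object-streams=disable", str(src), str(qdf)]


def recompress_cmd(fixed, dst):
    """qpdf call from this answer: recompress the streams into a normal PDF."""
    return ["qpdf", "--compress-streams=y", str(fixed), str(dst)]


def roundtrip(src, dst, edit):
    """Expand src to QDF, apply edit(bytes) -> bytes, repair, recompress to dst."""
    dst = Path(dst)
    qdf = dst.with_name(dst.name + ".qdf")      # hypothetical temp file names
    fixed = dst.with_name(dst.name + ".fixed")
    subprocess.run(to_qdf_cmd(src, qdf), check=True)
    # Read and write as bytes so binary streams are never re-encoded.
    qdf.write_bytes(edit(qdf.read_bytes()))
    # fix-qdf writes the repaired file to standard output.
    with open(fixed, "wb") as out:
        subprocess.run(["fix-qdf", str(qdf)], stdout=out, check=True)
    subprocess.run(recompress_cmd(fixed, dst), check=True)
    qdf.unlink()
    fixed.unlink()
```

Comparing the size of the final file with the original is a quick way to confirm that the recompression step actually ran.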

NOTE

Text editors can corrupt binary fonts and binary images, so check the result for corruption whenever the editor changes the encoding or the line endings. The pdftk sample in the screenshots shows an image stream that has been decompressed cleanly into simple text, but beware: any change of end-of-line characters by the editor would break up that stream.

[Screenshots: pdftk-uncompressed image stream viewed as text]

Additionally, when text edits are not simple byte-for-byte "find and replace" operations, the xref table can become too corrupted to be reindexed during recompression. Try to overwrite with the same number of characters when using a text-editing method.
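
The same-length rule can be enforced in code rather than by eye. The sketch below is a Python rendering of the Perl substitution from the question, working on bytes so that no encoding or line-ending conversion happens. The function name is made up for illustration, and the pattern assumes the exact QDF layout that the question's regex relies on (the /A action dictionary is the first entry of the annotation object).

```python
import re

# Same pattern as the Perl one-liner in the question: an object whose
# dictionary starts with a /URI action, up to the closing ">>\nendobj".
LINK_OBJ = re.compile(rb"obj\n<<\n(  /A <<\n    /S /URI.+?)>>\nendobj", re.DOTALL)


def blank_uri_objects(qdf: bytes) -> bytes:
    """Overwrite each matching dictionary body with spaces, byte for byte."""
    def spaces(match):
        return b"obj\n<<\n" + b" " * len(match.group(1)) + b">>\nendobj"

    out = LINK_OBJ.sub(spaces, qdf)
    # Every byte offset after an edit must stay where it was.
    assert len(out) == len(qdf)
    return out
```

To use it, read the QDF file with open(path, "rb"), pass the bytes through blank_uri_objects, and write the result back with open(path, "wb"). Binary mode matters here, because text mode could translate line endings and shift the offsets.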

SIDE NOTE
Even if you remove the actions and the external hyperlink actions, if the URL text is still present the reader will still offer that exploitable action. The same happens here with https://google.com, although HTML will usually highlight it with a blue underline.

Hence ensure security is on in the reader.

Whither answered 4/9, 2022 at 15:56