Remove / Delete all images from a PDF using Ghostscript or ImageMagick
M

5

8

I want to remove all the images from a PDF, leaving only the text, using whatever command-line tool can do it.

I tried -dGraphicsAlphaBits=1 in a Ghostscript command, but the images are still present; they just look like big pixels.

Maieutic answered 19/12, 2013 at 8:29 Comment(3)
Basically you can't do this, you would need to modify the pdfwrite device to drop images.Dilorenzo
@kenS , ok sure, I will have a look into that.Maieutic
Also see this question.Barbaresi
H
4

No, AFAIK, it's not possible to remove all images in a PDF with a commandline tool.

What's the purpose of your request anyway? Save on filesize? Remove information contained in images? Or ...?

Workaround

Whatever you are aiming at, here is a command that downsamples all images to a resolution of 2 ppi (Update: 1 ppi doesn't work). This achieves two goals at once:

  • reduce file size
  • make all images basically incomprehensible

Here's how to do it selectively, for only the images on page 33 of original.pdf (note that -dFirstPage and -dLastPage also limit the output file to that page):

gs                               \
  -o images-uncomprehendable.pdf \
  -sDEVICE=pdfwrite              \
  -dDownsampleColorImages=true   \
  -dDownsampleGrayImages=true    \
  -dDownsampleMonoImages=true    \
  -dColorImageResolution=2       \
  -dGrayImageResolution=2        \
  -dMonoImageResolution=2        \
  -dFirstPage=33                 \
  -dLastPage=33                  \
   original.pdf

If you want to do it for all images on all pages, just omit the -dFirstPage and -dLastPage parameters.
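The page-range choice described above can be folded into a small shell function. This is only a sketch: the function name downsample_images and the GS override variable are made up for illustration, and the switches are exactly the ones from the command above.

```shell
# Sketch: wrap the downsampling command in a reusable function.
# Usage: downsample_images INPUT.pdf OUTPUT.pdf [FIRST_PAGE [LAST_PAGE]]
# GS is a hypothetical override for the Ghostscript binary (default: gs).
downsample_images() {
    src=$1
    dst=$2
    first=${3:-}
    last=${4:-$first}   # a single page number means "that page only"

    # Collect the switches in the positional parameters.
    set -- -o "$dst" -sDEVICE=pdfwrite \
        -dDownsampleColorImages=true \
        -dDownsampleGrayImages=true \
        -dDownsampleMonoImages=true \
        -dColorImageResolution=2 \
        -dGrayImageResolution=2 \
        -dMonoImageResolution=2

    # Restrict the page range only when one was requested.
    if [ -n "$first" ]; then
        set -- "$@" "-dFirstPage=$first" "-dLastPage=$last"
    fi

    "${GS:-gs}" "$@" "$src"
}
```

Calling downsample_images original.pdf out.pdf 33 should be equivalent to the command above; leaving out the page number processes the whole document.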

If you want to remove all color information from images, convert them to grayscale in the same command (search other answers on Stack Overflow where the details are discussed).


Update: Originally, I had proposed to use a resolution of 1 PPI. It seems this doesn't work with Ghostscript. I now tested with 2 PPI. This works.


Update 2: See also the following (new) question with the answer:

It provides some sample PostScript code which completely removes all (raster) images from the PDF, leaving the rest of the page layout unchanged.

It also reflects the expanded new capabilities of Ghostscript which can now selectively remove either all text, or all raster images, or all vector objects from a PDF, or any combination of these 3 types.

Harrow answered 19/12, 2013 at 12:38 Comment(4)
Thanks a lot @Kurt, I really wanted you to answer my question, as it seems you are the only expert in processing PDFs. Actually my final aim is to generate two images, one containing the image layer and the other containing only the text layer. Removing the background is just a step towards that final aim.Maieutic
But it actually is possible via a commandline tool, e. g. via cpdf. And there may be plenty of reasons why it is done - for example I can give you my reason, which is why I searched for this - I need to prepare for an exam but the image files are useless after knowing them already, so I just focus on the text first; and then as the second step, from that text, make notes what is worthy to be memorized and what is not. I can also think of many more possible reasons but I think on stackoverflow it is best to not ask WHY but to simply provide a solution that works.Fidole
@shevy: Please take note of the following facts: (1) The OP asked specifically for a Ghostscript or an ImageMagick solution. (2) My answer provided exactly what was asked for. I did provide it after John Whitington's pointing to cpdf (his own self-made tool, which is excellent!), because cpdf is not universally available as is Ghostscript (3) cpdf is a payware tool. Even though there is a free-of-charge version ("community edition"), this one is only legal to use for non-commercial purposes. (4) I did not ask for your reason -- I asked the OP, because it may be useful to know in...Harrow
@shevy: (/continued) ...in order to shape the answer accordingly. For example, if the main purpose of this question was to minimize file size then there may be other (additional) methods than just to remove images... (5) "StackOverflow [...] is best to [...] simply provide a solution that works". Thanks for the hint, mate. I would never have thought of that. Looking forward for all YOUR solutions that work! (6) And thanks for your downvote, anyway.Harrow
Q
21

You can use the draft option of cpdf:

cpdf -draft in.pdf -o out.pdf

This should work in most situations, but file a bug report if it doesn't do the right thing for you.
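If many files need the same treatment, the command can be looped over a directory. A minimal sketch, assuming all inputs end in .pdf; the function name draft_all and the CPDF override variable are invented for this example, and only the -draft and -o options shown above are used.

```shell
# Sketch: run "cpdf -draft" over every PDF in a directory.
# Usage: draft_all SRC_DIR DST_DIR
# CPDF is a hypothetical override for the cpdf binary (default: cpdf).
draft_all() {
    src_dir=$1
    dst_dir=$2
    mkdir -p "$dst_dir" || return 1
    for f in "$src_dir"/*.pdf; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Write each result under the same file name in DST_DIR.
        "${CPDF:-cpdf}" -draft "$f" -o "$dst_dir/$(basename "$f")" || return 1
    done
}
```

The loop stops at the first file cpdf fails on, so a problem file is easy to spot.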

Disclosure: I am the author of cpdf.

Quickwitted answered 20/12, 2013 at 11:23 Comment(2)
Thanks, this works quite well. It successfully removes all the images from the PDF. Next I tried to remove fonts from the PDF using the command cpdf -remove-fonts in.pdf -o out.pdf, but it leaves corrupted fonts / black blobs. Will look into that.Maieutic
I tried the same technique, and all of the images did get removed, but the text is selectable but not visible. Any idea how to deal with that?Rocca
H
18

Time has passed, and development of Ghostscript has progressed...

The latest releases have the following new command line parameters. These can be added to the command line:

  1. -dFILTERIMAGE: produces an output where all raster images are removed.

  2. -dFILTERTEXT: produces an output where all text elements are removed.

  3. -dFILTERVECTOR: produces an output where all vector drawings are removed.

Any two of these options can be combined.

Example command:

gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf

More details (including some illustrative screenshots) can be found in my answer to "How can I remove all images from a PDF?".
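For the goal mentioned in the comments (one picture with only the text, one with only the images), the filter switches can probably be combined with a raster output device instead of pdfwrite. The following is an untested sketch under that assumption: the function name render_layers, the output file names and the GS override variable are made up, while png16m, -r and the %03d page counter are standard Ghostscript features.

```shell
# Sketch: render each page twice, once text-only and once images-only.
# Usage: render_layers INPUT.pdf [DPI]
# Assumption: the -dFILTER* switches also apply to raster devices such as png16m.
# GS is a hypothetical override for the Ghostscript binary (default: gs).
render_layers() {
    src=$1
    dpi=${2:-150}

    # Text only: drop raster images and vector drawings.
    "${GS:-gs}" -o "text-%03d.png" -sDEVICE=png16m "-r$dpi" \
        -dFILTERIMAGE -dFILTERVECTOR "$src" || return 1

    # Images only: drop text and vector drawings.
    "${GS:-gs}" -o "images-%03d.png" -sDEVICE=png16m "-r$dpi" \
        -dFILTERTEXT -dFILTERVECTOR "$src"
}
```

Each page yields one text-NNN.png and one images-NNN.png in the current directory.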

Harrow answered 16/6, 2016 at 16:33 Comment(1)
That's great! Really, thanks. It save my life.Forewent
W
4
gs -o noImages.pdf    -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
gs -o noText.pdf      -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
gs -o noVectors.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf
gs -o onlyImages.pdf  -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
gs -o onlyText.pdf    -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVectors.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf
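
The six variants above follow one pattern: "no X" sets the filter for X, and "only X" sets the filters for the other two types. A small sketch that encodes this mapping; the mode names, the function names filter_flags and gs_filter, and the GS override variable are invented for illustration.

```shell
# Sketch: map a mode name to the matching -dFILTER* switches.
filter_flags() {
    case $1 in
        no-images)    printf '%s\n' "-dFILTERIMAGE" ;;
        no-text)      printf '%s\n' "-dFILTERTEXT" ;;
        no-vectors)   printf '%s\n' "-dFILTERVECTOR" ;;
        only-images)  printf '%s\n' "-dFILTERVECTOR -dFILTERTEXT" ;;
        only-text)    printf '%s\n' "-dFILTERVECTOR -dFILTERIMAGE" ;;
        only-vectors) printf '%s\n' "-dFILTERIMAGE -dFILTERTEXT" ;;
        *) printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Usage: gs_filter MODE INPUT.pdf OUTPUT.pdf
# GS is a hypothetical override for the Ghostscript binary (default: gs).
gs_filter() {
    flags=$(filter_flags "$1") || return 1
    # $flags is left unquoted on purpose so sh/bash split it into separate switches.
    "${GS:-gs}" -o "$3" -sDEVICE=pdfwrite $flags "$2"
}
```

For example, gs_filter only-text input.pdf onlyText.pdf should run the same command as the onlyText line above.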
Watford answered 30/12, 2021 at 4:51 Comment(0)
H
2

Unfortunately, there is no Free/Open Source Software utility available for separating images and text into different layers. Nor is there a free-as-in-beer one...

This task can only be achieved with various payware software solutions. Since you didn't exclude this in your question, but you asked for 'whatever commandline tool possible', I'll tell you my favorite one:

A version for CLI usage (which includes a powerful SDK enabling lots of low-level PDF manipulations) is available, and this is supported on all major OS platforms, including Linux.

callas offers you a fully featured gratis test license which is enabled for (I believe) 14 days.

Harrow answered 19/12, 2013 at 18:20 Comment(7)
I too understand that it may not be possible to find an easy way. But I had partial success generating a background-only image using ImageMagick: I just used "-blur 0x0" and it generated a background-only image. I understand that it's not the proper way and that results may vary between PDFs. I am just checking whether I can reverse the effect so that the text remains next time. I will definitely try out 'callas'; it's trial-ware for 7 days. I might end up buying it if it works as expected and isn't too heavy on the pocket.Maieutic
ImageMagick processes raster images only. In so far as it takes PDF as input... no, it doesn't take PDF itself, it calls Ghostscript as its delegate to convert the pages to a series of images first; for outputting PDF it again wraps the raster image into a thin PDF shell. Once data passes through ImageMagick, you only have raster data left. Just like after you turn a steak into minced meat: there is no way back to the original steak any more. I can tell you for sure that there is no way to employ ImageMagick to separate text and images occurring on the same PDF page into separate layers...Harrow
@codin: I'm now not so sure if we have the same understanding of 'layers' for a PDF. In the PDF specification, layers are also named Optional Content Groups (OCG). Do you mean this?Harrow
@codin: Can you supply a sample PDF (with one or only a few pages) where you want to separate images and text into different layers?Harrow
@codin: I seriously doubt that ImageMagick's -blur 0x0 will turn a mixed text/image PDF page into a file where you only see pixels from the image, and none from text....Harrow
Just finished installing GS and ImageMagick on my home PC. I tried convert -blur 0x0 in.pdf out.png on the same PDF, but it doesn't produce the image-only output here. Looks like a bug on my work PC.Maieutic
@codin: Even if it works, it will not get you anywhere, because you'll have image and text in one output file. Having said that, depending on your version of IM, you may need to use a different order of the command line arguments: convert in.pdf -blur 0x0 out.png.Harrow
