How to extract images from a PDF in their original format
B

6

10

I'm using pdfimages -j bar.pdf /tmp/image to extract images from a PDF. My objective is to get them in their raw state, as they were added. So if it was a .tif I'd like to get a .tif, and if it was a .jpg I'd like to get a .jpg. I keep getting .ppm for everything I extract.

Is it possible to get images in their original format, or is ppm my only option?

Update: My primary objective for wanting to do this is to check the DPI of all of the images included in the document, or, check to see if they're vector.

Brace answered 25/1, 2013 at 13:4 Comment(0)
D
7

You can't (reliably) know the source image file format by looking at an image in a PDF. For example, TIFF images can be compressed with (off the top of my head) none, RLE, CCITT (a couple of variations), LZW, Flate, or JPEG. If an image in a PDF is compressed with DCT (JPEG), how do you decide whether the source was TIFF or JPEG? If it is compressed with Flate, how do you distinguish between TIFF and PNG? Further, it is the software generating the PDF which decides the compression, so I can take a Flate-compressed TIFF image and encode it into a PDF using JPEG2000, or take a CCITT-compressed image and compress it with JBIG2, or take a JPEG image, reduce it to an 8-bit paletted image and compress it with Flate.

TL;DR you can't know.

Degreeday answered 25/1, 2013 at 13:56 Comment(5)
My objective with getting the original file is I'd like to be able to check the DPI of all the images uploaded to ensure a minimum of 300 DPI. When I use Imagick's identifyImage (php.net/manual/en/imagick.identifyimage.php) it does not supply the resolution of the image, only width/height.Brace
PDF images don't have a resolution per se. Images are defined by a 2D set of samples with width and height. The effective resolution is how a particular image is placed on any given page and how that page is presented to the user. So I can place a 96 by 96 image in a 1 inch square and have 96 dpi, or I can put it in a 2 inch square and it will be 48 dpi.Degreeday
I'm trying to programmatically determine if the images are high res enough to be printed accurately. Are you saying that's not possible without knowing how the image is laid out in the document? i.e. - I can't just check the image itself.Brace
You can get the dimensions of the image and guess based on what size it is intended to be printed at.Degreeday
It's true you can't know what was the format of the image before it was inserted in the PDF, but you can certainly inspect the PDF file to know the format that was used to store the image inside the document -- which is what Kurt Pfeifle explains in his answer and probably what this question was about.Benavides
R
10

First, what in PDF parlance is called an 'image' is by definition always a raster image. There's no such thing as a 'vector image'. Even if the original file which was converted to PDF included vector graphics, the converter program could have decided to include these as a raster image. If you extract this, you'll not get your vector graphics back, but a raster image. Vector graphics which are preserved inside a PDF as such cannot be extracted by pdfimages.

Second, you do not need to actually extract the images using pdfimages. Provided you're using a current version (later than v0.20.2) of the 'Poppler' fork of pdfimages you can use the -list parameter to get a list of all images on a certain range of PDF pages:

pdfimages -list -f 7 -l 8  ct-magazin-14-2012.pdf

  page   num  type   width height color comp bpc  enc interp  object ID
  ---------------------------------------------------------------------
     7     0 image     581   838  rgb     3   8  jpeg   no        39  0
     7     1 image       4     4  rgb     3   8  image  no        40  0
     7     2 image     314   332  rgb     3   8  jpx    no        44  0
     7     3 image     358   430  rgb     3   8  jpx    no        45  0
     7     4 image       4     4  rgb     3   8  image  no        46  0
     7     5 image       4     4  rgb     3   8  image  no        47  0
     7     6 image       4     6  rgb     3   8  image  no        48  0
     7     7 image     596   462  rgb     3   8  jpx    no        49  0
     7     8 image       4     6  rgb     3   8  image  no        50  0
     7     9 image       4     4  rgb     3   8  image  no        51  0
     7    10 image       8    10  rgb     3   8  image  no        41  0
     7    11 image       6     6  rgb     3   8  image  no        42  0
     7    12 image     113    27  rgb     3   8  jpx    no        43  0
     8    13 image     582   839  gray    1   8  jpeg   no      2080  0
     8    14 image     344   364  gray    1   8  jpx    no      2079  0

Note again: this version of pdfimages is the one from Poppler (the one from XPDF does not (yet?) support this new feature).

As you can see this lists the respective widths and heights of the images. This however does not (yet) give you any clue about the DPI. If a large raster image is squeezed into a small space on the PDF page, your DPI value would be quite high. (This is what plinth's comment to his own answer also emphasizes...)

In order to calculate the DPI, you'll have to measure the width/height of the image as it is displayed on the page (you can do that with one of the tools in Acrobat/Reader) and then use the respective info from the above output to calculate the DPI.
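The arithmetic itself is simple once you have both numbers. Here is a minimal sketch, assuming the displayed size is measured in PDF points and that the page uses the default of 72 points per inch (a page with a custom UserUnit would need a different constant). The function name `effective_dpi` is made up for illustration; the sample values are the 96-pixel example from the comments above.

```python
POINTS_PER_INCH = 72.0  # assumption: default PDF user space, 1 unit = 1/72 inch


def effective_dpi(pixels: int, placed_points: float) -> float:
    """Resolution along one axis: image samples divided by placed length in inches."""
    if placed_points <= 0:
        raise ValueError("placed size must be positive")
    return pixels / (placed_points / POINTS_PER_INCH)


# A 96-pixel-wide image placed 1 inch (72 pt) wide, then 2 inches (144 pt) wide.
print(effective_dpi(96, 72))   # 96.0
print(effective_dpi(96, 144))  # 48.0
```

Run it once for the width and once for the height; the two values can differ if the image is stretched on the page.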


Update

Recent versions of pdfimages now directly show the actual resolution in DPI of the included images in additional columns. Obtaining this info was the original goal of the question:

  pdfimages -list -f 6 -l 7 example.pdf
  page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------------
     6     0 image    1901  1901  rgb     3   8  image  no       632  0  1818  1818  468K 4.4%
     6     1 image    1901  1901  rgb     3   8  image  no       645  0  1818  1818  521K 4.9%

The new output format additionally shows the respective horizontal and vertical resolutions for each image ('x-ppi', 'y-ppi'). It also gives the actual size of images in terms of storage ('size') and their compression ratios ('ratio').
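If you want to run the 300 DPI check from the question programmatically, the listing can be parsed with a few lines of Python. This is only a sketch under assumptions: it expects the newer column layout shown above (with 'x-ppi' and 'y-ppi'), it reads those two columns by counting from the end of each row, and the helper names (`list_images`, `parse_image_list`, `below_minimum`) are invented for this example rather than part of Poppler.

```python
import subprocess


def list_images(pdf_path: str) -> str:
    """Run `pdfimages -list` and return its text output (needs Poppler on PATH)."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_image_list(listing: str) -> list[dict]:
    """Turn the table printed by `pdfimages -list` into one dict per image."""
    if "x-ppi" not in listing:
        raise ValueError("no x-ppi/y-ppi columns; this pdfimages may be too old")
    rows = []
    for line in listing.splitlines():
        tokens = line.split()
        # Data rows start with a page number; the header and rule lines do not.
        if len(tokens) < 15 or not tokens[0].isdigit():
            continue
        rows.append({
            "page": int(tokens[0]),
            "num": int(tokens[1]),
            "width": int(tokens[3]),
            "height": int(tokens[4]),
            # Assumption: the last four columns are x-ppi, y-ppi, size, ratio.
            "x_ppi": float(tokens[-4]),
            "y_ppi": float(tokens[-3]),
        })
    return rows


def below_minimum(rows: list[dict], min_ppi: float = 300.0) -> list[dict]:
    """Return the images whose lower resolution falls under min_ppi."""
    return [r for r in rows if min(r["x_ppi"], r["y_ppi"]) < min_ppi]
```

Something like `below_minimum(parse_image_list(list_images("example.pdf")))` would then give the images that fail the threshold; an empty list means every listed image passed.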

(Thanks to @Eric for suggesting an update hinting at these new features of pdfimages.)

Rashidarashidi answered 24/2, 2013 at 16:40 Comment(0)
C
2

I agree with plinth: you probably can't determine the original image format used. ppm is not your only output option, though.

Pdfimages reads the PDF file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

http://linux.die.net/man/1/pdfimages

In addition, you can of course change the format afterwards, e.g. using ImageMagick's convert.

Carsoncarstensz answered 25/1, 2013 at 14:8 Comment(1)
convert PPM to PNG or JPEG ?Riella
R
1

I'm adding another answer, which deals with the 'Update' to the original question saying:

"My primary objective for wanting to do this is to check the DPI of all of the images included in the document, or, check to see if they're vector."

You can use Ghostscript to selectively remove (or, retain) text, pixel image and vector graphic areas on each page.

The key to this is to apply the new CLI parameters

  • -dFILTERIMAGE,
  • -dFILTERTEXT and/or
  • -dFILTERVECTOR

accordingly.
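As a rough sketch of how these switches combine, the helper below builds a Ghostscript command line that writes a new PDF with the chosen content types removed. It assumes a Ghostscript recent enough to know the three filter switches and an executable named `gs` on the PATH; the function name and the file names are placeholders for illustration.

```python
import subprocess

# Maps a content type to the Ghostscript switch that removes it.
FILTER_FLAGS = {
    "images": "-dFILTERIMAGE",
    "text": "-dFILTERTEXT",
    "vectors": "-dFILTERVECTOR",
}


def gs_filter_command(src: str, dst: str, remove: list[str]) -> list[str]:
    """Build a gs command that rewrites src to dst without the listed content types."""
    flags = [FILTER_FLAGS[kind] for kind in remove]
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", *flags, src]


# Keep only vector graphics by removing text and images.
cmd = gs_filter_command("input.pdf", "vectors-only.pdf", ["text", "images"])
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

If the resulting page is visually empty, the page held no vector graphics; removing only "images" and comparing with the original shows what was stored as raster data.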

The details of this method are described here; the answer contains screenshots visualizing the results:

How can I remove all images from a PDF?

Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.


Rashidarashidi answered 27/8, 2019 at 18:35 Comment(0)
E
1

For those who still wonder, pdfimages -all is the modern solution:

-all: Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files. This is equivalent to specifying the options -png -tiff -j -jp2 -jbig2 -ccitt.
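For scripting this, a small wrapper along the following lines may help: it runs the extraction and then counts the resulting files per extension, which quickly shows how the images were stored inside the PDF. This is a sketch, assuming a Poppler build of pdfimages that supports -all and the image-root-nnn.xxx naming quoted in another answer; the function names are made up for this example.

```python
import subprocess
from collections import Counter
from pathlib import Path


def extract_all_command(pdf_path: str, image_root: str) -> list[str]:
    """Build the pdfimages call that keeps each image in its stored format."""
    return ["pdfimages", "-all", pdf_path, image_root]


def extract_all(pdf_path: str, image_root: str) -> None:
    """Run the extraction (needs Poppler's pdfimages on PATH)."""
    subprocess.run(extract_all_command(pdf_path, image_root), check=True)


def count_by_extension(image_root: str) -> Counter:
    """Count the extracted files named <image_root>-nnn.xxx by their extension."""
    root = Path(image_root)
    matches = root.parent.glob(f"{root.name}-*")
    return Counter(p.suffix for p in matches if p.is_file())


# Example with the paths from the question:
# extract_all("bar.pdf", "/tmp/image")
# print(count_by_extension("/tmp/image"))
```

Keep in mind that the extensions reflect the encoding used inside the PDF, not necessarily the format of the file that was originally inserted.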

Eugenioeugenius answered 28/3, 2021 at 10:32 Comment(0)
A
0

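To work out the DPI, you would need the image XObject (which contains the original image width and height in pixels) and the dimensions at which the image is actually displayed on the page; dividing the pixel size by the displayed size in inches gives the resolution.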

Aromaticity answered 26/1, 2013 at 17:29 Comment(0)
