PDF: extracted images are sliced / tiled

Asked 19/1, 2015 at 11:13 Answered 19/1, 2015 at 15:24

Image extraction with pdfimages and mupdf/mutool works fine so far.

Images in PDFs produced with FreePDF are always sliced, so one image results in multiple image files.

Is there a trick to avoid this? How can I use the results of pdfshow? Are there coordinates to know the position and height and width to cut/crop the image after converting the PDF to a PNG or JPEG?

Deck answered 19/1, 2015 at 11:13 Comment(5)

Can you post a (link to a) sample PDF which produces "sliced" images? – Ruffi 19/1, 2015 at 14:53

Unfortunately I have no influence on how the PDF is produced. FreePDF is only an example. With most PDFs I am able to extract the images in 'one piece'. My problem is how to handle the sliced images and how to get a 'full image' from them. My idea was to convert the PDF to a PNG and crop the image from this file. I thought pdfshow could contain information on the position, width and height of the image, but I am not able to interpret the output in this way. This is a sample file: dropbox.com/s/dlavjithk2o9r9i/test1.pdf?dl=0 – Deck 20/1, 2015 at 9:8

Hi Kurt, thanks for the detailed explanations. I will work through this at the weekend. It seems to be a weird problem to find an automated solution for. Too bad that one cannot determine the position coordinates. Thanks again for your efforts, Juergen – Deck 21/1, 2015 at 10:5

Hi Kurt, your explanations lead to the core problem. However I didn't find a solution for now. Nitro pdf and pdflib merge the image correctly. I assume there is a stitching routine working in the background. Again thanks for your time! Juergen – Deck 26/1, 2015 at 17:54

Ah -- I knew that PDFlib could possibly do that. But about two years ago I tested NitroPDF with a similar problem, and it wasn't able to do it then. Good to know that it improved so much meanwhile! – Ruffi 26/1, 2015 at 18:0

The most likely reason why your images are "sliced" after extracting them is this: they were "sliced" already before extracting them -- as their way of living within the PDF file itself.

Don't ask me why some PDF generating software does this.

MS Powerpoint is infamous for this -- background images showing some gradient often get sliced up into tens of thousands of 1x1, 1x2 or 1x8 pixels and similarly-sized mini images inside the PDF.

Update

1. Identify the scope of the problem

The image fragments of the sample PDF can be identified with the pdfimages -list command (this requires a recent version of pdfimages based on the Poppler fork, not the xpdf one!):

pdfimages -list so-28023312-test1.pdf

page   num  type   width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
------------------------------------------------------------------------------------------
   1     0 image     271   271  rgb     3   8  jpeg   no       18 0   163   163 26.7K  12%
   1     1 image     271   271  rgb     3   8  jpeg   no       19 0   163   163 21.7K  10%
   1     2 image     271   271  rgb     3   8  jpeg   no       30 0   163   163 22.9K  11%
   1     3 image     271   271  rgb     3   8  jpeg   no       31 0   163   163 21.8K  10%
   1     4 image     132   271  rgb     3   8  jpeg   no       32 0   162   163 9895B 9.2%
   1     5 image     271   271  rgb     3   8  jpeg   no       33 0   163   163 22.5K  10%
   1     6 image     271   271  rgb     3   8  jpeg   no       34 0   163   163 16.5K 7.7%
   1     7 image     271   271  rgb     3   8  jpeg   no       35 0   163   163 16.9K 7.9%
   1     8 image     271   271  rgb     3   8  jpeg   no       36 0   163   163 20.3K 9.4%
   1     9 image     132   271  rgb     3   8  jpeg   no       37 0   162   163 14.5K  14%
   1    10 image     271   271  rgb     3   8  jpeg   no       20 0   163   163 17.1K 8.0%
   1    11 image     271   271  rgb     3   8  image  no       21 0   163   163  107K  50%
   1    12 image     271   271  rgb     3   8  image  no       22 0   163   163 96.7K  45%
   1    13 image     271   271  rgb     3   8  image  no       23 0   163   163  119K  56%
   1    14 image     132   271  rgb     3   8  jpeg   no       24 0   162   163 10.7K  10%
   1    15 image     271    99  rgb     3   8  jpeg   no       25 0   163   161 7789B 9.7%
   1    16 image     271    99  rgb     3   8  jpeg   no       26 0   163   161 6456B 8.0%
   1    17 image     271    99  rgb     3   8  jpeg   no       27 0   163   161 7202B 8.9%
   1    18 image     271    99  rgb     3   8  jpeg   no       28 0   163   161 8241B  10%
   1    19 image     132    99  rgb     3   8  jpeg   no       29 0   162   161 5905B  15%
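
The listing can also be checked programmatically. Rows whose enc column reads image rather than jpeg are stored without JPEG compression, so pdfimages -j writes them as PPM instead of JPEG. The following is only a minimal sketch: it assumes the whitespace-separated 16-field row layout shown above, and the function name non_jpeg_fragments is made up for illustration.

```python
# Sketch: find the fragments that are not JPEG-encoded inside the PDF.
# Assumes the row layout of Poppler's `pdfimages -list` shown above
# (16 whitespace-separated fields per data row; `enc` is the 9th field).

def non_jpeg_fragments(listing):
    """Return the `num` values of all rows whose `enc` column is not 'jpeg'."""
    result = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 fields and start with a page number;
        # the header and the dashed ruler line do not.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        num, enc = int(fields[1]), fields[8]
        if enc != "jpeg":
            result.append(num)
    return result


# A few rows copied from the listing above.
sample = """\
page   num  type   width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
------------------------------------------------------------------------------------------
   1    10 image     271   271  rgb     3   8  jpeg   no       20 0   163   163 17.1K 8.0%
   1    11 image     271   271  rgb     3   8  image  no       21 0   163   163  107K  50%
   1    12 image     271   271  rgb     3   8  image  no       22 0   163   163 96.7K  45%
"""
print(non_jpeg_fragments(sample))
```

Feeding it the complete listing instead of the three sample rows would tell you in advance which fragments need a separate conversion step.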

Because there are only 20 different fragments on 1 page, it is easy to...

...first extract them all and convert them to JPEGs, and
...then stitch them together again.

2. Extract the fragments as JPEGs

The following command extracts the fragments and tries to save them as JPEGs (-j), using 28023312 as the filename prefix:

pdfimages -j so-28023312-test1.pdf 28023312

Three of the images come out as PPM (fragments 11, 12 and 13, whose enc column in the -list output above reads image rather than jpeg). Use ImageMagick's convert to make JPEGs from them (not strictly required, but it simplifies the 'stitching' command line):

for i in 11 12 13; do
  convert 28023312-0${i}.ppm 28023312-0${i}.jpg
done

Here are the first three fragments, 28023312-000.jpg, 28023312-001.jpg and 28023312-002.jpg:

3. Stitch the 20 fragments together again

ImageMagick can stitch the 20 images together again. When looking at the PDF page as well as at the 20 JPEGs it is easy to determine the order they need to be put together:

convert                                         \
   \( 28023312-0{00,01,02,03,04}.jpg +append \) \
   \( 28023312-0{05,06,07,08,09}.jpg +append \) \
   \( 28023312-0{10,11,12,13,14}.jpg +append \) \
   \( 28023312-0{15,16,17,18,19}.jpg +append \) \
 -append                                        \
  complete.jpg

Dissecting the command:

The +append image operator appends all listed images in a horizontal order.
The \( ... \) lines indicate an 'aside' processing of the respective part of the image stack (which needs to be delimited by the escaped parentheses). The result of this horizontal appending operation then replaces the individual fragments inside the current image stack.
The final -append image operator appends the current images vertically.
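
For a different grid size, the same command line can be generated rather than typed. This is a rough sketch under two assumptions: the fragments are numbered row by row (as they are in this sample), and they have all been converted to .jpg already. The function name build_stitch_command and its parameters are invented for illustration.

```python
# Sketch: build the ImageMagick argument list for a rows x cols grid of fragments.
# Assumes row-major numbering (000, 001, ...) and that every fragment is a .jpg.

def build_stitch_command(prefix, rows, cols, output="complete.jpg"):
    """Return an argv list: each row is joined with +append inside
    parentheses, then all rows are stacked with -append."""
    argv = ["convert"]
    for r in range(rows):
        argv.append("(")  # no backslash needed when no shell is involved
        for c in range(cols):
            argv.append(f"{prefix}-{r * cols + c:03d}.jpg")
        argv += ["+append", ")"]
    argv += ["-append", output]
    return argv


cmd = build_stitch_command("28023312", rows=4, cols=5)
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) as is. Because no shell is involved, the parentheses are passed unescaped.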

Here is the resulting JPEG, fully stitched together again:

Stitched together: final image

Could this be automated?

In theory we could automate this process. For this we would have to analyse the PDF source code. However, that is rather difficult, because the content stream may be compressed.

In order to uncompress all or most of the content streams and to get a nicer representation of the PDF file structure, we could use mutool clean -d, podofouncompress or qpdf --qdf.

I prefer qpdf, the 'structural, content-preserving PDF file transformer'. Here is the command:

qpdf --qdf --object-streams=disable so-28023312-test1.pdf qdf.pdf

The resulting PDF file, qdf.pdf, is easier to analyse, because most (but not all) previously binary sections are now in ASCII. When you search for the occurrences of Do inside this file, you will see where images are inserted (however, I cannot give you a complete PDF analysing tutorial here, sorry...).

The following command prints all lines where Do occurs, plus the preceding line (-B 1):

grep -a -B 1 " Do" qdf.pdf

1002 0 0 1002 236 5776.67 cm
/Im0 Do
--
1001 0 0 1002 1237 5776.67 cm
/Im1 Do
--
120.12 0 0 120.24 268.44 693.2004 cm
/Im2 Do
--
[...skipping 15 other output segments...]
--
1002 0 0 369 3237 3406.67 cm
/Im18 Do
--
490 0 0 369 4238 3406.67 cm
/Im19 Do
--
1 0 0 1 204.9037018 508.5130005 cm
/Fm0 Do

All the /ImNN Do lines insert images (the /Fm0 Do line refers to a form object not an image).
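
The cm/Do pairs in this output can be collected with a few lines of code. The sketch below assumes the two-line pattern seen in this particular qdf.pdf (one cm line directly followed by one Do line). Other PDFs may lay out their content streams differently. Also note that the Im/Fm name prefixes are only a naming convention: whether an XObject really is an image is decided by its /Subtype entry, which this sketch does not look up.

```python
import re

# Matches an operand line "a b c d e f cm" directly followed by "/Name Do".
PLACEMENT = re.compile(
    r"^((?:-?\d*\.?\d+\s+){6})cm\s*\n/(\w+)\s+Do\s*$", re.MULTILINE
)


def placements(text):
    """Return (name, [a, b, c, d, e, f]) for every cm/Do pair found in text."""
    found = []
    for match in PLACEMENT.finditer(text):
        numbers = [float(n) for n in match.group(1).split()]
        found.append((match.group(2), numbers))
    return found


# Segments copied from the grep output above.
sample = """\
1002 0 0 1002 236 5776.67 cm
/Im0 Do
--
490 0 0 369 4238 3406.67 cm
/Im19 Do
--
1 0 0 1 204.9037018 508.5130005 cm
/Fm0 Do
"""
for name, matrix in placements(sample):
    # Guessing the kind from the name prefix is a heuristic for this file only.
    kind = "image" if name.startswith("Im") else "other XObject"
    print(name, kind, matrix)
```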

The preceding lines, for example 490 0 0 369 4238 3406.67 cm set up the current transformation matrix. From this line alone, one can sometimes conclude the position of the image and its size. In the case of this file, it is not enough -- the contents of more preceding lines would be required in order to determine the current 'drawing position'.
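
To illustrate why one cm line is not enough: every cm operator is multiplied onto the transformation matrix that is already in effect, and an image is always painted into the unit square of the resulting coordinate system. The sketch below shows that concatenation. The outer matrix in it is purely an assumption chosen for illustration (a uniform scale of 0.12); it is not read from the sample file. The function names are made up as well.

```python
# Sketch: how nested `cm` operators combine.
# A PDF matrix [a b c d e f] stands for the 3x3 matrix
#   | a b 0 |
#   | c d 0 |
#   | e f 1 |

def concat(m, ctm):
    """Return m x ctm, i.e. the CTM after `m cm` is applied on top of `ctm`."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def image_box(ctm):
    """Lower-left corner and size of the unit image square.
    Only valid for an unrotated, unskewed CTM."""
    a, b, c, d, e, f = ctm
    if b != 0 or c != 0:
        raise ValueError("rotated or skewed placement: transform all four corners instead")
    return {"x": e, "y": f, "width": a, "height": d}


# ASSUMPTION for illustration only: an earlier `0.12 0 0 0.12 0 0 cm`.
outer = (0.12, 0, 0, 0.12, 0, 0)
# The matrix in front of `/Im0 Do` in the grep output above.
inner = (1002, 0, 0, 1002, 236, 5776.67)

print(image_box(concat(inner, outer)))
```

A real implementation would also have to track the q and Q operators, which save and restore the matrix, and the page's own setup. That is the information the grep output alone does not show.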

Ruffi answered 19/1, 2015 at 14:56 Comment(2)

The original reason for this as I heard it years ago from someone from Adobe was to support images with features not supported in the (then less complete) PDF format. And he referred mainly to transparency. I have a feeling that's a bit limited in terms of explanation but for what it's worth :) – Demp 19/1, 2015 at 15:8

The bad news about calling pdfimages.exe with -j is that in rare cases it inverts the colors of black & white images. I therefore use it without -j, and then convert the output ppm images to bmp format via avconv.exe. As a side note, this post is really thorough, thank you for the effort. – Syllabary 30/10, 2017 at 1:59

FreePDF uses Ghostscript and creates a 'virtual printer'. When you 'print to PDF' what actually happens is that your application prints to the Windows print pipeline, which sends the graphics primitives to the Windows PostScript printer driver, which sends the PostScript to the Port Monitor. The FreePDF Port Monitor stores this PostScript program on disk. When the output is complete, it starts up Ghostscript which interprets the PostScript and produces a PDF file.

Now, unless you are using a spectacularly old version of Ghostscript (which is possible, you should check!) this will take whatever was in the input and put it in the output. It won't slice up images.

Which means, as Kurt and David said above, that the real reason for the problem is that the PostScript program has sliced up images in it, before Ghostscript ever saw it.

Now I know that's not generally the case, but it depends heavily on what PostScript printer driver you have installed, how it's configured, what version of Windows you are using and what the application driving the printer is.

As David rightly says, Microsoft Office applications have a bad habit of drawing certain kinds of patterns this way (to get a 'translucent effect' they use a pattern where the cell is an imagemask, the 'white' pixels are transparent).

Also if you have large photographs (for example) and the PostScript printer is configured with minimal memory, the driver might split up the image in order not to exhaust the printer's memory. Obviously that's a configuration problem because on a desktop PC you would have to be using monster images to overwhelm Ghostscript.

So basically, we need a lot more information from you before we can answer this fully, but the principle is that the damage was done before it got to FreePDF. The version of Ghostscript used to create the PDF file will be in the PDF file metadata, unless FreePDF chose to erase/overwrite it.
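
A quick way to look at that metadata is to search the file for the /Producer entry of the document information dictionary. The following is only a rough sketch: it assumes the entry is stored uncompressed, unencrypted and as a simple literal string without nested or escaped parentheses, which is not guaranteed for every PDF. The function name and the version string in the sample are hypothetical.

```python
import re


def find_producer(pdf_bytes):
    """Return the first /Producer literal string found in the raw bytes, or None.
    Does not handle hex strings, escaped parentheses, compressed or encrypted data."""
    match = re.search(rb"/Producer\s*\(([^()\\]*)\)", pdf_bytes)
    return match.group(1).decode("latin-1") if match else None


# Hypothetical fragment of an Info dictionary (the version number is made up).
sample = b"<< /Producer (GPL Ghostscript 9.10) /CreationDate (D:20150119) >>"
print(find_producer(sample))
```

For a real file, pass its contents, for example find_producer(open("test1.pdf", "rb").read()). If that returns None, a PDF-aware tool is needed to read the metadata.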

Finally, as Kurt pointed out, you should post a link to the PDF file, and ideally the application file and intermediate PostScript file which was used to produce the PDF.

Noguchi answered 19/1, 2015 at 15:24 Comment(0)
