The most likely reason why your images are "sliced" after extracting them is this: they were "sliced" already before extracting them -- as their way of living within the PDF file itself.
Don't ask me why some PDF generating software does this.
MS Powerpoint is infamous for this -- background images showing some gradient often get sliced up into tens of thousands of 1x1
, 1x2
or 1x8
pixels and similarly-sized mini images inside the PDF.
Update
1. Identify the scope of the problem
The image fragments of the sample PDF can be identified with the pdfimages -list
command (this requires a recent version of pdfimages
based on the Poppler fork, not the xpdf
one!):
pdfimages -list so-28023312-test1.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
------------------------------------------------------------------------------------------
1 0 image 271 271 rgb 3 8 jpeg no 18 0 163 163 26.7K 12%
1 1 image 271 271 rgb 3 8 jpeg no 19 0 163 163 21.7K 10%
1 2 image 271 271 rgb 3 8 jpeg no 30 0 163 163 22.9K 11%
1 3 image 271 271 rgb 3 8 jpeg no 31 0 163 163 21.8K 10%
1 4 image 132 271 rgb 3 8 jpeg no 32 0 162 163 9895B 9.2%
1 5 image 271 271 rgb 3 8 jpeg no 33 0 163 163 22.5K 10%
1 6 image 271 271 rgb 3 8 jpeg no 34 0 163 163 16.5K 7.7%
1 7 image 271 271 rgb 3 8 jpeg no 35 0 163 163 16.9K 7.9%
1 8 image 271 271 rgb 3 8 jpeg no 36 0 163 163 20.3K 9.4%
1 9 image 132 271 rgb 3 8 jpeg no 37 0 162 163 14.5K 14%
1 10 image 271 271 rgb 3 8 jpeg no 20 0 163 163 17.1K 8.0%
1 11 image 271 271 rgb 3 8 image no 21 0 163 163 107K 50%
1 12 image 271 271 rgb 3 8 image no 22 0 163 163 96.7K 45%
1 13 image 271 271 rgb 3 8 image no 23 0 163 163 119K 56%
1 14 image 132 271 rgb 3 8 jpeg no 24 0 162 163 10.7K 10%
1 15 image 271 99 rgb 3 8 jpeg no 25 0 163 161 7789B 9.7%
1 16 image 271 99 rgb 3 8 jpeg no 26 0 163 161 6456B 8.0%
1 17 image 271 99 rgb 3 8 jpeg no 27 0 163 161 7202B 8.9%
1 18 image 271 99 rgb 3 8 jpeg no 28 0 163 161 8241B 10%
1 19 image 132 99 rgb 3 8 jpeg no 29 0 162 161 5905B 15%
Because there are only 20 different fragments on 1 page, it is easy to...
- ...first extract them all and convert them to JPEGs, and
- ...then stitch them together again.
2. Extract the fragments as JPEGs
The following command will extract the fragments and try to save them as JPEGs (-j
) 28023312:
pdfimages so-28023312-test1.pdf 28023312
There are 3 images which came out as PPM. Use ImageMagick's convert
to make JPEGs from them (not strictly required, but it simplifies the 'stitching' command line:
for i in 11 12 13; do
convert 28023312-0${i}.ppm 28023312-0${i}.jpg
done
Here are the first three fragments, 280233312-000.jpg, 280233312-001.jpg and 280233312-002.jpg:
3. Stitch the 20 fragments together again
ImageMagick can stitch the 20 images together again. When looking at the PDF page as well as at the 20 JPEGs it is easy to determine the order they need to be put together:
convert \
\( 28023312-0{00,01,02,03,04}.jpg +append \) \
\( 28023312-0{05,06,07,08,09}.jpg +append \) \
\( 28023312-0{10,11,12,13,14}.jpg +append \) \
\( 28023312-0{15,16,17,18,19}.jpg +append \) \
-append \
complete.jpg
Dissecting the command:
The +append
image operator appends all listed images in a horizontal order.
The \( ... \)
lines indicate an 'aside' processing of the resprective part of the image stack (which needs to be separated by the escaped parentheses). The result of this horizontal appending operation will then replace the individual fragments inside the current image stack.
The final -append
image operator appends the current images vertically.
Here is the resulting JPEG, fully stitched together again:
Could this be automated?
In theory we could automate this process. For this we would have to analyse the PDF source code. However, that is rather difficult, because the content stream may be compressed.
In order to uncompress all or most of the content streams and to get a nicer representation of the PDF file structure, we could use mutool clean -d
, podofouncompress
or qpdf --qdf
.
I prefer qpdf, the 'structural, content-preserving PDF file transformer'. Here is the command:
qpdf --qdf --object-streams=disable so-28023312-test1.pdf qdf.pdf
The resulting PDF file, qdf.pdf
is more easy to analyse, because most (but not all) previously binary sections are now in ASCII. When you search for the occurrences of Do
inside this file, you will see where images are inserted (however, I cannot give you a complete PDF analysing tutorial here, sorry...).
The following command prints all lines where Do
occurs, plus the preceding line (-B 1
):
grep -a -B 1 " Do" qdf.pdf
1002 0 0 1002 236 5776.67 cm
/Im0 Do
--
1001 0 0 1002 1237 5776.67 cm
/Im1 Do
--
120.12 0 0 120.24 268.44 693.2004 cm
/Im2 Do
--
[...skipping 15 other output segments...]
--
1002 0 0 369 3237 3406.67 cm
/Im18 Do
--
490 0 0 369 4238 3406.67 cm
/Im19 Do
--
1 0 0 1 204.9037018 508.5130005 cm
/Fm0 Do
All the /ImNN Do
lines insert images (the /Fm0 Do
line refers to a form object not an image).
The preceding lines, for example 490 0 0 369 4238 3406.67 cm
set up the current transformation matrix. From this line alone, one can sometimes conclude the position of the image and its size. In the case of this file, it is not enough -- the contents of more preceding lines would be required in order to determine the current 'drawing position'.