pdftk will not decompress data streams
I have been trying to use pdftk to inspect the contents of compressed PDF streams created by Nitro Reader, but pdftk will not decompress (inflate) the streams. It produces no errors, but it does not seem to do anything beyond reordering the PDF objects. Here is a minimal example of one of these PDFs.

    pdftk test.pdf output test-d.pdf uncompress

When I try pdftk on other pdfs, it seems to work fine. If I manually extract the data streams and decompress them using zlib in Python, they decompress properly. Also, if I open the pdf in Adobe Reader and re-save, pdftk works fine on the resulting pdf.

I have manually inspected the Nitro pdf to the best of my ability, and it seems to be a valid pdf. I am very confused as to what is going on here.

As background to the problem, I have hundreds of these PDFs, and I am trying to search them for certain keywords, which I should be able to do if I can automate the decompression.
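The manual zlib approach mentioned above can be automated along these lines. This is only a rough sketch, not a real PDF parser: it assumes the streams use plain /FlateDecode, that the file is not encrypted, and that the keyword appears as literal bytes in the inflated data (text in content streams is often split up or encoded, so a miss proves nothing). The names `inflate_streams` and `contains_keyword` are made up for illustration.

```python
import re
import zlib

# Capture the raw bytes between the "stream" and "endstream" keywords.
# Assumption: "endstream" never occurs inside the compressed data itself.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the inflated bytes of every stream that zlib can decode."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that usually
            # trail the compressed data before "endstream".
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data (for example an image with another filter).
            continue


def contains_keyword(pdf_bytes, keyword):
    """Return True if any inflated stream contains the keyword bytes."""
    return any(keyword in data for data in inflate_streams(pdf_bytes))
```

To use it on a file, read the file in binary mode and pass the bytes, e.g. `contains_keyword(open("test.pdf", "rb").read(), b"keyword")`.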

pdftk version 1.45
Windows 7 Home Premium SP1
Nitro Reader 2 version 2.5.0.36

Thanks, James

Godforsaken answered 25/2, 2013 at 0:3 Comment(0)

I received an answer to this question from the developer. It turned out to be a bug in the way pdftk handled a /DecodeParms [null] line.

If the decode parameters are null, a writer could just omit the /DecodeParms line, but a compliant reader should understand it either way. I tried out the new version of pdftk and the problem seems to be solved.
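For anyone stuck on an older pdftk, one possible workaround (my own assumption, not something the developer suggested) is to blank out the `/DecodeParms [null]` entry before running pdftk. The sketch below overwrites the entry with spaces of the same length so that the byte offsets in the cross-reference table stay valid. It assumes the entry appears as plain text in the file and that the same byte sequence does not occur inside binary stream data; `blank_null_decode_parms` is a hypothetical helper name.

```python
import re

# Matches "/DecodeParms [null]" with optional whitespace between tokens.
NULL_PARMS_RE = re.compile(rb"/DecodeParms\s*\[\s*null\s*\]")


def blank_null_decode_parms(pdf_bytes):
    """Replace each null /DecodeParms entry with spaces of equal length.

    Keeping the length unchanged means object offsets recorded in the
    xref table still point at the right places.
    """
    return NULL_PARMS_RE.sub(lambda m: b" " * len(m.group(0)), pdf_bytes)
```

Write the result to a new file rather than overwriting the original, then try `pdftk ... uncompress` on the copy.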

Godforsaken answered 26/8, 2013 at 6:40 Comment(0)

If you are not attached to pdftk, you can use qpdf. For instance, you could use:

    qpdf --stream-data=uncompress input.pdf output.pdf

For what it is worth, if a stream contains binary blobs (embedded images, for example), those may still show up as binary data, although the rest of the stream will be uncompressed (with either pdftk or qpdf). qpdf allows you to uncompress all or only the streams.

From the qpdf manual:

When --stream-data=uncompress is specified, qpdf will attempt to remove any non-lossy filters that it supports. This includes /FlateDecode, /LZWDecode, /ASCII85Decode, and /ASCIIHexDecode. This can be very useful for inspecting the contents of various streams.

The same could happen with pdftk.
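Since the question involves hundreds of files, the qpdf command above could be run over a whole folder with a small script. This is a minimal sketch that assumes `qpdf` is on the PATH; the function names and the folder layout are made up for illustration, and only the `--stream-data=uncompress` option shown above is used.

```python
import subprocess
from pathlib import Path


def qpdf_uncompress_command(src, dst):
    """Build the argument list for the qpdf call shown above."""
    return ["qpdf", "--stream-data=uncompress", str(src), str(dst)]


def uncompress_folder(src_dir, dst_dir):
    """Run qpdf on every *.pdf in src_dir, writing results to dst_dir.

    Returns the names of files for which qpdf exited with a nonzero
    status, so they can be checked by hand.
    """
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    flagged = []
    for src in sorted(Path(src_dir).glob("*.pdf")):
        result = subprocess.run(qpdf_uncompress_command(src, dst_dir / src.name))
        if result.returncode != 0:
            flagged.append(src.name)
    return flagged
```

For example, `uncompress_folder("nitro-pdfs", "uncompressed")` would leave the originals alone and put the uncompressed copies in a separate folder.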

Gretagretal answered 22/3, 2013 at 22:59 Comment(1)
Is there a way to do this with gs (GhostScript)?Disorient
