Is it possible to uncompress a PDF using Adobe Acrobat or Acrobat Distiller?

S

3

23

Most PDF files found on the Web have compressed, unreadable data streams. Is it possible to uncompress the internal content of a PDF file using Acrobat or Acrobat Distiller, so that the source can be read in a text editor?

P.S. This question is inspired by this answer, which explains how it can be done with Ghostscript.

Sexology answered 15/9, 2013 at 13:59 Comment(3)

What do you want to read in the editor? The operators used to draw something? Or also the text? – Jillion 16/9, 2013 at 4:45

@Jillion I want to read the operators used to draw vector figures. – Sexology 16/9, 2013 at 4:52

While I don't see how to do that using Acrobat (I only have version 9.5 at hand, though), it is fairly easy to do in a small Java or .NET program using iText or iTextSharp by reading a PDF and re-saving it without compression, cf. the method decompressPdf in HelloWorldCompression.java / HelloWorldCompression.cs. – Jillion 16/9, 2013 at 8:31

E

7

This is easy with qpdf and pdftk.

With Adobe Acrobat you can get at the internal structure after profiling a PDF (preflight with some profile (e.g. detect PDF syntax errors), then Options->Internal PDF structure) - but there's no way to get something editable with a text editor.

Exodontist answered 15/9, 2013 at 21:5 Comment(6)

I need to convert a PDF into something readable with a text editor. Is it possible with Acrobat? – Sexology 16/9, 2013 at 4:41

@AlexeyPopkov: You can export into e.g. XML. But editable: no. – Samp 16/9, 2013 at 6:27

Exporting to XML gives a result similar to exporting to TXT: only textual elements are included. I need to read the operators used to draw vector figures in the PDF. – Sexology 16/9, 2013 at 6:40

+1 Thanks for Options->Internal PDF structure in Preflight. It would be ideal to copy its content to a text editor for further investigation. BTW, there is no need for profiling to see Internal PDF structure: it works from the start (at least in Acrobat 11). – Sexology 16/9, 2013 at 10:42

@AlexeyPopkov: "I need to read the operators used to draw vector figures in the PDF". In that case look for uncompressed /Contents objects and their streams. Inside the expanded streams, also look for /name Do operations -- these may point to XObjects named /name that contain vector elements (they can also point to raster image objects). – Apc 7/5, 2015 at 18:46
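As a rough illustration of the comment above, the following Python sketch lists the names used by `/name Do` operations in an already uncompressed content stream. The function name `xobject_names` and the sample stream are hypothetical, and the regular expression is a simplification that ignores string literals and inline images.

```python
import re

# A PDF name followed by the Do operator, e.g. "/Fm0 Do".
# Simplified: delimiters inside string literals are not handled.
DO_RE = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def xobject_names(content: bytes) -> list[str]:
    """Return the names referenced by Do operators, in order of appearance."""
    return [name.decode("latin-1") for name in DO_RE.findall(content)]


# Hypothetical uncompressed page content: one form and one image placement.
sample = (
    b"q 1 0 0 1 72 720 cm /Fm0 Do Q\n"
    b"q 200 0 0 100 72 500 cm /Im1 Do Q\n"
)

print(xobject_names(sample))
```

Each name found this way still has to be looked up in the page's /Resources /XObject dictionary to see whether it is a form (vector content) or an image.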

For a given *.pdf file, Acrobat Pro DC provides an Export To function that offers a variety of alternative formats, one of which is PostScript. PostScript is the only likely option that would provide the operators. However, I haven't used PostScript, except as a stand-alone language, since shortly after it was first invented. A quick glance at the output for one page shows that the export produces ASCII-readable PostScript. If one can simulate or interpret operators such as "pop", "{get exec}bdf", etc., this might be as close as you get to the code generating vector or raster graphics. – Thingumajig 15/2, 2019 at 14:49

A

27

qpdf and pdftk have already been mentioned. To show the commands:

$ qpdf --qdf --object-streams=disable orig.pdf uncompressed-orig.pdf
$ pdftk orig.pdf output uncompressed-orig.pdf uncompress

mutool, however, hasn't been mentioned yet:

$ mutool clean -d -a orig.pdf uncompressed-orig.pdf

mutool is a command line tool which ships alongside the lightweight MuPDF PDF + document viewer.

I do not think you can achieve the uncompressing of PDF objects' streams with Acrobat or Distiller (unless you have additional payware plugins available).
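To illustrate what these tools undo, here is a minimal Python sketch (standard library only) that inflates Flate-compressed stream bodies found in raw PDF bytes. It is not a substitute for qpdf, pdftk or mutool: it assumes the simple "stream ... endstream" layout, ignores /Length, other filters, predictors and encryption, and the name `inflate_streams` and the sample fragment are made up for the example.

```python
import re
import zlib

# Body between the "stream" and "endstream" keywords (simplified layout).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the inflated body of every stream that zlib can decode."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (already plain text, a JPEG image, ...): skip.
            continue
    return results


# Hypothetical object holding a compressed three-operator drawing.
content = b"0 0 m 100 100 l S"
compressed = zlib.compress(content)
fragment = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

print(inflate_streams(fragment))
```

The dedicated tools additionally rewrite /Length, remove the /Filter entry and rebuild the cross-reference table, which is why they produce a valid PDF rather than loose fragments.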

Apc answered 7/5, 2015 at 16:32 Comment(6)

Are you sure that for qpdf the option --object-streams=disable is a good choice? According to the documentation this option means "don't write any object streams." Will not the streams be erased as a result? – Sexology 7/5, 2015 at 16:56

@AlexeyPopkov: Yes, I'm pretty sure it is a good choice for the purpose. I'm using it daily. If object streams are enabled, a lot of the smaller objects will be embedded into another object's stream, which makes the file more complex to analyse, even if uncompressed. If you don't believe me, try it yourself. (You need an input file that has at least one object of /Type /ObjStm.) Disabling object streams will unpack all these streamed objects and put them properly into their own indirect objects again, individually. – Apc 7/5, 2015 at 17:8
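A quick way to find out whether a file is a suitable test input is to scan its bytes for the /Type /ObjStm marker. The Python sketch below does that under the assumption that the file is not encrypted in a way that hides dictionaries; the function name `count_object_streams` and the two sample fragments are hypothetical.

```python
import re

# Dictionary entry that marks an object stream.
OBJSTM_RE = re.compile(rb"/Type\s*/ObjStm\b")


def count_object_streams(pdf_bytes: bytes) -> int:
    """Count dictionaries that declare themselves as object streams."""
    return len(OBJSTM_RE.findall(pdf_bytes))


# Hypothetical fragments: one with an object stream, one without.
with_objstm = b"7 0 obj\n<< /Type /ObjStm /N 3 /First 16 /Length 80 >>\nstream\n"
without_objstm = b"7 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"

print(count_object_streams(with_objstm), count_object_streams(without_objstm))
```

For a real file, the bytes could be read with `pathlib.Path("orig.pdf").read_bytes()` and passed to the same function; a count of zero after running the qpdf command above would be consistent with the objects having been unpacked.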

Do you mean that for qpdf the seemingly obvious choice --stream-data=uncompress will change the structure of the file and complicate it? – Sexology 7/5, 2015 at 17:15

@AlexeyPopkov: The --qdf mode already implies --stream-data=uncompress. And yes, using QPDF does change the structure of the file in some way. But it tries to do so in a content-preserving way. The self-description of QPDF even says so, stating that it is a "CLI tool that does structural, content-preserving transformations on PDF files". (The cases in which the contents change in an unwanted and unexpected way are a different matter. I've filed a few bug reports/enhancement requests about these: for example, OCGs ("layers") get flattened and incremental update history gets lost.) – Apc 7/5, 2015 at 17:36

From the QPDF documentation it looks as though the --qdf mode creates a very special form of PDF file that is meant to be editable, which is not something the developers of PDF intended, and for this reason the --qdf mode can be expected to corrupt the original file in some way. I appreciate this effort, but I'm still unsure whether the --qdf mode gives any benefit for the readability of the PDF code (in this thread I'm not interested in editability). – Sexology 7/5, 2015 at 18:6

@AlexeyPopkov: It's good you read the documentation before starting to use QPDF; I did the same, back in the day. Feel free to do whatever you want. I'm just sharing my knowledge and experience here. I hope you'll do the same once you have learned and know more (or other) things about PDFs and related tools than I do. Whatever tool you finally decide on to give you readable PDF code, you have to compare each of them against the others first. I really hope you'll put up a writeup somewhere on the 'Net describing and weighing the advantages as well as the disadvantages of each tool. I'd be your first reader! – Apc 7/5, 2015 at 18:44

G

18

Use cpdf:

cpdf -decompress in.pdf -o out.pdf

and then the graphic operators for each page can be read in a text editor. You'll need a copy of the standard as a reference, though.
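As a small aid for reading such a decompressed page stream, the Python sketch below names a handful of common path operators and groups each with its operands. The operator table is a partial selection, and the whitespace tokenizer is a deliberate simplification that does not handle strings, arrays or inline images; the function name `describe_path` and the sample stream are hypothetical.

```python
# Partial table of path construction and painting operators.
PATH_OPERATORS = {
    "m": "moveto",
    "l": "lineto",
    "c": "curveto",
    "re": "rectangle",
    "h": "closepath",
    "S": "stroke",
    "f": "fill",
}


def describe_path(content: str) -> list[tuple[str, list[str]]]:
    """Pair each known operator with the operands that precede it."""
    described = []
    operands: list[str] = []
    for token in content.split():
        if token in PATH_OPERATORS:
            described.append((PATH_OPERATORS[token], operands))
            operands = []
        else:
            # Numbers, and any operator missing from the table, land here.
            operands.append(token)
    return described


# Hypothetical triangle taken from an uncompressed content stream.
sample = "10 10 m 100 10 l 100 100 l h S"

for name, args in describe_path(sample):
    print(name, args)
```

PDF uses postfix notation, so the operands always come before the operator, which is why the loop collects tokens until it meets a known operator.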

Disclosure: I am the author of cpdf.

Goldarned answered 16/9, 2013 at 10:34 Comment(0)

