How to decode a PDF stream?

Asked 17/1, 2015 at 9:11 Answered 22/2, 2023 at 22:4

Solved pdf adobe reverse-engineering malware exploit

I want to analyze a stream object in a PDF file which is encoded using /FlateDecode.

Are there any tools which can decode the encodings used in PDFs (ASCII85Decode, LZWDecode, RunLengthDecode, etc.)?

The stream content is most likely a PE file structure, which the PDF probably will use later in the exploit.

Also, there are two xref tables in the PDF, which is fine, but there are also two %%EOF markers, one following each xref.

Is the presence of these all right? (Note: the second xref points to the first xref using the /Prev name.)

This xref refers to the second xref:

xref 
5 6
0000000618 00000 n
0000000658 00000 n
0000000701 00000 n
0000000798 00000 n
0000045112 00000 n
0000045219 00000 n
1 1
0000045753 00000 n
3 1
0000045838 00000 n
trailer
>
startxref
46090
%%EOF

The second xref:

xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000067 00000 n
0000000136 00000 n
0000000373 00000 n
trailer
>
startxref
429
%%EOF

Arondell answered 17/1, 2015 at 9:11 Comment(0)

"Two xref tables and two %%EOF"?

This alone is not an indication of a malicious PDF file. There can be two or even more instances of each if the file was generated via the "incremental update" feature. (Each digitally signed PDF file is like that, and so is each file which was changed in Acrobat and saved using the 'Save' button/menu instead of the 'Save as...' button/menu.)

"How to decode a compressed PDF stream from a specific object"?

Have a look at Didier Stevens' Python script pdf-parser.py. With this command line tool, you can dump the decoded stream of any PDF object into a file. Example command to dump the stream of PDF object number 13:
```
pdf-parser.py -o 13 -f -d obj13.dump my.pdf
```

Father answered 17/1, 2015 at 21:32 Comment(5)

"Each digitally signed PDF file is like that" - Not necessarily. Only if changes are added after signing without breaking the signature, an incremental update is strictly necessary. – Haynie 18/12, 2017 at 20:13

@mkl: can you show me an example of a signed PDF to which you added changes after signing, and where these changes do not break the signature? – Father 18/12, 2017 at 20:17

Easily, take for example PDFs with two valid integrated signatures. Adding another signature to an already signed document obviously is such a change after signing. E.g. see this SD DSS example file. – Haynie 18/12, 2017 at 22:43

pdf-parser.py worked for me. gist.github.com/averagesecurityguy/… is similar but did give me errors on some pdf files. you can compress the stream with zlib. – Caudle 23/2, 2019 at 19:44

This gives me "Unsupported Filter: [/FlateDecode /DCTDecode]" and "Unsupported Filter: ['/JBIG2Decode']" errors – Rickie 15/6, 2022 at 6:56

A %%EOF comment should be present at the end of the file; any other comments (any line beginning with %) may be present at any point in the file. So yes, two %%EOF comments are perfectly valid. This is documented in the PDF Reference: see example 3.11 on page 112 of the 1.7 PDF Reference Manual for an example with the structure you describe. It is a PDF file which has been incrementally updated.

Note that more recent versions of PDF can have cross reference streams, which are themselves compressed.
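A rough way to see how many revisions a file carries is to scan the raw bytes for startxref and %%EOF. This is only a sketch: the function name revision_markers and the sample bytes are made up for illustration (the offsets are borrowed from the question), and a plain byte scan can be fooled if the same strings happen to occur inside a stream.

```python
import re


def revision_markers(data: bytes):
    """Return (list of startxref offsets, number of %%EOF markers) in raw PDF bytes."""
    # Each revision normally ends with: startxref, a byte offset, %%EOF
    offsets = [int(m.group(1)) for m in re.finditer(rb"startxref\s+(\d+)", data)]
    eof_count = len(re.findall(rb"%%EOF", data))
    return offsets, eof_count


# Synthetic stand-in for a file with one incremental update (not a real PDF)
sample = (
    b"%PDF-1.4\n...original body...\n"
    b"xref\n0 5\ntrailer\n<< >>\nstartxref\n429\n%%EOF\n"
    b"...appended objects...\n"
    b"xref\n5 6\ntrailer\n<< /Prev 429 >>\nstartxref\n46090\n%%EOF\n"
)

offsets, eof_count = revision_markers(sample)
print(offsets, eof_count)  # [429, 46090] 2
```

On a real file you would read the bytes with open(path, "rb").read() and pass them in; the last offset is the one a reader starts from, and each /Prev entry leads back to the previous revision.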

The easiest way to decode a PDF file is to use a tool intended for it. For example, with MuPDF, "mutool clean -d <input pdf file> <output PDF file>" will decompress (-d) all the compressed streams in a PDF file and write the output to a new PDF file.

Otherwise you will need to use something like zlib for Flate decompression (zlib does not handle LZW), and you will need to write or find your own decoders for LZW, RunLength, ASCIIHex and ASCII85, I think. Not to mention JBIG2, JPEG and JPEG2000 if you want the images decoded too.
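If you do go the manual route, a minimal sketch in Python could look like the following. It assumes you have already cut the raw bytes between "stream" and "endstream" out of the file and read the /Filter array yourself; decode_stream and run_length_decode are names invented here, and LZW, the image filters and /DecodeParms predictors are not handled.

```python
import base64
import zlib


def run_length_decode(data: bytes) -> bytes:
    """Decode PDF RunLengthDecode data."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        if n == 128:  # end-of-data marker
            break
        if n < 128:  # copy the next n + 1 bytes literally
            out += data[i + 1 : i + 2 + n]
            i += n + 2
        else:  # repeat the next byte 257 - n times
            out += data[i + 1 : i + 2] * (257 - n)
            i += 2
    return bytes(out)


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode PDF ASCIIHexDecode data (whitespace ignored, '>' ends the data)."""
    digits = b"".join(data.split()).split(b">")[0]
    if len(digits) % 2:  # an odd final digit is treated as if followed by 0
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def decode_stream(raw: bytes, filters) -> bytes:
    """Apply the filters in the order they appear in the /Filter array."""
    for name in filters:
        if name == "/FlateDecode":
            raw = zlib.decompress(raw)
        elif name == "/ASCII85Decode":
            # PDF data ends with ~> and usually has no leading <~
            raw = base64.a85decode(raw.strip(), adobe=True)
        elif name == "/ASCIIHexDecode":
            raw = ascii_hex_decode(raw)
        elif name == "/RunLengthDecode":
            raw = run_length_decode(raw)
        else:
            raise ValueError("filter not handled in this sketch: " + name)
    return raw


# Round trip on made-up data: a fake "MZ" header, deflated and then ASCII85-encoded
payload = b"MZ" + b"\x00" * 20
encoded = base64.a85encode(zlib.compress(payload), adobe=True)
decoded = decode_stream(encoded, ["/ASCII85Decode", "/FlateDecode"])
print(decoded[:2])  # b'MZ'
```

Checking the first bytes of the decoded output for "MZ" is a quick way to test the suspicion that the stream holds a PE file. Truncated or deliberately malformed streams may make zlib.decompress raise an error, in which case the dedicated tools are more forgiving.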

Obscene answered 17/1, 2015 at 20:0 Comment(1)

Some rather stupid guys downvoted the OP question and also voted to close it. Please upvote the question to balance this out... – Father 17/1, 2015 at 21:34

You can use RUPS to analyze the PDF and either export the decoded stream or just look at it. As for %%EOF, you can have as many as the number of appends made to the PDF.

Shaniceshanie answered 17/1, 2015 at 20:6 Comment(2)

Thank you, Paulo, for answering – Arondell 17/1, 2015 at 20:7

Some rather stupid guys downvoted the OP question and also voted to close it. Please upvote the question to balance this out... – Father 17/1, 2015 at 21:33

With regard to tools, as stated in other answers, there are a number of tools that can be used to decompress streams (on the command line or otherwise). However, there are also a number of tools that make it easy to inspect a PDF file by allowing you to walk the object tree and see what is inside compressed streams. The two I've used are:

1) callas pdfToolbox Desktop (caution, I'm associated with this company). pdfToolbox has an "Explore PDF" option that allows you to see the objects associated with a page, up to and including the actual page operators.

2) Enfocus Browser. This tool will allow you to open the root of the object tree of a PDF file and then presents the object hierarchy in a way very similar to how the Finder on Mac presents file systems. Browser will even allow you to edit PDF files (you should really know what you're doing in this case) by editing the low-level objects, creating new objects or changing the content of streams. Really cool.

It was pointed out to me that Enfocus Browser is no longer available, contrary to what I said in the previous version of my answer, but it actually still is. You just need to create an Enfocus account in order to download it from here: https://www.enfocus.com/en/support/downloads/old-product-installers

Billie answered 17/1, 2015 at 22:53 Comment(0)

There is another scenario where you can have two %%EOF's where the document may not necessarily be incrementally updated.

According to Annex F of the official ISO 32000-1:2008 PDF (1.7) standard, which details the internals of a 'Linearized PDF', there are 2 %%EOFs in the file. The first occurs at the beginning, just after the Linearization Parameter Dictionary. That section is known as the 'First Page Cross-Reference Trailer'.

Quoting from the standard:

The first-page trailer shall contain valid Size and Root entries, as well as any other entries needed to display the document. The Size value shall be the combined number of entries in both the first-page cross-reference table and the main cross-reference table. The first-page trailer may optionally end with startxref, an integer, and %%EOF, just as in an ordinary trailer. This information shall be ignored.
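To tell this case apart from an incremental update, one quick heuristic is to look for the /Linearized key near the start of the file, since the linearization parameter dictionary has to sit within the first 1024 bytes. The helper name looks_linearized and the sample bytes below are made up for illustration, and this is only a hint, not a validation of the linearization data.

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: a linearized PDF has /Linearized in its first 1024 bytes."""
    return b"/Linearized" in data[:1024]


# Synthetic file headers (values are invented)
linearized_head = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
plain_head = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_linearized(linearized_head), looks_linearized(plain_head))  # True False
```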

Daye answered 17/5, 2020 at 6:26 Comment(0)

On Linux you can use mutool, which comes in the mupdf-tools package. Running:

mutool clean -d inputfile.pdf out.pdf

will create the file out.pdf with all streams decoded. mutool can also extract and decode individual streams with the show command, but I haven't used that.

Killough answered 22/2, 2023 at 22:4 Comment(0)
