How can I visually inspect the structure of a PDF to reverse engineer it? [closed]

Asked 23/8, 2010 at 16:22 Answered 17/12, 2018 at 13:16

147

How can I inspect the structure of PDF files?

Use case: I'm trying to programmatically generate PDF files (using iText). I'm having trouble achieving certain layouts, but I have PDF files with text laid out the way I want (generated from Word). I would like to reverse engineer how they do it.

PDF Inspector seems to be good, but I'm looking for something for Windows.

Nador answered 23/8, 2010 at 16:22 Comment(5)

PDF Inspector is Java based, so multiplatform. – Trimorphism 21/4, 2017 at 10:50

Doesn't seem to run on Windows though. The jar doesn't do anything when clicked on. When called at the command line I get no main manifest attribute, in PDF Document Inspector.jar – Menell 21/4, 2017 at 18:29

@Trimorphism it's java based but apple wrapped so it's kinda apple only distribution. There is "PDF Document Inspector.app/Contents/Resources/Java/PDF Document Inspector.jar" jar but it's not startable as java -jar "PDF Document Inspector.jar" Also there is lot of com.apple.cocoa.* includes that are platform specific. :( – Torin 13/11, 2019 at 11:3

I'm using now successfully iText Rups, multiplatform and Java based. – Trimorphism 13/11, 2019 at 11:12

Unfortunately I can't add an answer since the question is closed, but after much searching I finally found this tool: brendandahl.github.io/pdf.js.utils/browser (using pdf.js under the hood to inspect the structure of your pdf). I've had a lot of success reverse engineering pdfs with this page. – Yingyingkow 9/3, 2023 at 21:20

Adobe Acrobat has a very cool but rather well hidden mode allowing you to inspect PDF files. I wrote a blog article explaining it at https://blog.idrsolutions.com/2009/04/viewing-pdf-objects/

Cistern answered 24/8, 2010 at 6:41 Comment(7)

This seems to require a plugin; at least it's not available in Acrobat Reader 9.5.5 on Linux. – Schleswig 9/12, 2014 at 22:44

@AdamSpiers, preflight dialog box is a feature of Adobe Acrobat, not Adobe Reader – Tripetalous 26/3, 2015 at 13:20

... and Acrobat (formerly Acrobat Exchange) is not available for Linux :-/ – Schleswig 26/3, 2015 at 13:32

Preflight dialog box actually requires Adobe Acrobat Pro. It is not available in Adobe Acrobat Standard. – Hersch 26/6, 2018 at 20:35

And it is a UI nightmare to actually use. – Miscreant 7/1, 2020 at 22:40

Well we do not use Adobe Acrobat - so how to inspect the PDF without it? – Fechner 19/3, 2020 at 18:37

I know this is a very old thread, but I found an online PDF inspector, which allows you to browse the PDF structure in a way very similar to how Adobe does it. It is slightly less powerful than Adobe, but it's free and online, so might still be useful for somebody… – Consumption 11/3, 2021 at 10:23

140

Besides the GUI-based tools mentioned in the other answers, there are a few command line tools which can transform the original PDF source code into a different representation, which lets you inspect the (now modified) file with a text editor. All of the tools below work on Linux, Mac OS X, other Unix systems or Windows.

`qpdf` (my favorite)

Use qpdf to uncompress (most) objects' streams and also to dissect ObjStm objects into individual indirect objects:

qpdf --qdf --object-streams=disable orig.pdf uncompressed-qpdf.pdf

qpdf describes itself as a tool that does "structural, content-preserving transformations on PDF files".

Then just open + inspect the uncompressed-qpdf.pdf file in your favorite text editor. Most of the previously compressed (and hence, binary) bytes will now be plain text.
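To illustrate what "uncompressing" a stream means here, a minimal Python sketch (standard library only) could look like the following. It is not how qpdf works internally; the regular expression, the function name `inflate_streams` and the hand-made fragment are assumptions for illustration, and a real parser would honor /Length and the xref table, which this does not.

```python
import re
import zlib

def inflate_streams(pdf_bytes):
    """Yield the decoded bytes of every stream that zlib can inflate."""
    # Stream data sits between the 'stream' and 'endstream' keywords.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            # decompressobj tolerates the end-of-line bytes left before 'endstream'.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate-compressed (for example a JPEG image): skip it.
            continue

# A hand-made fragment standing in for one object of a real PDF file.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
packed = zlib.compress(content)
fragment = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)

decoded = list(inflate_streams(fragment))
print(decoded[0].decode("ascii"))
```

The printed text is the kind of page content operator sequence that becomes readable in a text editor once the streams are stored uncompressed.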

`mutool`

There is also the mutool command line tool, which comes bundled with the MuPDF PDF viewer (a sister product to Ghostscript, made by the same company, Artifex). The following command also uncompresses streams and makes them easier to inspect in a text editor:

mutool clean -d orig.pdf uncompressed-mutool.pdf

`podofouncompress`

PoDoFo is a free software/open source library for working with the PDF format. It includes a few command line tools, among them podofouncompress. Use it like this to uncompress PDF streams:

podofouncompress orig.pdf uncompressed-podofo.pdf

`peepdf.py`

PeePDF is a Python-based tool which helps you explore PDF files. Its original purpose was research into and dissection of PDF-based malware, but I also find it useful for investigating the structure of completely benign PDF files.

It can be used interactively to "browse" the objects and streams contained in a PDF.

I'll not give a usage example here, but only a link to its documentation:

peepdf - PDF Analysis Tool

`pdfid.py` and `pdf-parser.py`

pdfid.py and pdf-parser.py are two PDF tools by Didier Stevens written in Python.

Their background is also the exploration of malicious PDFs, but I find them useful for analyzing the structure and contents of benign PDF files, too.

Here is an example of how I would extract the uncompressed stream of PDF object no. 5 into a *.dump file:

pdf-parser.py -o 5 -f -d obj5.dump my.pdf
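The idea behind that command (pick one indirect object by number, pass its stream through the filter, dump the result) can be sketched in a few lines of Python. This is not pdf-parser.py's implementation; `extract_object_stream` and `make_obj` are hypothetical helpers, the sketch handles only Flate-compressed streams, and it assumes the stream data does not happen to contain the bytes "endobj".

```python
import re
import zlib

def extract_object_stream(pdf_bytes, obj_num):
    """Return the inflated stream of indirect object `obj_num`, or None."""
    # An indirect object is written as "<number> <generation> obj ... endobj".
    # The lookbehind stops "5 0 obj" from matching inside "15 0 obj".
    pattern = rb"(?<!\d)%d \d+ obj\b(.*?)endobj" % obj_num
    found = re.search(pattern, pdf_bytes, re.DOTALL)
    if found is None:
        return None  # no such object
    stream = re.search(rb"stream\r?\n(.*?)endstream", found.group(1), re.DOTALL)
    if stream is None:
        return None  # the object exists but carries no stream
    # decompressobj tolerates the end-of-line bytes left before 'endstream'.
    return zlib.decompressobj().decompress(stream.group(1))

def make_obj(num, text):
    """Build a hand-made stream object for the demonstration below."""
    data = zlib.compress(text)
    return (
        b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n%b\nendstream\nendobj\n"
        % (num, len(data), data)
    )

sample = make_obj(15, b"fifteen") + make_obj(5, b"five")
print(extract_object_stream(sample, 5))
```

Writing the returned bytes to a file would give the equivalent of the obj5.dump file above.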

Final notes

Please note that some binary parts inside a PDF cannot necessarily be uncompressed (or decoded into human-readable ASCII), because they are embedded and used in their native format. Examples are JPEG images, fonts and ICC color profiles.
If you compare the tools above and the command line examples given, you will discover that they do NOT all produce identical output. Comparing their differences can in itself help you better understand the PDF syntax and file format.
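One quick way to see which streams are likely to stay binary is to count the /Filter names in a file: FlateDecode streams can be inflated into text, while DCTDecode (JPEG) data is kept in its native format. The following is a rough Python sketch; `count_filters` is a hypothetical helper and the sample bytes are hand-made. It only scans for dictionary entries that are visible as plain text, so entries hidden inside compressed object streams are missed.

```python
import re
from collections import Counter

def count_filters(pdf_bytes):
    """Count the filter names that appear in stream dictionaries."""
    names = Counter()
    # /Filter is followed by either a single name or an array of names.
    for value in re.findall(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", pdf_bytes):
        for name in re.findall(rb"/([A-Za-z0-9]+)", value):
            names[name.decode("ascii")] += 1
    return names

# Hand-made dictionaries standing in for those of a real PDF file.
sample_dicts = (
    b"<< /Length 10 /Filter /FlateDecode >>\n"
    b"<< /Length 20 /Filter [/ASCII85Decode /DCTDecode] >>\n"
    b"<< /Length 30 /Filter/FlateDecode >>\n"
)
filter_counts = count_filters(sample_dicts)
for name, count in sorted(filter_counts.items()):
    print(name, count)
```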

Triturate answered 6/4, 2015 at 15:37 Comment(2)

Any idea how I can inspect a JBIG2 Stream? E.g. a Stream that uses Filter "/Jbig2decode"? They are sadly still unreadable using these methods – Kenyettakenyon 15/6, 2022 at 8:15

For mutool I recommend adding -c, so mutool clean -c -d orig.pdf uncompressed-mutool.pdf, so that each instruction in the content stream will be on a separate line so it's easier to read. – Raby 19/11, 2022 at 2:15

I use iText RUPS (Reading and Updating PDF Syntax) on Linux. Since it's written in Java, it works on Windows, too. You can browse all the objects in a PDF file in a tree structure. It can also decode Flate-encoded streams on the fly to make inspection easier.

Here is a screenshot:

iText RUPS screenshot

Shrewd answered 3/6, 2012 at 10:1 Comment(17)

java -jar itext-rups-5.5.6.jar -> Exception in thread "AWT-EventQueue-0" java.lang.NoClassDefFoundError: com/itextpdf/text/Version - How are you supposed to run this thing? Edit: Figured it out. You should not download the default file offered by SourceForge, you need to download the .jar which includes dependencies. – Vying 13/7, 2015 at 0:52

@Vying just came across the same thing. Thanks for your comment. – Dzungaria 13/7, 2015 at 6:33

@Zero3: You should no longer download from SF at all... – Triturate 29/9, 2015 at 11:12

@KurtPfeifle I completely agree. Unfortunately, a lot of software (like this!) is only available through SourceForge because the maintainer did not move the project elsewhere yet, and might never do so. You should indeed be very careful when downloading anything from SourceForge these days... – Vying 29/9, 2015 at 12:57

@Vying at the time you wrote that comment, all iText related software, including RUPS, was already on GitHub for more than 6 months. There is also the official iText website, itextpdf.com – Mythology 11/3, 2016 at 7:35

@Vying the release of iText 5.5.9 is scheduled for next week and might not be offered on Sourceforge. I will put up a notice to tell people where we have moved. Unfortunately that will make some other people unhappy, but you cannot please all of the people all of the time. – Mythology 11/3, 2016 at 7:38

@AmedeeVanGasse Great! I was not aware, as I just followed the link by gkcn. I'm not sure what you mean with making other people unhappy. Abandoning SourceForge seems like the only sensible thing to do. – Vying 11/3, 2016 at 10:44

There are tons of ancient links all over the web, also on StackOverflow, that point to Sourceforge. If they point to the main project page, then it's okay and they will see the notice that I will put up. But if it is a deep link to a specific file on a specific commit, and I remove that, then people will get a 404. – Mythology 11/3, 2016 at 11:0

@AmedeeVanGasse is iText RUPS available as a compiled jar ready to use by non-developers? – Charlinecharlock 12/4, 2016 at 9:50

Yes - as a compiled jar and even as an exe, for Windows users. See github.com/itext/rups/releases/latest – Mythology 12/4, 2016 at 9:53

@AmedeeVanGasse the screenshot in this answer shows a view of the page (between the document tree and xref tab). How can I display that view in v5.5.9 on Windows? – Charlinecharlock 12/4, 2016 at 13:3

Please start a new question. – Mythology 12/4, 2016 at 13:9

AGPL version has no built-in renderer... – Catarinacatarrh 19/6, 2017 at 18:38

for all experiencing Exception in thread "AWT-EventQueue-0" issue try running other jar from zipfile: java -jar itext-rups-5.5.9-jar-with-dependencies.jar – Argentine 4/7, 2017 at 4:18

I found PikePDF to be an excellent way to get at QPDF’s functionality from Python. – Disentomb 9/2, 2021 at 12:49

If you get java.lang.UnsatisfiedLinkError: Can't load library: /usr/lib/jvm/java-11-openjdk-amd64/lib/libawt_xawt.so, try sudo apt-get install openjdk-11-jre – Telson 10/11, 2021 at 1:0

Current RUPS version does even allow for editing the PDF structure right from the GUI. – Wellpreserved 26/12, 2021 at 22:6

PDFXplorer from O2 Solutions does an outstanding job of displaying the internals if you're on a Windows machine.

http://www.o2sol.com/pdfxplorer/overview.htm

(Free, distracting banner at the bottom).

Ennoble answered 17/12, 2017 at 13:33 Comment(0)

If you're on Windows, PDF Analyzer is similar to PDFXplorer, but it has more options. It is also free after a single registration.

Rozella answered 17/12, 2018 at 13:16 Comment(4)

For me PDFXplorer works much better, because it goes deeper into the contents. – Sinful 17/5, 2021 at 5:9

@Sinful how do you mean, in the tree? I like the fact that PDFAnalyzer can show text and can dump images. – Rozella 18/5, 2021 at 14:27

I compared PDFxplorer and PDF Analyzer and PDFXplorer lets me dig down a bit deeper into the internal structures of the streams than PDF Analyzer. – Sinful 22/5, 2021 at 1:23

For people reading this that want to try PDF Analyzer, you don't need to register into their site just fill the names and emails with anything and click "Register my free copy" but make sure to block the application from accessing Internet through your firewall, or disable Internet while registering the application. – Angellaangelle 1/6, 2021 at 18:34

There is also another option. Adobe Acrobat Pro is also able to display the internal tree structure of the PDF.

Open Preflight
Go to Options (right upper corner)
Internal PDF Structure

On top of that, Adobe Acrobat Pro can also display the internal structure of the document fonts in the PDF. Most other "PDF tree structure viewers" don't have this option.

Flushing answered 23/9, 2015 at 9:15 Comment(2)

This is what @mark-stephens describes in the accepted answer. – Roomer 6/3, 2018 at 13:35

@mark-stephens' answer just links to a blog post that might disappear in the future (and is discouraged on SO). vadimo's actually provides the answer. – Battles 26/12, 2018 at 18:24

I've used PDFBox with good success. Here's a sample of what the code looks like (back from version 0.7.2), that likely came from one of the provided examples:

// load the document
System.out.println("Reading document: " + filename);
PDDocument doc = PDDocument.load(filename);

// look at all the document information
PDDocumentInformation info = doc.getDocumentInformation();
COSDictionary dict = info.getDictionary();
List l = dict.keyList();
for (Object o : l) {
    //System.out.println(o.toString() + " " + dict.getString(o));
    System.out.println(o.toString());
}

// look at the document catalog
PDDocumentCatalog cat = doc.getDocumentCatalog();
System.out.println("Catalog:" + cat);

List<PDPage> lp = cat.getAllPages();
System.out.println("# Pages: " + lp.size());
PDPage page = lp.get(4); // zero-based index: the fifth page, so the document needs at least five pages
System.out.println("Page: " + page);
System.out.println("\tCropBox: " + page.getCropBox());
System.out.println("\tMediaBox: " + page.getMediaBox());
System.out.println("\tResources: " + page.getResources());
System.out.println("\tRotation: " + page.getRotation());
System.out.println("\tArtBox: " + page.getArtBox());
System.out.println("\tBleedBox: " + page.getBleedBox());
System.out.println("\tContents: " + page.getContents());
System.out.println("\tTrimBox: " + page.getTrimBox());
List<PDAnnotation> la = page.getAnnotations();
System.out.println("\t# Annotations: " + la.size());

Allargando answered 23/8, 2010 at 16:53 Comment(0)

The object viewer in Acrobat is good, but WindJack Solutions has a plugin for Acrobat called PDF CanOpener that allows better inspection, with an eyedropper for selecting objects on the page. It also permits modifications to be made to the PDF.

https://www.windjack.com/product/pdfcanopener/

Chevalier answered 24/8, 2010 at 19:11 Comment(0)

If you want to work programmatically from within Python, pdfminer is a good option. It allows you to work with PDF structure in memory as an object hierarchy or serialize it as XML.

Greenock answered 28/10, 2018 at 16:29 Comment(1)

This was an excellent recommendation, thanks! (pdfminer is now known as pdfminer.six. It worked like a charm for me. All I wanted to do was dump the structure of the table of contents, and that was actually one of their examples in the documentation.) – Faden 10/2 at 15:54

-8

My suggestion is Foxit PDF Reader, which is very helpful for doing text editing work on PDF files.

Moulden answered 11/3, 2016 at 0:5 Comment(1)

I couldn't find any way in Foxit Reader to view the internal structure of a PDF similar to PDF Inspector (referenced in the question) – Gawky 12/2, 2017 at 20:48
