How can I visually inspect the structure of a PDF to reverse engineer it? [closed]
N

10

147

How can I inspect the structure of PDF files?

Use case: I'm trying to programmatically generate PDF files (using iText). I'm having trouble achieving certain layouts, but I have PDF files with text laid out the way I want (generated from Word). I would like to reverse engineer how they do it.

PDF Inspector seems to be good, but I'm looking for something for Windows.

Nador answered 23/8, 2010 at 16:22 Comment(5)
PDF Inspector is Java based, so multiplatform. – Trimorphism
Doesn't seem to run on Windows though. The jar doesn't do anything when clicked on. When called at the command line I get "no main manifest attribute, in PDF Document Inspector.jar". – Menell
@Trimorphism It's Java based but Apple wrapped, so it's kind of an Apple-only distribution. There is a "PDF Document Inspector.app/Contents/Resources/Java/PDF Document Inspector.jar" jar, but it's not startable as java -jar "PDF Document Inspector.jar". Also there are a lot of com.apple.cocoa.* includes that are platform specific. :( – Torin
I'm now successfully using iText RUPS, which is multiplatform and Java based. – Trimorphism
Unfortunately I can't add an answer since the question is closed, but after much searching I finally found this tool: brendandahl.github.io/pdf.js.utils/browser (using pdf.js under the hood to inspect the structure of your PDF). I've had a lot of success reverse engineering PDFs with this page. – Yingyingkow
C
24

Adobe Acrobat has a very cool but rather well hidden mode allowing you to inspect PDF files. I wrote a blog article explaining it at https://blog.idrsolutions.com/2009/04/viewing-pdf-objects/

Cistern answered 24/8, 2010 at 6:41 Comment(7)
This seems to require a plugin; at least it's not available in Acrobat Reader 9.5.5 on Linux. – Schleswig
@AdamSpiers, the Preflight dialog box is a feature of Adobe Acrobat, not Adobe Reader. – Tripetalous
... and Acrobat (formerly Acrobat Exchange) is not available for Linux :-/ – Schleswig
The Preflight dialog box actually requires Adobe Acrobat Pro. It is not available in Adobe Acrobat Standard. – Hersch
And it is a UI nightmare to actually use. – Miscreant
Well, we do not use Adobe Acrobat, so how to inspect the PDF without it? – Fechner
I know this is a very old thread, but I found an online PDF inspector, which allows you to browse the PDF structure in a way very similar to how Adobe does it. It is slightly less powerful than Adobe, but it's free and online, so might still be useful for somebody… – Consumption
T
140

Besides the GUI-based tools mentioned in the other answers, there are a few command line tools which can transform the original PDF source code into a different representation that lets you inspect the (now modified) file with a text editor. All of the tools below work on Linux, Mac OS X, other Unix systems or Windows.

qpdf (my favorite)

Use qpdf to uncompress (most) objects' streams and also dissect ObjStm objects into individual indirect objects:

qpdf --qdf --object-streams=disable orig.pdf uncompressed-qpdf.pdf

qpdf describes itself as a tool that does "structural, content-preserving transformations on PDF files".

Then just open + inspect the uncompressed-qpdf.pdf file in your favorite text editor. Most of the previously compressed (and hence, binary) bytes will now be plain text.
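Once the file is uncompressed, a few lines of Python can give you an index of the indirect objects before you start scrolling. This is only a rough sketch and not part of qpdf: `list_objects` is a name I made up, and it assumes each `N G obj` header starts on its own line (which the QDF output is meant to give you). Binary stream data could in principle produce a false match.

```python
import re

# Matches an indirect object header such as "12 0 obj" at the start of a line.
OBJ_HEADER = re.compile(rb"^(\d+)\s+(\d+)\s+obj\b", re.MULTILINE)


def list_objects(data: bytes) -> list[tuple[int, int, int]]:
    """Return (object number, generation, byte offset) for each header found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.start())
        for m in OBJ_HEADER.finditer(data)
    ]


# Tiny hand-written sample standing in for the contents of uncompressed-qpdf.pdf.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
for num, gen, offset in list_objects(sample):
    print(f"{num} {gen} obj at byte {offset}")
```

For a real file, read it in binary mode (`open("uncompressed-qpdf.pdf", "rb").read()`) and pass the bytes to `list_objects`.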

mutool

There is also the mutool command line tool, which comes bundled with the MuPDF PDF viewer (a sister product to Ghostscript, made by the same company, Artifex). The following command also uncompresses streams and makes them easier to inspect in a text editor:

mutool clean -d orig.pdf uncompressed-mutool.pdf

podofouncompress

PoDoFo is a Free Software/Open Source library for working with the PDF format. It includes a few command line tools, among them podofouncompress. Use it like this to uncompress PDF streams:

podofouncompress orig.pdf uncompressed-podofo.pdf
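If you want to try all three uncompressors on the same file, the commands above can be scripted. The following is a minimal sketch, not an official wrapper for any of these tools: the function names are my own, the command lines are exactly the ones quoted above, and a tool is simply skipped when it is not found on the PATH. Treating only exit status 0 as success is an assumption that may be too strict for tools that signal warnings through a non-zero status.

```python
import shutil
import subprocess


def uncompress_commands(src: str) -> dict[str, list[str]]:
    """Map each tool name to the command line quoted in this answer."""
    return {
        "qpdf": ["qpdf", "--qdf", "--object-streams=disable", src,
                 "uncompressed-qpdf.pdf"],
        "mutool": ["mutool", "clean", "-d", src, "uncompressed-mutool.pdf"],
        "podofouncompress": ["podofouncompress", src,
                             "uncompressed-podofo.pdf"],
    }


def run_available(src: str) -> list[str]:
    """Run every tool found on PATH; return the names that exited with status 0."""
    succeeded = []
    for tool, cmd in uncompress_commands(src).items():
        if shutil.which(tool) is None:
            continue  # tool is not installed, skip it
        if subprocess.run(cmd).returncode == 0:
            succeeded.append(tool)
    return succeeded
```

Calling `run_available("orig.pdf")` writes the output files into the current directory and tells you which tools actually produced one.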

peepdf.py

PeePDF is a Python-based tool which helps you to explore PDF files. Its original purpose was for research and dissection of PDF-based malware, but I find it useful also to investigate the structure of completely benign PDF files.

It can be used interactively to "browse" the objects and streams contained in a PDF.

I'll not give a usage example here, but only a link to its documentation:

pdfid.py and pdf-parser.py

pdfid.py and pdf-parser.py are two PDF tools by Didier Stevens written in Python.

They were also written to help explore malicious PDFs, but I find them just as useful for analyzing the structure and contents of benign PDF files.

Here is an example of how I would extract the uncompressed stream of PDF object no. 5 into a *.dump file:

pdf-parser.py -o 5 -f -d obj5.dump my.pdf
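To see roughly what such an extraction involves, here is a standard-library sketch. It is not how pdf-parser.py is implemented; `extract_stream` is a hypothetical helper that assumes the object header starts a line, that the stream data does not itself contain the bytes `endobj` or `endstream`, and that the only filter in use is a single /FlateDecode (no filter chains, no predictors).

```python
import re
import zlib

# The "stream" keyword, its end-of-line marker, then everything up to "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def extract_stream(data: bytes, obj_num: int, gen: int = 0) -> bytes:
    """Return the stream of one indirect object, inflating it if Flate-encoded."""
    header = re.search(rb"^%d\s+%d\s+obj\b" % (obj_num, gen), data, re.MULTILINE)
    if header is None:
        raise ValueError(f"object {obj_num} {gen} not found")
    end = data.find(b"endobj", header.end())
    body = data[header.end():end if end != -1 else len(data)]
    match = STREAM_RE.search(body)
    if match is None:
        raise ValueError(f"object {obj_num} {gen} has no stream")
    dictionary, raw = body[:match.start()], match.group(1)
    if b"/FlateDecode" in dictionary:
        # decompressobj() stops at the end of the zlib data, so the
        # end-of-line bytes in front of "endstream" are ignored.
        return zlib.decompressobj().decompress(raw)
    # Unfiltered stream: drop trailing end-of-line bytes (this may also
    # drop genuine newlines at the very end of the data).
    return raw.rstrip(b"\r\n")
```

A call like `open("obj5.dump", "wb").write(extract_stream(open("my.pdf", "rb").read(), 5))` would then write the decoded stream to a file.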

Final notes

  1. Please note that some binary parts inside a PDF are not necessarily uncompressible (or decode-able into human readable ASCII code), because they are embedded and used in their native format inside PDFs. Such PDF parts are JPEG images, fonts or ICC color profiles.

  2. If you compare the above tools and the command line examples given, you will discover that they do NOT all produce identical output. Comparing their differences can in itself help you better understand the nature of the PDF syntax and file format.
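Regarding note 1, you can get a quick idea of which streams will stay binary by counting the filter names in a file. The sketch below is an assumption-laden shortcut rather than a real parser: it scans the raw bytes for `/...Decode` names (so stray matches inside binary data are possible), and `count_filters` and `describe` are names I invented. The filter names themselves come from the PDF specification.

```python
import re
from collections import Counter

# General-purpose encodings: decoding them may reveal readable text.
GENERAL_PURPOSE = {"FlateDecode", "LZWDecode", "RunLengthDecode",
                   "ASCIIHexDecode", "ASCII85Decode"}
# Image codecs: the data is kept in its native compressed format.
IMAGE_CODECS = {"DCTDecode": "JPEG", "JPXDecode": "JPEG 2000",
                "JBIG2Decode": "JBIG2", "CCITTFaxDecode": "CCITT fax"}

FILTER_NAME = re.compile(rb"/(\w+Decode)\b")


def count_filters(data: bytes) -> Counter:
    """Count every name of the form /SomethingDecode found in the raw bytes."""
    return Counter(m.group(1).decode("ascii") for m in FILTER_NAME.finditer(data))


def describe(name: str) -> str:
    """Say whether a filter is a general-purpose encoding or an image codec."""
    if name in IMAGE_CODECS:
        return f"{IMAGE_CODECS[name]} image codec, stays binary"
    if name in GENERAL_PURPOSE:
        return "general-purpose encoding, payload may be readable once decoded"
    return "not a standard filter name known to this sketch"
```

Keep in mind that even a Flate-encoded stream can decode to binary data, for example an embedded font program or an ICC profile.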

Triturate answered 6/4, 2015 at 15:37 Comment(2)
Any idea how I can inspect a JBIG2 stream, e.g. a stream that uses the filter "/Jbig2decode"? They are sadly still unreadable using these methods. – Kenyettakenyon
For mutool I recommend adding -c, so mutool clean -c -d orig.pdf uncompressed-mutool.pdf, so that each instruction in the content stream will be on a separate line and is easier to read. – Raby
S
71

I use iText RUPS (Reading and Updating PDF Syntax) on Linux. Since it's written in Java, it works on Windows, too. You can browse all the objects in a PDF file in a tree structure. It can also decode Flate-encoded streams on the fly to make inspection easier.

Here is a screenshot:

[Screenshot: iText RUPS]
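If you only need a rough text version of such an object tree, it can be approximated with the Python standard library on a file whose object streams have been unpacked. This is an illustrative sketch and has nothing to do with how RUPS works internally: all function names are made up, it assumes classic `N G obj ... endobj` objects starting at the beginning of a line and a trailer containing `/Root`, and it recurses without any depth limit.

```python
import re

OBJ_RE = re.compile(rb"^(\d+)\s+\d+\s+obj\b(.*?)endobj", re.MULTILINE | re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+\d+\s+R\b")
ROOT_RE = re.compile(rb"/Root\s+(\d+)\s+\d+\s+R")


def reference_graph(data: bytes) -> dict[int, list[int]]:
    """Map each object number to the object numbers it references."""
    graph = {}
    for m in OBJ_RE.finditer(data):
        dictionary = m.group(2).split(b"stream", 1)[0]  # skip stream payloads
        graph[int(m.group(1))] = [int(r) for r in REF_RE.findall(dictionary)]
    return graph


def find_root(data: bytes) -> int:
    """Return the object number of the document catalog named by /Root."""
    m = ROOT_RE.search(data)
    if m is None:
        raise ValueError("no /Root entry found")
    return int(m.group(1))


def format_tree(graph, root, depth=0, seen=None) -> list[str]:
    """Indent each object under its referrer; mark repeats instead of looping."""
    seen = set() if seen is None else seen
    line = "  " * depth + f"obj {root}"
    if root in seen:
        return [line + " (already shown)"]
    seen.add(root)
    lines = [line]
    for child in graph.get(root, []):
        lines.extend(format_tree(graph, child, depth + 1, seen))
    return lines
```

Usage would be along the lines of `data = open("uncompressed-qpdf.pdf", "rb").read()` followed by `print("\n".join(format_tree(reference_graph(data), find_root(data))))`. The "already shown" marker matters because pages point back to their parent, so the reference graph contains cycles.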

Shrewd answered 3/6, 2012 at 10:1 Comment(17)
java -jar itext-rups-5.5.6.jar -> Exception in thread "AWT-EventQueue-0" java.lang.NoClassDefFoundError: com/itextpdf/text/Version - How are you supposed to run this thing? Edit: Figured it out. You should not download the default file offered by SourceForge; you need to download the .jar which includes dependencies. – Vying
@Vying just came across the same thing. Thanks for your comment. – Dzungaria
@Zero3: You should no longer download from SF at all... – Triturate
@KurtPfeifle I completely agree. Unfortunately, a lot of software (like this!) is only available through SourceForge because the maintainer did not move the project elsewhere yet, and might never do so. You should indeed be very careful when downloading anything from SourceForge these days... – Vying
@Vying at the time you wrote that comment, all iText related software, including RUPS, had already been on GitHub for more than 6 months. There is also the official iText website, itextpdf.com – Mythology
@Vying the release of iText 5.5.9 is scheduled for next week and might not be offered on SourceForge. I will put up a notice to tell people where we have moved. Unfortunately that will make some other people unhappy, but you cannot please all of the people all of the time. – Mythology
@AmedeeVanGasse Great! I was not aware, as I just followed the link by gkcn. I'm not sure what you mean by making other people unhappy. Abandoning SourceForge seems like the only sensible thing to do. – Vying
There are tons of ancient links all over the web, also on StackOverflow, that point to SourceForge. If they point to the main project page, then it's okay and they will see the notice that I will put up. But if it is a deep link to a specific file on a specific commit, and I remove that, then people will get a 404. – Mythology
@AmedeeVanGasse is iText RUPS available as a compiled jar ready to use by non-developers? – Charlinecharlock
Yes - as a compiled jar and even as an exe, for Windows users. See github.com/itext/rups/releases/latest – Mythology
@AmedeeVanGasse the screenshot in this answer shows a view of the page (between the document tree and xref tab). How can I display that view in v5.5.9 on Windows? – Charlinecharlock
Please start a new question. – Mythology
AGPL version has no built-in renderer... – Catarinacatarrh
For anyone experiencing the Exception in thread "AWT-EventQueue-0" issue, try running the other jar from the zip file: java -jar itext-rups-5.5.9-jar-with-dependencies.jar – Argentine
I found PikePDF to be an excellent way to get at QPDF’s functionality from Python. – Disentomb
If you get java.lang.UnsatisfiedLinkError: Can't load library: /usr/lib/jvm/java-11-openjdk-amd64/lib/libawt_xawt.so, try sudo apt-get install openjdk-11-jre – Telson
The current RUPS version even allows editing the PDF structure right from the GUI. – Wellpreserved
E
12

PDFXplorer from O2 Solutions does an outstanding job of displaying the internals if you're on a Windows machine.

http://www.o2sol.com/pdfxplorer/overview.htm

(Free, with a distracting banner at the bottom.)

Ennoble answered 17/12, 2017 at 13:33 Comment(0)
R
11

If you're on Windows, PDF Analyzer is similar to PDFXplorer, but it has more options. It is also free after a single registration.

Rozella answered 17/12, 2018 at 13:16 Comment(4)
For me PDFXplorer works much better, because it goes deeper into the contents. – Sinful
@Sinful how do you mean, in the tree? I like the fact that PDFAnalyzer can show text and can dump images. – Rozella
I compared PDFXplorer and PDF Analyzer, and PDFXplorer lets me dig down a bit deeper into the internal structures of the streams than PDF Analyzer. – Sinful
For people reading this who want to try PDF Analyzer: you don't need to register on their site. Just fill the names and emails with anything and click "Register my free copy", but make sure to block the application from accessing the Internet through your firewall, or disable the Internet while registering the application. – Angellaangelle
F
9

There is also another option. Adobe Acrobat Pro is also able to display the internal tree structure of the PDF.

  1. Open Preflight
  2. Go to Options (right upper corner)
  3. Internal PDF Structure

On top of that, Adobe Acrobat Pro can also display the internal structure of the document fonts in the PDF; most other "PDF tree structure viewers" don't have this option.

Flushing answered 23/9, 2015 at 9:15 Comment(2)
This is what @mark-stephens describes in the accepted answer. – Roomer
@mark-stephens' answer just links to a blog post that might disappear in the future (and is discouraged on SO). vadimo's actually provides the answer. – Battles
A
5

I've used PDFBox with good success. Here's a sample of what the code looks like (back from version 0.7.2), that likely came from one of the provided examples:

// load the document
System.out.println("Reading document: " + filename);
PDDocument doc = PDDocument.load(filename);

// look at all the document information
PDDocumentInformation info = doc.getDocumentInformation();
COSDictionary dict = info.getDictionary();
List l = dict.keyList();
for (Object o : l) {
    //System.out.println(o.toString() + " " + dict.getString(o));
    System.out.println(o.toString());
}

// look at the document catalog
PDDocumentCatalog cat = doc.getDocumentCatalog();
System.out.println("Catalog:" + cat);

List<PDPage> lp = cat.getAllPages();
System.out.println("# Pages: " + lp.size());
PDPage page = lp.get(4); // fifth page (index 4); throws if the document has fewer than five pages
System.out.println("Page: " + page);
System.out.println("\tCropBox: " + page.getCropBox());
System.out.println("\tMediaBox: " + page.getMediaBox());
System.out.println("\tResources: " + page.getResources());
System.out.println("\tRotation: " + page.getRotation());
System.out.println("\tArtBox: " + page.getArtBox());
System.out.println("\tBleedBox: " + page.getBleedBox());
System.out.println("\tContents: " + page.getContents());
System.out.println("\tTrimBox: " + page.getTrimBox());
List<PDAnnotation> la = page.getAnnotations();
System.out.println("\t# Annotations: " + la.size());
Allargando answered 23/8, 2010 at 16:53 Comment(0)
C
4

The object viewer in Acrobat is good, but Windjack Solution has a plugin for Acrobat called PDF Canopener that allows better inspection, with an eyedropper for selecting objects on the page. It also permits modifications to be made to the PDF.

https://www.windjack.com/product/pdfcanopener/

Chevalier answered 24/8, 2010 at 19:11 Comment(0)
G
2

If you want to work programmatically from within Python, pdfminer is a good option. It allows you to work with PDF structure in memory as an object hierarchy or serialize it as XML.

Greenock answered 28/10, 2018 at 16:29 Comment(1)
This was an excellent recommendation, thanks! (pdfminer is now known as pdfminer.six. It worked like a charm for me. All I wanted to do was dump the structure of the table of contents, and that was actually one of their examples in the documentation.) – Faden
M
-8

My suggestion is Foxit PDF Reader, which is very helpful for doing important text editing work on PDF files.

Moulden answered 11/3, 2016 at 0:5 Comment(1)
I couldn't find any way in Foxit Reader to view the internal structure of a PDF similar to PDF Inspector (referenced in the question). – Gawky