Tool to compare large numbers of PDF files? [closed]

Asked 28/9, 2008 at 11:2 Answered 8/5, 2011 at 6:36

I need to compare a large number of PDF files by their visual content. Because the PDF files were created on different platforms and with different versions of the software, there are structural differences. For example:

the chunking of text can be different
the write order can be different
the position can differ by a few pixels

It should compare the content the way a human would, not the internal structure. I want to test for regressions between different versions of the PDF generator that we use.

Gerfen answered 28/9, 2008 at 11:2 Comment(12)

A partial answer would be to use pdftotext and compare the text contained. – Bramble 28/9, 2008 at 11:5

But this will ignore all non-text information like lines, boxes, pictures, charts, etc. I also think that it does not show the visual positions of the text, only the structural position. – Gerfen 28/9, 2008 at 11:30

I agree, it is not a sufficient criterion. On the other hand it is a necessary criterion, therefore it is adequate as a unit test. – Bramble 28/9, 2008 at 11:35

Never actually been in your situation before, but I've tried ExamDiff Pro to compare PDFs and it worked for me. – Fendig 28/9, 2008 at 11:35

You can always add a better unit test later! – Bramble 28/9, 2008 at 11:36

If there are images on pages, and you want a human-like evaluation for those, there's not much you can do but have a human compare those pages, unless you want to work on a whole new project, just as big as your current one, to try it out. – Vivanvivarium 28/9, 2008 at 11:52

I think a bitmap check should work in your case. I use an automation tool to compare two images using a bitmap checkpoint. – Riki 29/9, 2008 at 17:57

What an intelligent, \\*#?`%& decision to close this question as 'not constructive'! (Gotta luv it when question-closing-moderators destroy community content which carries tags where these same mods don't have any personal reputation in!) – Abject 18/9, 2012 at 21:22

Another case of useless closing a question concerning a highly relevant realworld use-case. I wish I knew how to propose a sound reasoning on Meta so this will stop eventually. It just feels so wrong every time it happens. – Fenderson 22/1, 2014 at 14:2

related: superuser.com/q/46123/35237 – Sofar 2/12, 2014 at 10:6

There is a FREE library to compare pdf pixel by pixel. Check this blog. testautomationguru.com/… – Deception 16/6, 2015 at 23:37

You can use the free Copyleaks Compare Two PDF tool. You can upload up to 12 files for comparison. Additionally, the comparison is textual, not semantic (Git style). – Totipalmate 26/7, 2020 at 4:57

Because no such tool was available, we have written one. You can download the i-net PDF content comparer and use it. I hope that helps others with the same problem. If you have problems with it or have feedback for us, you can contact our support.


Gerfen answered 16/2, 2010 at 8:34 Comment(6)

The advantage of this tool is, that it's neither a pure text comparer nor an image comparer. It compares by structure, checks if the containing elements are "the same" - so your compared PDFs do not have to match 100% but be within a definable similarity. And it's for free. – Daigle 14/10, 2010 at 5:22

I'd recommend this too! It crashed on a document so I sent it to them. They fixed it! :D I feel great. It can generate images with differences or it can give you a textual report in the console. – Yesterday 10/6, 2011 at 21:9

@Daigle Where is that application free? It costs at least 200 USD per year (!). It's only free once for 30 days. That's way too expensive for what I'd do with it. – Bobble 11/10, 2012 at 8:10

@LonelyPixel Yep, you're right. Version 1.0 was for free (as of 2010-10-14). We've changed quite a bit on it and it's now a paid tool (2012-10). You can however try it for 30 days without any limitations. It has really gained a lot of new features, stability and reliability. I hope you still have a look at it ;) – Daigle 11/10, 2012 at 11:16

I too need to compare pdf files - I have come up with a jar using apache pdfbox. Check this testautomationguru.com/… for example & download. – Deception 14/6, 2015 at 0:11

This is a great tool. Unfortunately, it gets severely distracted by line numbers (I am comparing my author-generated pdf to publisher page proofs that do have line numbers). Could the tool be made to ignore (line) numbers? – Kantian 22/6, 2017 at 7:49

There is actually a diffpdf tool.

http://www.qtrac.eu/diffpdf.html

Its weakness is that it doesn't react well when additions make new text shift partially to a new page. For instance, if old page 4 should be compared to the end of page 5 and the beginning of page 6, you'll need to shift parameters to compare the two slices separately.

Humphrey answered 3/5, 2011 at 11:49 Comment(1)

The original open source version is still available at qtrac.eu/diffpdf-foss.html – Sofar 1/12, 2014 at 9:25

I've used a home-baked script which

converts all pages on two PDFs to bitmaps
colors pages of PDF 1 to red-on-white
changes white to transparent on pages of PDF 2
overlays each page from PDF 2 on top of the corresponding page from PDF 1
runs conversion/coloring and overlaying in parallel on multiple cores

Software used:

GhostScript for PDF-to-bitmap conversion
ImageMagick for coloring, transparency and overlay
inotify for synchronizing parallel processes
any PNG-capable image viewer for reviewing the result

Pros:

simple implementation
all tools used are open source
great for finding small differences in layout

Cons:

the conversion is slow
major differences between PDFs (e.g. pagination) result in a mess
bitmaps are not zoomable
only works well for black-and-white text and diagrams
no easy-to-use GUI

I've been looking for a tool which would do the same on PDF/PostScript level.

Here's how our script invokes the utilities (note that ImageMagick uses GhostScript behind the scenes to do the PDF->PNG conversion):

$ convert -density 150x150 -fill red -opaque black +antialias 1.pdf back%02d.png
$ convert -density 150x150 -transparent white +antialias 2.pdf front%02d.png
$ composite front01.png back01.png result01.png # do this for all pairs of images
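The "do this for all pairs" step can be sketched as a small loop. This is only a minimal sketch, assuming the frontNN.png / backNN.png naming produced by the commands above; the function name overlay_all and the resultNN.png naming are illustrative and not part of the original script.

```shell
#!/bin/bash
# Sketch: overlay every frontNN.png on its matching backNN.png.
# overlay_all is a hypothetical helper name; file names follow the
# front%02d.png / back%02d.png pattern used in the commands above.
overlay_all() {
  local front page
  for front in front*.png; do
    [ -e "$front" ] || continue        # glob matched nothing
    page=${front#front}                # strip the prefix, e.g. "01.png"
    if [ -e "back$page" ]; then
      # ImageMagick: composite <overlay> <background> <output>
      composite "$front" "back$page" "result$page"
    else
      # Page counts differ between the two PDFs
      echo "no matching page for $front" >&2
    fi
  done
}
```

Run it in the directory that holds the PNG files, e.g. `overlay_all`. Pages that exist in only one of the two PDFs are reported on stderr rather than silently skipped.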

Galactopoietic answered 10/2, 2010 at 8:59 Comment(3)

Why not share the full script? – Yesterday 19/5, 2011 at 20:25

This is what I used for compositing:

for i in $(seq -w 0 05); do /cygdrive/c/Progra~1/ImageMagick-6.6.9-Q8/composite.exe 1-$i.png 2-$i.png result-$i.png; done

– Yesterday 19/5, 2011 at 21:40

Here's a script that doesn't write temporary files to disk and uses Poppler's pdftoppm, which is faster than Ghostscript: gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a pdfdiff directory and additionally prints the numbers of the pages which differ between the two PDFs. – Alack 31/3, 2016 at 13:47

I don't seem to be able to see this here, so here it is: via superuser: How to compare the differences between two PDF files? (answer #229891, by @slestak), there is

https://github.com/vslavik/diff-pdf

(build steps for Ubuntu Natty can be found in get-diff-pdf.sh)

As far as I can see, it basically overlays the text/graphics of each page in the pdf(s), allowing you to easily see if there were any changes...

Cheers!

Lecce answered 8/5, 2011 at 6:36 Comment(0)

We've also used pdftotext (see Sklivvz's answer) to generate ASCII versions of PDFs and wdiff to compare them.

Use pdftotext's -layout switch to enhance readability and get some idea of changes in the layout.
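For a large batch, the pdftotext step can be wrapped in a loop over two directories. The following is a rough sketch, assuming the old and new generator outputs live in two directories with identical file names; compare_text_dirs and the SAME/DIFF/MISSING labels are made up for illustration.

```shell
#!/bin/bash
# Sketch: compare the extracted text of same-named PDFs in two directories.
# compare_text_dirs is a hypothetical helper; the directory layout is assumed.
compare_text_dirs() {
  local old_dir new_dir old new name old_text new_text failed
  old_dir=$1
  new_dir=$2
  failed=0
  for old in "$old_dir"/*.pdf; do
    [ -e "$old" ] || continue          # directory has no PDFs
    name=$(basename "$old")
    new="$new_dir/$name"
    if [ ! -e "$new" ]; then
      echo "MISSING $name"
      failed=1
      continue
    fi
    # "-" as the output file makes pdftotext write to stdout
    old_text=$(pdftotext -layout "$old" -)
    new_text=$(pdftotext -layout "$new" -)
    if [ "$old_text" = "$new_text" ]; then
      echo "SAME $name"
    else
      echo "DIFF $name"
      failed=1
    fi
  done
  return "$failed"
}
```

Called as `compare_text_dirs old_output new_output`, it returns a non-zero status if any file differs or is missing, so it can act as a coarse regression gate. Files reported as DIFF can then be inspected with wdiff.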

To get nice colored output from wdiff, use this wrapper script:

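#!/bin/bash
# $'...' (ANSI-C quoting) is a bash feature, so run this with bash, not plain sh.
RED=$'\e[1;31m'
GREEN=$'\e[1;32m'
RESET=$'\e[0m'
# -w/-x: start/end markers for deleted words; -y/-z: for inserted words;
# -n: do not extend marked fields across line breaks
wdiff -w "$RED" -x "$RESET" -y "$GREEN" -z "$RESET" -n "$1" "$2"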

Galactopoietic answered 10/2, 2010 at 9:8 Comment(0)

I think your best approach would be to convert the PDF to images at a decent resolution and then do an image compare.

To generate images from PDF you can use Adobe PDF Library or the solution suggested at Best way to convert pdf files to tiff files.

To compare the generated TIFF files I found GNU tiffcmp (on Windows, part of GnuWin32 tiff) and tiffinfo did a good job. Use tiffcmp -l and count the number of lines of output to find any differences. If you are happy to tolerate a small amount of content change (e.g. anti-aliasing differences), then use tiffinfo to get the total number of pixels, and you can then generate a percentage difference value.
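The percentage calculation could look roughly like the sketch below. It assumes single-page, 8-bit single-channel TIFFs, so that each byte listed by tiffcmp -l corresponds to one pixel, and it assumes tiffinfo prints a line of the form "Image Width: W Image Length: H". The helper names percent_of and tiff_diff_percent are made up for illustration. If tiffcmp also prints lines about differing tags, the count will be slightly too high.

```shell
#!/bin/bash
# percent_of: pure arithmetic, prints 100 * part / total with two decimals.
percent_of() {
  awk -v part="$1" -v total="$2" \
    'BEGIN { if (total <= 0) exit 1; printf "%.2f\n", 100 * part / total }'
}

# tiff_diff_percent: hypothetical helper, see the assumptions in the lead-in.
tiff_diff_percent() {
  local info width height differing
  info=$(tiffinfo "$1")
  # Pull width and height out of the "Image Width: W Image Length: H" line
  width=$(printf '%s\n' "$info" |
    sed -n 's/.*Image Width: \([0-9][0-9]*\).*/\1/p' | head -n 1)
  height=$(printf '%s\n' "$info" |
    sed -n 's/.*Image Length: \([0-9][0-9]*\).*/\1/p' | head -n 1)
  # One output line per differing byte of image data
  differing=$(tiffcmp -l "$1" "$2" | wc -l)
  percent_of "$((differing))" "$((width * height))"
}
```

For example, `tiff_diff_percent old.tif new.tif` prints a single number that can be checked against whatever tolerance suits your documents.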

By the way for anyone doing simple PDF comparison where the structure hasn't changed it is possible to use command line diff and ignore certain patterns, e.g. with GNU diff 2.7:

diff --brief -I xap: -I xapMM: -I /CreationDate -I /BaseFont -I /ID --binary --text

This still has the problem that it doesn't always catch changes in generated font names.
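As a sketch, the diff call above can be wrapped so that it takes the two files as arguments and reports only via its exit status. The function name pdf_raw_same is hypothetical, and this assumes GNU diff, where -I ignores a change only if every line in it matches one of the patterns.

```shell
#!/bin/bash
# pdf_raw_same: hypothetical wrapper around the diff call above.
# Returns 0 when the only differences are on lines matching the ignored
# patterns, non-zero otherwise.
pdf_raw_same() {
  diff --brief -I xap: -I xapMM: -I /CreationDate -I /BaseFont -I /ID \
       --binary --text "$1" "$2" > /dev/null
}
```

Usage would be something like `pdf_raw_same old.pdf new.pdf && echo unchanged`. The font-name caveat above still applies.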

Tejada answered 29/9, 2008 at 15:4 Comment(2)

I think comparing two images is more complex than comparing the PDF files themselves. – Gerfen 16/2, 2010 at 8:37

Comparing images can be done with GnuWin32 tiffcmp. I will update my answer to elaborate on this. – Tejada 16/2, 2010 at 9:7

Our product, PDF Comparator (http://www.premediasystems.com/pdfc.html), will do this quite elegantly and efficiently. It's also not free, and is a Mac OS X only application.

Liederkranz answered 3/8, 2010 at 0:9 Comment(11)

This tool compares pixel by pixel. This is very simple. The question was about a comparison the way a human would do it. – Gerfen 5/8, 2010 at 9:7

@Horcrux7: But how else than comparing 'pixel by pixel' do human eyes compare different pages that are similar looking?!? – Abject 18/9, 2012 at 21:49

@KurtPfeifle - I realize this is an old comment...but human beings do not compare images on a pixel to pixel basis; the way human beings compare differences in images is pretty complex, but relies heavily on pattern recognition and heuristics. – Virtuosic 18/8, 2015 at 17:28

@CBRF23: True, and I'm aware of that -- but all this heuristics in the end still roots in "pixel-by-pixel" comparisons. For some other, higher level heuristics, performed with ImageMagick, see some of my other answers: one -- two -- three. – Abject 18/8, 2015 at 17:38

@CBRF23: ...and the original poster, (at)Hocrux7 even mentioned "pixels" in his question, and explicitly didn't want "internal structure" of the files compared (even though his comment here again contradicts it). – Abject 18/8, 2015 at 17:41

@KurtPfeifle - nice examples how to use ImageMagick - but I would not compare that to human perception, humans just aren't built for pixel by pixel comparisons. I prove my point: using your wizard example with the four images, pick any two of them and try to identify all the different pixels without using any tools - just your eyes. I guarantee you cannot do it. You may spot some clusters of pixels that are different, but without using tools (e.g. software, or writing utensils) you will not be able to do this. You cannot identify how many pixels there are, let alone all that are different. – Virtuosic 18/8, 2015 at 17:46

@KurtPfeifle - I'm not arguing this answer is useful - just refuting your assertion that a pixel by pixel comparison is analogous to how human beings perceive differences in images ;) – Virtuosic 18/8, 2015 at 17:48

@CBRF23: You're missing the point. The OP (from 2008!) asked for a tool to compare a "large number of PDF files" -- just because he didn't want to have it done by humans themselves. The (good and bad) answers here reflect what people at the time suggested. (I myself came across this thread only in 2012!). --- Of course I cannot identify, without tools, all pixels that are different! What makes you think I said so? -- If you ask for a tool, you have to base it on pixel-by-pixel comparisons. And even human perception, in the end, is rooted in "pixel-by-pixel" viewing... – Abject 18/8, 2015 at 17:53

Let us continue this discussion in chat. – Virtuosic 18/8, 2015 at 17:56

@CBRF23: Sorry, I'm just on my way off + offline.... – Abject 18/8, 2015 at 17:57

@KurtPfeifle - no worries, it's off-topic discussion on a seven year old post - we both have better things to do ;) – Virtuosic 18/8, 2015 at 18:3

Based on your needs, a convert to text solution would be the easiest and most direct. I did think the bitmap idea was pretty cool.

Pahang answered 4/2, 2011 at 0:52 Comment(0)

Bluebeam PDF software will do this for you.

Monazite answered 23/3, 2010 at 13:55 Comment(0)

You can batch compare pdf files with Tarkware Pdf Comparer. But it's not free and requires Adobe Acrobat.

Mourner answered 28/3, 2010 at 21:13 Comment(0)
