Tool to compare large numbers of PDF files? [closed]
G

10

84

I need to compare a large number of PDF files by their visual content. Because the PDF files were created on different platforms and with different versions of the software, there are structural differences. For example:

  • the chunking of text can be different
  • the write order can be different
  • positions can differ by a few pixels

The tool should compare the content the way a human would, not the internal structure. I want to test for regressions between different versions of the PDF generator that we use.

Gerfen answered 28/9, 2008 at 11:2 Comment(12)
A partial answer would be to use pdftotext and compare the text contained.Bramble
But this will ignore all non-text information such as lines, boxes, pictures, charts, etc. I also think it reflects only the structural position of the text, not its visual position.Gerfen
I agree, it is not a sufficient criterion. On the other hand, it is a necessary criterion, so it is adequate as a unit test.Bramble
Never actually been in your situation before, but I've tried ExamDiff Pro to compare PDFs and it worked for me.Fendig
You can always add a better unit test later!Bramble
If there are images on pages, and you want a human-like evaluation for those, there's not much you can do but have a human compare those pages, unless you want to work on a whole new project, just as big as your current one, to try it out.Vivanvivarium
I think a bitmap check should work in your case. I use an automation tool to compare two images using a bitmap checkpoint.Riki
What an intelligent, \\*#?`%& decision to close this question as 'not constructive'! (Gotta luv it when question-closing-moderators destroy community content which carries tags where these same mods don't have any personal reputation in!)Abject
Another case of useless closing a question concerning a highly relevant realworld use-case. I wish I knew how to propose a sound reasoning on Meta so this will stop eventually. It just feels so wrong every time it happens.Fenderson
related: superuser.com/q/46123/35237Sofar
There is a FREE library to compare pdf pixel by pixel. Check this blog. testautomationguru.com/…Deception
You can use the free Copyleaks Compare Two PDF tool. You can upload up to 12 files for comparison. Additionally, the comparison is textual, not semantic (Git style).Totipalmate
G
41

Because no such tool was available, we have written one. You can download the i-net PDF content comparer and use it. I hope that helps others with the same problem. If you have problems with it or have feedback for us, you can contact our support.


Gerfen answered 16/2, 2010 at 8:34 Comment(6)
The advantage of this tool is, that it's neither a pure text comparer nor an image comparer. It compares by structure, checks if the containing elements are "the same" - so your compared PDFs do not have to match 100% but be within a definable similarity. And it's for free.Daigle
I'd recommend this too! It crashed on a document so I sent it to them. They fixed it! :D I feel great. It can generate images with differences or it can give you a textual report in the console.Yesterday
@Daigle Where is that application free? It costs at least 200 USD per year (!). It's only free once for 30 days. That's way too expensive for what I'd do with it.Bobble
@LonelyPixel Yep, you're right. Version 1.0 was for free (as of 2010-10-14). We've changed quite a bit on it and it's now a paid tool (2012-10). You can however try it for 30 days without any limitations. It has really gained a lot of new features, stability and reliability. I hope you still have a look at it ;)Daigle
I too need to compare pdf files - I have come up with a jar using apache pdfbox. Check this testautomationguru.com/… for example & download.Deception
This is a great tool. Unfortunately, it gets severely distracted by line numbers (I am comparing my author-generated pdf to publisher page proofs that do have line numbers). Could the tool be made to ignore (line) numbers?Kantian
H
20

There is actually a diffpdf tool.

http://www.qtrac.eu/diffpdf.html

Its weakness is that it doesn't react well when additions make new text shift partially to a new page. For instance, if old page 4 should be compared to the end of page 5 and the beginning of page 6, you'll need to shift parameters to compare the two slices separately.

Humphrey answered 3/5, 2011 at 11:49 Comment(1)
The original open source version is still available at qtrac.eu/diffpdf-foss.htmlSofar
G
14

I've used a home-baked script which

  • converts all pages on two PDFs to bitmaps
  • colors pages of PDF 1 to red-on-white
  • changes white to transparent on pages of PDF 2
  • overlays each page from PDF 2 on top of the corresponding page from PDF 1
  • runs conversion/coloring and overlaying in parallel on multiple cores

Software used:

  • GhostScript for PDF-to-bitmap conversion
  • ImageMagick for coloring, transparency and overlay
  • inotify for synchronizing parallel processes
  • any PNG-capable image viewer for reviewing the result

Pros:

  • simple implementation
  • all tools used are open source
  • great for finding small differences in layout

Cons:

  • the conversion is slow
  • major differences between PDFs (e.g. pagination) result in a mess
  • bitmaps are not zoomable
  • only works well for black-and-white text and diagrams
  • no easy-to-use GUI

I've been looking for a tool which would do the same on PDF/PostScript level.

Here's how our script invokes the utilities (note that ImageMagick uses GhostScript behind the scenes to do the PDF->PNG conversion):

$ convert -density 150x150 -fill red -opaque black +antialias 1.pdf back%02d.png
$ convert -density 150x150 -transparent white +antialias 2.pdf front%02d.png
$ composite front01.png back01.png result01.png # do this for all pairs of images
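
A minimal sketch of the "do this for all pairs" step, assuming the frontNN.png and backNN.png files produced by the two convert commands sit in one directory. The function name overlay_pages and the COMPOSITE override variable are made up for illustration and are not part of the original script. A page that exists in only one of the PDFs is reported instead of composited.

```bash
#!/bin/bash
# COMPOSITE is a hypothetical override hook (e.g. COMPOSITE=echo for a dry run);
# by default ImageMagick's composite is used, as in the commands above.
COMPOSITE=${COMPOSITE:-composite}

# overlay_pages [DIR]: overlay every frontNN.png on the matching backNN.png
overlay_pages() {
    local dir=${1:-.} front page back
    for front in "$dir"/front*.png; do
        [ -e "$front" ] || continue          # the glob matched nothing
        page=${front##*/front}               # e.g. "01.png"
        back="$dir/back$page"
        if [ -e "$back" ]; then
            # same argument order as above: overlay, base image, result
            "$COMPOSITE" "$front" "$back" "$dir/result$page"
        else
            echo "no counterpart for $front" >&2   # page counts differ
        fi
    done
}
```

Calling overlay_pages without arguments works on the current directory; COMPOSITE=echo overlay_pages prints the commands it would run without touching any images.
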
Galactopoietic answered 10/2, 2010 at 8:59 Comment(3)
Why not share the full script?Yesterday
This is what I used for compositing: for i in $(seq -w 0 05); do /cygdrive/c/Progra~1/ImageMagick-6.6.9-Q8/composite.exe 1-$i.png 2-$i.png result-$i.png; doneYesterday
Here's a script that doesn't write temporary files to disk and uses Poppler's pdftoppm, which is faster than Ghostscript: gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a pdfdiff directory and additionally prints the numbers of the pages which differ between the two PDFs.Alack
L
14

I don't seem to be able to see this here, so here it is: via superuser: How to compare the differences between two PDF files? (answer #229891, by @slestak), there is

https://github.com/vslavik/diff-pdf

(build steps for Ubuntu Natty can be found in get-diff-pdf.sh)

As far as I can see, it basically overlays the text/graphics of each page in the pdf(s), allowing you to easily see if there were any changes...

Cheers!

Lecce answered 8/5, 2011 at 6:36 Comment(0)
G
10

We've also used pdftotext (see Sklivvz's answer) to generate ASCII versions of PDFs and wdiff to compare them.

Use pdftotext's -layout switch to enhance readability and get some idea of changes in the layout.

To get nice colored output from wdiff, use this wrapper script:

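#!/bin/bash
# $'\e' (ANSI-C quoting) is a bash feature, so run this with bash rather than plain sh
RED=$'\e[1;31m'
GREEN=$'\e[1;32m'
RESET=$'\e[0m'
# deleted words are shown in red, inserted words in green; quoting keeps file names with spaces intact
wdiff -w"$RED" -x"$RESET" -y"$GREEN" -z"$RESET" -n "$1" "$2"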
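
For a large number of files, the text comparison can be batched. The following is only a sketch under a few assumptions: the two generator versions wrote same-named PDFs into two directories, and a plain pass/fail per file is enough, so it uses diff -q instead of wdiff (wdiff can then be run on the files that are reported). The function name text_regressions and the PDFTOTEXT override variable are hypothetical.

```bash
#!/bin/bash
# PDFTOTEXT is a hypothetical override hook; by default Poppler/Xpdf pdftotext is used.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

# text_regressions OLD_DIR NEW_DIR: report PDFs whose extracted text differs
text_regressions() {
    local old_dir=$1 new_dir=$2 old new status=0
    for old in "$old_dir"/*.pdf; do
        [ -e "$old" ] || continue            # the glob matched nothing
        new="$new_dir/${old##*/}"
        if [ ! -e "$new" ]; then
            echo "MISSING ${old##*/}"
            status=1
            continue
        fi
        # "-" as output file sends the text to stdout; -layout keeps the physical layout
        if ! diff -q <("$PDFTOTEXT" -layout "$old" -) \
                     <("$PDFTOTEXT" -layout "$new" -) >/dev/null; then
            echo "DIFFERS ${old##*/}"
            status=1
        fi
    done
    return $status
}
```

The function returns a non-zero status as soon as any file is missing or differs, so it can serve as a simple regression gate, for example text_regressions out-v1 out-v2 (both directory names are placeholders).
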
Galactopoietic answered 10/2, 2010 at 9:8 Comment(0)
T
4

I think your best approach would be to convert the PDFs to images at a decent resolution and then do an image comparison.

To generate images from PDF you can use Adobe PDF Library or the solution suggested at Best way to convert pdf files to tiff files.

To compare the generated TIFF files I found GNU tiffcmp (on Windows, part of GnuWin32 tiff) and tiffinfo did a good job. Use tiffcmp -l and count the number of lines of output to find any differences. If you are happy to tolerate a small amount of change (e.g. anti-aliasing differences), then use tiffinfo to count the total number of pixels and you can then generate a percentage difference value.
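
The percentage calculation could look roughly like the sketch below. The helper names percent_diff and within_tolerance are hypothetical, and the commented usage lines assume single-page TIFFs and that tiffinfo prints a line of the form "Image Width: W Image Length: H". Note that tiffcmp -l lists differing bytes, so for images with more than one byte per pixel the result is an approximation rather than an exact pixel percentage.

```bash
#!/bin/bash
# percent_diff DIFFERING TOTAL: print the share of differing units as a percentage
percent_diff() {
    awk -v d="$1" -v t="$2" 'BEGIN {
        if (t <= 0) exit 2                   # refuse to divide by zero
        printf "%.2f\n", 100 * d / t
    }'
}

# within_tolerance DIFFERING TOTAL MAX_PERCENT: succeed if the difference is small enough
within_tolerance() {
    local pct
    pct=$(percent_diff "$1" "$2") || return 2
    awk -v p="$pct" -v m="$3" 'BEGIN { exit !(p <= m) }'
}

# Possible usage (a.tif and b.tif are placeholder file names):
#   differing=$(tiffcmp -l a.tif b.tif | wc -l)
#   total=$(tiffinfo a.tif | awk '/Image Width/ { print $3 * $6 }')
#   within_tolerance "$differing" "$total" 0.5 || echo "a.tif and b.tif differ too much"
```
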

By the way for anyone doing simple PDF comparison where the structure hasn't changed it is possible to use command line diff and ignore certain patterns, e.g. with GNU diff 2.7:

diff --brief -I xap: -I xapMM: -I /CreationDate -I /BaseFont -I /ID --binary --text

This still has the problem that it doesn't always catch changes in generated font names.

Tejada answered 29/9, 2008 at 15:4 Comment(2)
I think comparing two images is more complex than comparing the PDF files themselves.Gerfen
Comparing images can be done with GnuWin32 tiffcmp. I will update my answer to elaborate on this.Tejada
L
1

Our product, PDF Comparator - http://www.premediasystems.com/pdfc.html - will do this quite elegantly and efficiently. It's also not free, and is a Mac OS X only application.

Liederkranz answered 3/8, 2010 at 0:9 Comment(11)
This tool compares pixel by pixel, which is very simple. The question asked for a comparison like the one a human would do.Gerfen
@Horcrux7: But how else than comparing 'pixel by pixel' do human eyes compare different pages that are similar looking?!?Abject
@KurtPfeifle - I realize this is an old comment...but human beings do not compare images on a pixel to pixel basis; the way human beings compare differences in images is pretty complex, but relies heavily on pattern recognition and heuristics.Virtuosic
@CBRF23: True, and I'm aware of that -- but all this heuristics in the end still roots in "pixel-by-pixel" comparisons. For some other, higher level heuristics, performed with ImageMagick, see some of my other answers: one -- two -- three.Abject
@CBRF23: ...and the original poster, (at)Hocrux7 even mentioned "pixels" in his question, and explicitly didn't want "internal structure" of the files compared (even though his comment here again contradicts it).Abject
@KurtPfeifle - nice examples how to use ImageMagick - but I would not compare that to human perception, humans just aren't built for pixel by pixel comparisons. I prove my point: using your wizard example with the four images, pick any two of them and try to identify all the different pixels without using any tools - just your eyes. I guarantee you cannot do it. You may spot some clusters of pixels that are different, but without using tools (e.g. software, or writing utensils) you will not be able to do this. You cannot identify how many pixels there are, let alone all that are different.Virtuosic
@KurtPfeifle - I'm not arguing this answer is useful - just refuting your assertion that a pixel by pixel comparison is analogous to how human beings perceive differences in images ;)Virtuosic
@CBRF23: You're missing the point. The OP (from 2008!) asked for a tool to compare a "large number of PDF files" -- just because he didn't want to have it done it by humans themselves. The (good and bad) answers here reflect what people at the time suggested. (I myself came across this thread only in 2012!). --- Of course I cannot identify, without tools, all pixels that are different! What makes you think I said so? -- If you ask for a tool, you have to base it on pixel-by-pixel comparisons. And even human perception, in the end, is rooted in "pixel-by-pixel" viewing...Abject
Let us continue this discussion in chat.Virtuosic
@CBRF23: Sorry, I'm just on my way off + offline....Abject
@KurtPfeifle - no worries, it's off-topic discussion on a seven year old post - we both have better things to do ;)Virtuosic
P
1

Based on your needs, a convert to text solution would be the easiest and most direct. I did think the bitmap idea was pretty cool.

Pahang answered 4/2, 2011 at 0:52 Comment(0)
M
0

Bluebeam PDF software will do this for you.

Monazite answered 23/3, 2010 at 13:55 Comment(0)
M
0

You can batch compare pdf files with Tarkware Pdf Comparer. But it's not free and requires Adobe Acrobat.

Mourner answered 28/3, 2010 at 21:13 Comment(0)
