Reliable way to (programmatically) compare PDFs? [duplicate]
C

4

9

Possible Duplicate:
Tool to compare large numbers of PDF files?

I am in the classic scenario where the business hands you a bunch of new PDF forms for the new year with no revision notes whatsoever, and you are supposed to figure out what's different from the previous year's forms.

I am talking loads of forms here, so I am trying to find a way to compare PDFs and outline the differences without having people manually go through each and every one of them.

My idea was to extract all the text from the PDFs, dump it into .txt files, and then diff the text files, but it sounds horrible.

My question says programmatically, but I'd be happy with any reliable tool for comparing PDFs; I'm mainly looking to get an idea of people's experiences. I'm also willing to entertain any programmatic solutions (preferably in C#, but please shoot out any ideas).
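For illustration, the diff half of that idea can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a recommendation: it assumes the text has already been pulled out of each PDF by some extraction tool (that step is not shown), and the function name `diff_extracted_text` and the sample form text are made up for the example.

```python
import difflib


def diff_extracted_text(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Return unified-diff lines between two blobs of already-extracted PDF text.

    Blank lines and surrounding whitespace are dropped first, because text
    extractors often emit slightly different spacing for the same content.
    """
    old_lines = [line.strip() for line in old_text.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new_text.splitlines() if line.strip()]
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=old_name, tofile=new_name, lineterm=""
        )
    )


if __name__ == "__main__":
    # Hypothetical extracted text from last year's and this year's form.
    last_year = "Form A\nTax year 2009\nLine 1: Wages\n"
    this_year = "Form A\nTax year 2010\nLine 1: Wages\n"
    for line in diff_extracted_text(last_year, this_year):
        print(line)
```

An empty result means the extracted text is the same after whitespace normalization. It says nothing about layout, fonts, or field positions, which is one reason a text-only diff may not be enough for forms.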

Coir answered 30/9, 2010 at 21:18 Comment(1)
Why is this a duplicate? Clearly the question asked is about how to do it programmatically. Any number of installable tools are not the answer to this question. Symposium
P
8

There are quite a few software products that claim to diff PDFs. I've never needed to use one, but if this is going to be a recurring process I think it'd be wise for your company to invest in one of them. Just Google "pdf diff" for a bunch of potential applications.

Additionally, your situation is very similar to this question: Tool to compare large numbers of PDF files? I think its discussion may help.

Partly answered 30/9, 2010 at 21:43 Comment(1)
Thanks for that. That question is indeed very similar (for some reason it didn't pop up when I composed mine). Coir
B
7

I am a developer of the Docotic.Pdf library. We use PDF comparison in our unit tests to check that a test produces the expected PDF. A PDF is a collection of special objects, and we compare all PDF objects while ignoring some properties such as trailer IDs and creator info. This implementation works fine.

You can try the PdfDocument.DocumentsAreEqual method. It only tells you whether the documents are equal, without reporting the specific differences. You may contact us if you need more functionality.

Boigie answered 2/10, 2010 at 3:47 Comment(0)
T
4

I took the approach of getting the raw data out of the PDF, then using Word, TortoiseSVN, WinMerge, etc. to take care of the comparison piece. In my case I did the comparison in a RichTextBox in C#, coloring the differences, since we wanted it all within our app.

Here is what I did: PDF comparison. I was trying to compare mixed documents, Word and PDF.

However, I would recommend PDFBox for the parsing. It is a bit more elegant, although iTextSharp worked out OK.

Turning answered 30/9, 2010 at 21:50 Comment(0)
F
2

I wrote a blog post suggesting some approaches to comparing PDF files at https://blog.idrsolutions.com/2010/09/comparing-2-pdf-files/

Friarbird answered 1/10, 2010 at 7:10 Comment(2)
Convert the PDF to an image and then compare, and it still needs human intervention? How is this useful then? Domingodominguez
The software can tell you if they have not changed so you know you have not broken anything. Only a human can evaluate any changes.Friarbird
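The "tell you if they have not changed" step could be sketched roughly as follows in Python. This is only an illustration under loud assumptions: each page is assumed to have been rendered to image bytes already by some external renderer (not shown), the renderer is assumed to be deterministic so that identical pages give identical bytes, and the names `changed_pages` and `page_001.png` are hypothetical.

```python
import hashlib


def changed_pages(old_pages, new_pages):
    """Return sorted names of pages whose rendered bytes differ.

    Both arguments map a page name (e.g. "page_001.png") to the raw bytes of
    its rendered image. A page present on only one side counts as changed.
    """
    old_hashes = {name: hashlib.sha256(data).hexdigest() for name, data in old_pages.items()}
    new_hashes = {name: hashlib.sha256(data).hexdigest() for name, data in new_pages.items()}
    all_names = old_hashes.keys() | new_hashes.keys()
    return sorted(n for n in all_names if old_hashes.get(n) != new_hashes.get(n))


if __name__ == "__main__":
    # Stand-in byte strings; real input would be rendered page images.
    old = {"page_001.png": b"same pixels", "page_002.png": b"old pixels"}
    new = {"page_001.png": b"same pixels", "page_002.png": b"new pixels"}
    print(changed_pages(old, new))
```

An empty list means nothing changed at the byte level. Any page that does show up still needs a person to look at it, since even a one-pixel rendering difference will flag the page.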
