I'm writing a CAD application that outputs PDF files using the Cairo graphics library. A lot of the unit testing does not require actually generating the PDF files, such as computing the expected bounding boxes of the objects. However, I want to make sure that the generated PDF files "look" correct after I change the code. Is there an automated way to do this? How can I automate as much as possible? Do I need to visually inspect each generated PDF? How can I solve this problem without pulling my hair out?
You could capture the PDF as a bitmap (or at least a losslessly-compressed) image, and then compare the image generated by each test with a reference image of what it's supposed to look like. Any differences would be flagged as an error for the test.
`diff` is enough to flag an error, like you suggest. – Roxy

(See also update below!)
I'm doing the same thing using a shell script on Linux that wraps:

- ImageMagick's `compare` command
- the `pdftk` utility
- Ghostscript (optionally)

(It would be rather easy to port this to a `.bat` batch file for DOS/Windows.)
I have a few reference PDFs created by my application which are "known good". Newly generated PDFs after code changes are compared to these reference PDFs. The comparison is done pixel by pixel and is saved as a new PDF. In this PDF, all unchanged pixels are painted in white, while all differing pixels are painted in red.
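That "unchanged pixels white, differing pixels red" rendering can also be sketched in Python with Pillow, if you prefer not to shell out to ImageMagick. This is a minimal sketch, not the script described here; the file paths are made up:

```python
from PIL import Image, ImageChops

def red_white_diff(a_path, b_path, out_path):
    """Paint unchanged pixels white and differing pixels red."""
    a = Image.open(a_path).convert("RGB")
    b = Image.open(b_path).convert("RGB")
    # Any channel difference marks the pixel as changed.
    mask = ImageChops.difference(a, b).convert("L").point(lambda p: 255 if p else 0)
    out = Image.new("RGB", a.size, (255, 255, 255))
    out.paste((255, 0, 0), mask=mask)  # fill red wherever the mask is set
    out.save(out_path)
```

The inputs would be page bitmaps rendered from the PDFs (e.g. by Ghostscript, as shown further down), since Pillow cannot open PDFs directly.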
Here are the building blocks:
pdftk
Use this command to split multi-page PDF files into multiple single-page PDFs:
pdftk reference.pdf burst output somewhere/reference_page_%03d.pdf
pdftk comparison.pdf burst output somewhere/comparison_page_%03d.pdf
compare
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/reference_page_001.pdf \
somewhere/comparison_page_001.pdf \
-compose src \
somewhereelse/reference_diff_page_001.pdf
Ghostscript
Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.
If you want to automatically discover all cases which consist of purely white pages, you can also convert to a metadata-free bitmap format using the `bmp256` output device. You can do that for the original PDFs (reference and comparison), or for the diff PDF pages:
gs \
-o reference_diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
reference_diff_page_001.pdf
md5sum reference_diff_page_001.bmp
If the MD5 sum is what you expect for an all-white page of 595x842 PostScript points, then your unit test passed.
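Wrapping that check in a unit test is then just a hash comparison. A minimal Python sketch; the expected digest is a placeholder you would record once from a verified all-white page:

```python
import hashlib

def file_md5(path):
    """MD5 of a file's contents, read in chunks to handle large bitmaps."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# EXPECTED_ALL_WHITE = "..."  # record this once from a known-good all-white BMP
# assert file_md5("reference_diff_page_001.bmp") == EXPECTED_ALL_WHITE
```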
Update:
I don't know why I didn't previously think of generating histogram output from the ImageMagick `compare`...
The following is a command pipeline chaining two different commands:

- The first is the same `compare` as above, which generates the 'white pixels are equal, red pixels are differences' format, only it outputs ImageMagick's internal `miff` format. It doesn't write to a file, but to stdout.
- The second uses `convert` to read stdin, generate a histogram, and output the result in text form. There will be two lines: one indicating the number of white pixels, the other indicating the number of red pixels.
Here it goes:
compare \
reference.pdf \
current.pdf \
-compose src \
miff:- \
| \
convert \
- \
-define histogram:unique-colors=true \
-format %c \
histogram:info:-
Sample output:
56934: (61937, 0, 7710,52428) #F1F100001E1ECCCC srgba(241,0,30,0.8)
444056: (65535,65535,65535,52428) #FFFFFFFFFFFFCCCC srgba(255,255,255,0.8)
(Sample output was generated by using these reference.pdf and current.pdf files.)
I think this type of output is really well suited for automatic unit testing. If you evaluate the two numbers, you can easily compute the "red pixel" percentage and you could even decide to return PASSED or FAILED based on a certain threshold (if you don't necessarily need "zero red" for some reason).
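Evaluating those two numbers is easy to script. A sketch in Python: the regex matches the `srgba(...)` sample output above, and the 1% threshold is an arbitrary example, not a recommendation:

```python
import re

def diff_fraction(histogram_text):
    """Fraction of non-white pixels in `convert ... histogram:info:-` output."""
    total = white = 0
    for line in histogram_text.splitlines():
        m = re.match(r"\s*(\d+):.*srgba\((\d+),(\d+),(\d+)", line)
        if not m:
            continue
        count = int(m.group(1))
        rgb = tuple(int(m.group(i)) for i in (2, 3, 4))
        total += count
        if rgb == (255, 255, 255):
            white += count
    return (total - white) / total if total else 0.0

THRESHOLD = 0.01  # tolerate up to 1% differing pixels (example value)
sample = """56934: (61937, 0, 7710,52428) #F1F100001E1ECCCC srgba(241,0,30,0.8)
444056: (65535,65535,65535,52428) #FFFFFFFFFFFFCCCC srgba(255,255,255,0.8)"""
print("PASSED" if diff_fraction(sample) <= THRESHOLD else "FAILED")
```

With the sample output above, roughly 11% of the pixels are red, so this run would report FAILED against the 1% threshold.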
I ended up using `diff`. I created a `known_good` subdirectory with the "human verified" PDF files, and this code does the actual comparison:

def different(a, b):
    return subprocess.call(['diff', a, b, '--brief']) != 0

(This uses the Mac OS X `diff` command; there is a portable alternative in Python.) This approach would not work if I had too many PDF files to check or if the generator was nondeterministic, but so far it looks like my problem is solved. –
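The portable Python alternative mentioned there is the standard library's `filecmp` module; a minimal sketch:

```python
import filecmp

def different(a, b):
    """Byte-for-byte file comparison without shelling out to diff."""
    return not filecmp.cmp(a, b, shallow=False)
```

`shallow=False` forces a content comparison instead of treating files with matching `os.stat()` signatures as equal.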
The first idea that pops into my head is to use a diff utility. These are generally used to compare texts of documents, but they might also compare the layout of the PDF. Using one, you can compare the expected output with the output supplied.
The first result Google gives me is this. Although it is commercial, there might be other free/open source alternatives.
I would try this using xpresser (https://wiki.ubuntu.com/Xpresser). You can try to match images to similar images, not just exact copies, which is the problem in these cases.
I don't know if xpresser is being actively developed, or if it can be used with standalone image files (I think so); anyway, it takes its ideas from the Sikuli project (which is Java with a Jython front end, while xpresser is Python).
I wrote a tool in Python to validate PDFs for my employer's documentation. It has the capability to compare individual pages to master images. I used a library I found called swftools to export the page to PNG, then used the Python Imaging Library to compare it with the master.
The relevant code looks something like this (this won't run as there are some dependencies on other parts of the script, but you should get the idea):
# assumes: import os; import gfx (from swftools); from PIL import Image, ImageChops

# exporting
gfxpdf = gfx.open("pdf", self.pdfpath)
if os.path.isfile(pngPath):
    os.remove(pngPath)
page = gfxpdf.getPage(pagenum)
img = gfx.ImageList()
img.startpage(page.width, page.height)
page.render(img)
img.endpage()
img.save(pngPath)
return os.path.isfile(pngPath)

# comparing
outPng = os.path.join(outpath, pngname)
masterPng = os.path.join(outpath, "_master", pngname)
if os.path.isfile(masterPng):
    output = Image.open(outPng).convert("RGB")  # discard alpha channel, if any
    master = Image.open(masterPng).convert("RGB")
    # any nonzero per-channel maximum in the difference image means a mismatch
    mismatch = any(x[1] for x in ImageChops.difference(output, master).getextrema())
"cmppdf" compares either the visual appearance or text content of PDFs.
It is a bash script, downloadable from https://abhweb.org/jima/cmppdf?v
It uses pdftk
and compare
to graphically compare PDFs, similar to what others have described in other answers. Meta data (anything which does not change the actual appearance) is not compared.
The text-comparison option uses pdftotxt
and diff
.
I often use Python and Reportlab to generate PDFs, so I test them in a couple of ways:
- Test the individual components like text, Matplotlib plots, or SVG drawings before they get added to the Reportlab doc.
- Test the completed PDF doc by converting it to a PNG image with PyMuPDF / fitz. Here's an example that converts the first page of a PDF.
For either type of image comparison, I built an image differ class that takes in two images and highlights the differences. You can set it up as a Pytest fixture, as described in the class docs. If you use my live coding plugin, it will update the display as you edit your code.
I've generally tried to avoid comparing the test results to static images, because fonts change, and it's annoying to keep the static images up to date. Instead, I write the unit test to generate an expected image, then call the system under test to generate an image and compare the two. This works best when I write building blocks of code, tested at each level. Otherwise, the unit tests get more complicated than the real code.
Have a look at how matplotlib or Sage plotting capabilities are tested. – Unknown