Remove fonts embedded multiple times in a PDF created with pdftk

Is there a way to remove fonts that are embedded multiple times in a PDF file?

This is my scenario:

1) a program generates several one-page PDF reports (querying a db, putting the info into an Excel template and exporting the formatted information as PDF)

2) pdftk merges the single-page PDFs into one file.

Everything works fine, but the resulting PDF is very large: I noticed that the fonts are embedded multiple times (as many times as there are pages: all pages are generated from the same Excel template, the fonts are embedded in each single-page PDF, and pdftk just glues the PDFs together). Is there a way to keep just one copy of each embedded font?

I tried to embed the fonts just in the first page while exporting from excel->pdf: the size of the file decreases dramatically, but it seems that the other pages can't access the embedded fonts.

Thanks, Alessandro

Bromeosin answered 16/5, 2012 at 21:28 Comment(4)
Can you provide 2-3 examples of your single-page PDFs? (Maybe using dummy data if the original data is too sensitive?) - Hotchpot
Can you add the output of pdffonts input.pdf for a few of your input files, as well as pdffonts output.pdf for the file which pdftk created from the same set of inputs? - Hotchpot
Sorry, I didn't see your comments here. I wrote below how to reproduce my problem with dummy Word files. Is it possible to upload a file in some way? As soon as I can, I will download pdffonts, which is not installed on my PC, and I'll let you know. - Bromeosin
I uploaded my dummy example files to dropbox.com/sh/l3nmw23ycfs2s8e/W5bdqjXOik - Bromeosin

You could try to 'repair' your pdftk-concatenated PDF using Ghostscript (but use a recent version, such as 9.05). In many cases Ghostscript will be able to merge the many subsetted fonts into fewer ones.

The command would look like this:

gswin32c.exe ^
    -o output.pdf ^
    -sDEVICE=pdfwrite ^
    -dPDFSETTINGS=/prepress ^
     input.pdf

Check with

pdffonts.exe  output.pdf
pdffonts.exe  input.pdf 

how many instances of the various font subsets are in each file (pdffonts.exe is available as part of a small package of command-line tools).
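To compare the two listings without counting by hand, something like the following Python sketch could tally the subsets per base font. It assumes the usual pdffonts layout (two header lines, then one font per line with the name in the first column) and that subset names carry a six-letter tag followed by "+". The sample text and the helper name count_fonts are made up for illustration; they are not output from the asker's files.

```python
import re
from collections import Counter

# Hypothetical pdffonts listing for a two-page pdftk merge (illustrative only).
SAMPLE = """\
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
ABCDEF+Calibri                       TrueType          yes yes no       8  0
GHIJKL+Calibri                       TrueType          yes yes no      21  0
MNOPQR+Calibri-Bold                  TrueType          yes yes no      34  0
"""

# Subset fonts are named like "ABCDEF+Calibri": six uppercase letters, then "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def count_fonts(pdffonts_text):
    """Count how many font objects share each base font name."""
    counts = Counter()
    # Skip the two header lines, then read the first column of each row.
    for line in pdffonts_text.splitlines()[2:]:
        if not line.strip():
            continue
        name = line.split()[0]  # assumption: font names contain no spaces
        counts[SUBSET_TAG.sub("", name)] += 1
    return counts


if __name__ == "__main__":
    for font, n in sorted(count_fonts(SAMPLE).items()):
        print(f"{font}: {n} instance(s)")
```

Running it once on the pdffonts text for input.pdf and once for output.pdf should show whether Ghostscript actually reduced the number of instances per font.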

But don't complain about the 'slow speed' of this process: Ghostscript fully interprets every PDF input file to accomplish its task, while the pdftk file concatenation is a much simpler process...


Update:

Instead of pdftk you could use Ghostscript to merge your input PDF files. This could possibly avoid the problem you were seeing with the a posteriori Ghostscript 'repair' of your pdftk-merged files. Note that this will be much slower than the 'dumb' pdftk merge. However, the results may please you more, especially regarding font handling and file size.

This would be a possible command:

gswin32c.exe ^
    -o output.pdf ^
    -sDEVICE=pdfwrite ^
    -dPDFSETTINGS=/prepress ^
     input1.pdf ^
     input2.pdf ^
     input3.pdf

You can add more options to the Ghostscript CLI for more fine-grained control over the merge and optimization process.
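If the merge is driven from a script, the argument list could be assembled along these lines. This is only a sketch: the helper name gs_merge_command and the file names are hypothetical, and the options are just the ones already used in this answer, with the input PDFs listed one after another as in the gswin64c command quoted in the comments.

```python
def gs_merge_command(output, inputs, gs="gswin64c.exe"):
    """Build the argument list for merging several PDFs with Ghostscript.

    Only builds the list; nothing is executed here.
    """
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return [
        gs,
        "-o", output,               # -o implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",  # -d with a leading slash, not -s
        *inputs,                    # pages are appended in this order
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = gs_merge_command("output.pdf", ["page1.pdf", "page2.pdf"])
    print(" ".join(cmd))
    # To actually run it (requires Ghostscript to be installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.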

In the end you'll have to decide between the extremes:

  • 'Fast' pdftk producing large output files, vs.
  • 'Slow' gswin32c.exe (Ghostscript) producing lean output files.

I'd be interested if you would post some results (execution time and resulting file sizes) for both methods for a number of your merge processes...


Update 2: Sorry, my previous version contained a typo.
It's not -sPDFSETTINGS=... but it must be -dPDFSETTINGS=... (d in place of s).


Update 3:

Since your source files are Excel sheets made from templates (which usually don't use a lot of different fonts), you could try to use a trick to make sure Ghostscript has all the required glyphs of the fonts used in all to-be-merged-later PDFs:

  • For each font and face (standard, italic, bold, bold-italic) add a table cell into your template sheet at the top left of your print area.
  • Fill this table cell with all printable characters and punctuation signs from the ASCII alphabet: 0123456789, ABCD...XYZ, abc...xyz, :-_;°%&$§")({}[] etc.
  • Make the cell (and the fontsize) as small as you want or need in order to not disturb your overall layout. Use the color white to format the characters in the cell (so they appear invisible in the final PDF).

This method will hopefully make sure that each of your PDFs uses the same subset of glyphs, which should avoid the problems you observed when merging the files with Ghostscript. (Note that if you use, for example, Arial and Arial Italic, you have to create two such cells: one formatted with the standard Arial typeface, the other with the italic one.)
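The filler text for such a cell does not have to be typed by hand. A small Python sketch (the function name filler_text is made up) that produces the digits, letters and ASCII punctuation in one string:

```python
import string


def filler_text():
    """Return all printable, non-whitespace ASCII characters as one string."""
    return (
        string.digits
        + string.ascii_uppercase
        + string.ascii_lowercase
        + string.punctuation
    )


if __name__ == "__main__":
    # Paste the printed line into the hidden cell, once per font face.
    print(filler_text())
```

Non-ASCII signs such as ° or § are not part of this string and would have to be appended manually if your reports use them.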

Hotchpot answered 17/5, 2012 at 9:28 Comment(8)
Thanks pipitas: your solution seems to work pretty well, but: 1) during the generation process I receive several warnings (**** Warning: considering '0000000000 XXXXX n' as a free entry.) 2) at the end I get this message: **** This file had errors that were repaired or ignored. **** The file was produced by: **** >>>> itext-paulo (lowagie.com)[JDK1.1] - build 132 <<<< 3) when I open the file in Acrobat Reader I receive "Cannot extract the embedded font 'ZJRYHZ+Calibri+Bold'. Some characters may not display or print correctly" - and, in fact, some characters are not displayed. - Bromeosin
Thanks again pipitas: I tried to use gs instead of pdftk to merge the PDFs, but the result is the same: the file is small (as in the excel->pdftk->gs process) but some characters are missing (or rather, they are present but not rendered). I used the following command: gswin64 -sPDFSETTINGS=prepress -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf pdffile1.pdf pdffile2.pdf. I also tried adding the missing characters to the first page, and then they are present in the whole document. I think the problem is related to the fact that the fonts are subsetted (and not fully embedded), as explained by KenS. - Bromeosin
I can reproduce my problem in this way: I created 2 new Word documents (Word 2010, Win7 64-bit), the first with the letter "a", the second with the character "%", and saved them as a.pdf and b.pdf. Then I ran the command "gswin64 -sPDFSETTINGS=prepress -dBATCH -sDEVICE=pdfwrite -sOutputFile=output_gs.pdf a.pdf b.pdf". I get a file where, on the second page, the "%" is not rendered. - Bromeosin
pipitas, I tried with "gswin64 -dPDFSETTINGS=/prepress -dBATCH -sDEVICE=pdfwrite -sOutputFile=output_gs.pdf a.pdf b.pdf" and "gswin64 -sPDFSETTINGS=prepress -dBATCH -sDEVICE=pdfwrite -sOutputFile=output_gs.pdf a.pdf b.pdf": same result (btw, what is the difference between "-sPDFSETTINGS=prepress" and "-dPDFSETTINGS=/prepress"?). Did you have a look at my uploaded files (I uploaded my dummy example files to dropbox.com/sh/l3nmw23ycfs2s8e/W5bdqjXOik)? - Bromeosin
@AleV: -sPDFSET... is the wrong syntax and will not have the wanted effect. It must be -dPDFSET.... Also, there must be a leading slash, as in =/prepress. - Hotchpot
@AleV: which version of Ghostscript have you installed? Try gswin64c.exe -v to find out. Did you run the commands in a 'DOS box'? - Hotchpot
@AleV: the .bat file on your dropbox should still be corrected to use -dPDFSETTINGS=/prepress. Also use gswin64c (c for command line) instead of gswin64 (which may pop up the GS GUI). I'd recommend as the full command: gswin64c.exe -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite -o output_gs.pdf C:/TMP/testgs/a.pdf C:/TMP/testgs/b.pdf. (Ghostscript accepts single forward slashes in Windows paths, and -o (for output) is shorter and spares you from having to add -dBATCH -dNOPAUSE.) - Hotchpot
Your trick works, thanks. I still wonder if there is a way to embed (and not subset) a font while exporting from Excel to PDF. I guess this would solve the problem. Anyway, thanks a lot for your support, Ale. - Bromeosin

Fonts are usually subset when creating PDF files, so that they only contain the required glyphs. In addition, the encoding is altered so that the first glyph used is assigned character code 1, the second is 2 and so on.

As a result, the first PDF file might contain a font where 0x01 = A, 0x02 = space, 0x03 = t, 0x04 = e and 0x05 = s. The second file might contain a font where 0x01 = T, 0x02 = e, 0x03 = s and 0x04 = t.

To avoid confusion, a prefix is added to the name of the font in the document. This prefix is stripped out by Acrobat when displaying the font embedding, so it looks as if you have multiple instances of the same font. However, they are in fact different fonts and cannot readily be combined.
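To make the clash concrete, here is a toy Python model of the renumbering described above. It is a simplification for illustration, not how any particular PDF producer is implemented: each page's text gets its own code table, numbered in order of first use, and the same code then stands for different glyphs on different pages.

```python
def subset_encoding(text):
    """Assign character codes 1, 2, 3, ... to glyphs in order of first use."""
    encoding = {}
    seen = set()
    for ch in text:
        if ch not in seen:
            seen.add(ch)
            encoding[len(encoding) + 1] = ch
    return encoding


# The two example pages from the explanation above.
first = subset_encoding("A test")   # 1=A, 2=space, 3=t, 4=e, 5=s
second = subset_encoding("Test")    # 1=T, 2=e, 3=s, 4=t

# Codes that exist in both subsets but point at different glyphs.
conflicts = {
    code: (first[code], second[code])
    for code in sorted(first.keys() & second.keys())
    if first[code] != second[code]
}

if __name__ == "__main__":
    for code, (a, b) in conflicts.items():
        print(f"0x{code:02X}: {a!r} in file 1, {b!r} in file 2")
```

Every shared code conflicts here, which is why the two subsets cannot simply be treated as one font; merging them means re-encoding the text on at least one of the pages.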

Assuming this is the case (and I would need to see your files to be sure) it 'may' be possible to avoid this. If you set the PDF producing software so that it does not subset fonts then pdftk might be able to merge the documents without including the same font multiple times. I haven't tested this obviously, but it might work. Your other option is to modify your workflow so that the reports are produced as multiple page documents in the first place.

Replenish answered 17/5, 2012 at 7:25 Comment(1)
Thanks KenS. Option 2 is impractical for my scenario. I tried 1) to merge the single Excel reports into one Excel file with multiple sheets --> a nightmare due to the presence of pivot tables, tables, named ranges and linked charts whose names and references create conflicts or get lost; 2) to paste the ranges into a Word document --> it works, but the copy-paste operation is unsatisfactory. Option 1 seems promising, but I don't know how to embed (and not subset) the font while creating the PDF: I didn't find this option in the Excel save-as-PDF options (I don't have Distiller). - Bromeosin
