How can I extract embedded fonts from a PDF as valid font files?
H

8

201

I'm aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and whether they are embedded or not.

Now the problem: given I had PDF files with embedded fonts -- how can I extract those fonts in a way that they are re-usable as regular font files? Are there (preferably free) tools which can do that? Also: can this be done programmatically with, say, iText?

Huan answered 15/8, 2010 at 15:37 Comment(0)
C
475

You have several options. All these methods work on Linux as well as on Windows or Mac OS X. However, be aware that most PDFs do not include the full, complete font face when they have a font embedded. Mostly they include just the subset of glyphs used in the document.


Using pdftops

One of the most frequently used methods to do this on *nix systems consists of the following steps:

  1. Convert the PDF to PostScript, for example by using XPDF's pdftops helper program (on Windows: pdftops.exe).
  2. Now fonts will be embedded in .pfa (PostScript) format, and you can extract them using a text editor.
  3. You may need to convert the .pfa (ASCII) to a .pfb (binary) file using the t1utils and pfa2pfb.
  4. In PDFs there are never .pfm or .afm files (font metric files) embedded (because PDF viewers have internal knowledge about these). Without these, font files are hardly usable in a visually pleasing way.
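
Instead of cutting the fonts out by hand in a text editor (step 2), the same job can be sketched in a few lines of Python. This is a minimal sketch, assuming the PostScript file marks each embedded font with the DSC comments `%%BeginResource: font <name>` and `%%EndResource`, as one of the comments on this answer describes. The function name `split_font_resources` is made up for illustration, and PostScript produced by other converters may be structured differently.

```python
import re

# Matches the DSC comment that opens an embedded font resource.
BEGIN_FONT = re.compile(r"%%BeginResource:\s*font\s+(\S+)")


def split_font_resources(ps_text):
    """Return {font_name: font_source} for every font resource block.

    Only the lines between the two DSC markers are kept; the markers
    themselves are dropped.
    """
    fonts = {}
    name = None   # name of the font block we are currently inside, if any
    lines = []
    for line in ps_text.splitlines(keepends=True):
        if name is None:
            match = BEGIN_FONT.match(line)
            if match:
                name = match.group(1)
                lines = []
        elif line.startswith("%%EndResource"):
            fonts[name] = "".join(lines)
            name = None
        else:
            lines.append(line)
    return fonts
```

Each value in the returned dictionary could then be written to its own .pfa file, named after the key.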

Using fontforge

Another method is to use the Free font editor FontForge:

  1. Use the "Open Font" dialogbox used when opening files.
  2. Then select "Extract from PDF" in the filter section of dialog.
  3. Select the PDF file with the font to be extracted.
  4. A "Pick a font" dialogbox opens -- select here which font to open.

Check the FontForge manual. You may need to follow a few specific steps which are not necessarily straightforward in order to save the extracted font data as a file which is re-usable.


Using mupdf

Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don't know about MuPDF, which still is relatively unknown and new: "MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.", written by Artifex Software developers, the same company that gave us Ghostscript.)
(Update: Newer versions of MuPDF have moved the former functionality of 'pdfextract' to the command 'mutool extract'. Download it here: mupdf.com/downloads)

Note: pdfextract.exe is a command-line program. To use it, do the following:

c:\>  pdfextract.exe  c:\path\to\filename.pdf         # (on Windows)
$>    pdfextract  /path/to/filename.pdf               # (on Linux, Unix, Mac OS X)

This command will dump all of the extractable files from the pdf file referenced into the current directory. Generally you will see a variety of files: images as well as fonts. These include PNG, TTF, CFF, CID, etc. The image names will be like img-0412.png if the PDF object number of the image was 412. The fontnames will be like FGETYK+LinLibertineI-0966.ttf, if the font's PDF object number was 966.

CFF (Compact Font Format) files are a recognized format that can be converted to other formats via a variety of converters for use on different operating systems.

Again: be aware that most of these font files may have only a subset of characters and may not represent the complete typeface.
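
To make the naming scheme above concrete, here is a minimal Python sketch that splits a dumped filename into its parts. It assumes the `TAG+Name-objnum.ext` pattern described in this answer. The six uppercase letters followed by `+` are the PDF convention for tagging a subset font, so a filename that carries such a tag is a subset. The function name `parse_dump_name` is made up for illustration, and other MuPDF versions may name their output files differently.

```python
import re

# Optional subset tag (six uppercase letters and '+'), then the font name,
# then '-<object number>.<extension>' as described in the answer.
DUMP_NAME = re.compile(
    r"^(?:(?P<tag>[A-Z]{6})\+)?(?P<name>.+)-(?P<obj>\d+)\.(?P<ext>\w+)$"
)


def parse_dump_name(filename):
    """Split a dumped filename into subset tag, font name, object number, extension.

    Returns None if the filename does not follow the assumed pattern.
    """
    match = DUMP_NAME.match(filename)
    if match is None:
        return None
    return {
        "subset_tag": match.group("tag"),        # None if the font is not tagged as a subset
        "font_name": match.group("name"),
        "object_number": int(match.group("obj")),
        "extension": match.group("ext"),
    }
```

For the example name above, `parse_dump_name("FGETYK+LinLibertineI-0966.ttf")` yields the tag `FGETYK`, the name `LinLibertineI` and the object number 966.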

Update: (Jul 2013) Recent versions of mupdf have seen an internal reshuffling and renaming of their binaries, not just once, but several times. The main utility used to be a 'swiss knife'-alike binary called mubusy (name inspired by busybox?), which more recently was renamed to mutool. These support the sub-commands info, clean, extract, poster and show. Unfortunately, the official documentation for these tools isn't up to date (yet). If you're on a Mac using 'MacPorts': then the utility was renamed in order to avoid name clashes with other utilities using identical names, and you may need to use mupdfextract.

To achieve (roughly) the same results with mutool as the previous tool pdfextract did, just run mutool extract ...

So to extract fonts and images, you may need to run one of the following commandlines:

c:\>  mutool.exe extract filename.pdf      # (on Windows)
$>    mutool     extract filename.pdf      # (on Linux, Unix, Mac OS X)

Downloads are here: mupdf.com/downloads


Using gs (Ghostscript)

Then, Ghostscript can also extract fonts directly from PDFs. However, it needs the help of a special utility program named extractFonts.ps, written in PostScript language, which is available from the Ghostscript source code repository.

To use it, you need to run both this file, extractFonts.ps, and your PDF file. Ghostscript will then use the instructions from the PostScript program to extract the fonts from the PDF. It looks like this on Windows (yes, Ghostscript understands the 'forward slash', /, as a path separator also on Windows!):

gswin32c.exe                  ^
  -q -dNODISPLAY              ^
   c:/path/to/extractFonts.ps ^
  -c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"

or on Linux, Unix or Mac OS X:

gs                          \
  -q -dNODISPLAY            \
   /path/to/extractFonts.ps \
  -c "(/path/to/your/PDFFile.pdf) extractFonts quit"
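
If you have many PDFs to process, the command line above can be assembled from a script. The following is a minimal Python sketch that only builds the argument list and does not run Ghostscript. The helper names `ps_string` and `build_gs_command` are made up for illustration. The optional `-dNOSAFER` switch is taken from a comment on this answer, which reports it was needed to avoid an /invalidfileaccess error; whether you need it depends on your Ghostscript version.

```python
def ps_string(path):
    """Wrap a path in a PostScript string literal, escaping \\, ( and )."""
    escaped = path.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def build_gs_command(extract_ps, pdf, gs="gs", no_safer=False):
    """Build the Ghostscript argument list shown in the answer.

    extract_ps: path to extractFonts.ps
    pdf:        path to the PDF (forward slashes also work on Windows)
    gs:         executable name, e.g. "gswin32c.exe" on Windows
    no_safer:   add -dNOSAFER (reported as necessary by a commenter)
    """
    command = [gs, "-q", "-dNODISPLAY"]
    if no_safer:
        command.append("-dNOSAFER")
    command += [extract_ps, "-c", ps_string(pdf) + " extractFonts quit"]
    return command
```

Passing the resulting list to `subprocess.run(command, check=True)` would then start Ghostscript without any shell quoting issues.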

I tested the Ghostscript method a few years ago. At the time it did extract *.ttf (TrueType) just fine. I don't know if other font types will also be extracted at all, and if so, in a re-usable way. I don't know if the utility blocks the extraction of fonts which are marked as protected.


Using pdf-parser.py

Finally, Didier Stevens' pdf-parser.py: this one is probably not as easy to use, because you need to have some know-how about internal PDF structures. pdf-parser.py is a Python script which can do a lot of other things too. It can also decompress and extract arbitrary streams from objects, and therefore it can extract embedded font files too.

But you need to know what to look for. Let's see it with an example. I have a file named big1.pdf. As a first step I use the -s parameter to search the PDF for any occurrence of the keyword FontFile (pdf-parser.py does not require a case sensitive search):

pdf-parser.py -s fontfile big1.pdf

In my case, for my big1.pdf, I get this result:

obj 9 0
 Type: /FontDescriptor
 Referencing: 15 0 R
  <<   
    /Ascent 728
    /CapHeight 716
    /Descent -210 
    /Flags 32
    /FontBBox [ -665 -325 2000 1006 ]
    /FontFile2 15 0 R
    /FontName /ArialMT
    /ItalicAngle 0
    /StemV 87
    /Type /FontDescriptor
    /XHeight 519
  >>   

obj 11 0 
 Type: /FontDescriptor
 Referencing: 16 0 R
  <<   
    /Ascent 728
    /CapHeight 716
    /Descent -210 
    /Flags 262176
    /FontBBox [ -628 -376 2000 1018 ]
    /FontFile2 16 0 R
    /FontName /Arial-BoldMT
    /ItalicAngle 0
    /StemV 165
    /Type /FontDescriptor
    /XHeight 519
  >>   

It tells me that there are two instances of FontFile2 inside the PDF, and these are in PDF objects no. 15 and no. 16, respectively. Object no. 15 holds the /FontFile2 for font /ArialMT, object no. 16 holds the /FontFile2 for font /Arial-BoldMT.

To show this more clearly:

pdf-parser.py -s fontfile big1.pdf | grep -i fontfile
  /FontFile2 15 0 R
  /FontFile2 16 0 R
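
When a PDF has many fonts, the same lookup can be scripted. Here is a minimal Python sketch that reads the text printed by the search command and pairs each font name with the object number of its font stream. It assumes the output layout shown above (one "obj N G" header line per object, followed by the dictionary); the function name `font_file_refs` is made up for illustration, and this is not part of pdf-parser.py itself.

```python
import re


def font_file_refs(parser_output):
    """Return (font_name, key, object_number) for each font descriptor.

    parser_output is the text printed by the search command above.
    """
    refs = []
    # Each object starts with a header line such as "obj 9 0".
    for block in re.split(r"(?m)^obj \d+ \d+[ \t]*$", parser_output):
        ref = re.search(r"/(FontFile[23]?)\s+(\d+)\s+\d+\s+R", block)
        name = re.search(r"/FontName\s+/(\S+)", block)
        if ref and name:
            refs.append((name.group(1), ref.group(1), int(ref.group(2))))
    return refs
```

For the output shown above this gives ArialMT with /FontFile2 in object 15 and Arial-BoldMT with /FontFile2 in object 16.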

A quick peek into the PDF specification reveals that the keyword /FontFile2 relates to a 'stream containing a TrueType font program' (/FontFile would relate to a 'stream containing a Type 1 font program' and /FontFile3 would relate to a 'stream containing a font program whose format is specified by the Subtype entry in the stream dictionary' {hence being either a Type1C or a CIDFontType0C subtype}.)

To look specifically at PDF object no. 15 (which holds the font /ArialMT), one can use the -o 15 parameter:

pdf-parser.py -o 15 big1.pdf

 obj 15 0
  Type: 
  Referencing: 
  Contains stream
   <<
     /Length1 778552
     /Length 1581435
     /Filter /ASCIIHexDecode
   >>

This pdf-parser.py output tells us that this object contains a stream (which it will not directly display) that has a length of 1.581.435 Bytes and is encoded ( == "compressed") with ASCIIHexEncode and needs to be decoded ( == "de-compressed" or "filtered") with the help of the standard /ASCIIHexDecode filter.
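
To illustrate what this filter does: ASCIIHex encoding simply writes every byte as two hexadecimal digits. A minimal Python sketch of the decoding side follows, assuming the rules from the PDF specification (whitespace is ignored, `>` marks the end of the data, and a trailing odd digit is treated as if it were followed by `0`). The function name `ascii_hex_decode` is made up for illustration; pdf-parser.py has its own implementation.

```python
# Whitespace characters as defined by the PDF specification.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def ascii_hex_decode(data):
    """Decode ASCIIHex-encoded bytes into the original binary data."""
    digits = bytearray()
    for byte in data:
        if byte == ord(">"):          # end-of-data marker
            break
        if byte in PDF_WHITESPACE:    # whitespace between digits is ignored
            continue
        digits.append(byte)
    if len(digits) % 2:               # odd number of digits: pad with '0'
        digits.append(ord("0"))
    # bytes.fromhex raises ValueError if a non-hex character is present.
    return bytes.fromhex(digits.decode("ascii"))
```

Since two hex digits become one byte, the decoded data is roughly half the size of the encoded stream.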

To dump any stream from an object, pdf-parser.py can be called with the -d dumpname parameter. Let's do it:

pdf-parser.py -o 15 -d dumped-data.ext big1.pdf

Our extracted data dump will be in the file named dumped-data.ext. Let's see how big it is:

ls -l dumped-data.ext
  -rw-r--r--  1 kurtpfeifle  staff  1581435 Apr 11 00:29 dumped-data.ext

Oh look, it is 1.581.435 Bytes. We saw this figure in the previous command's output. Opening this file with a text editor confirms that its content is ASCII hex encoded data.

Opening the file with a font reading tool like otfinfo (this is a part of the lcdf-typetools package) will lead to some disappointment at first:

otfinfo -i dumped-data.ext
  otfinfo: dumped-data.ext: not an OpenType font (bad magic number)

OK, this is because we did not (yet) let pdf-parser.py make use of its full magic: to dump a filtered, decoded stream. For this we have to add the -f parameter:

pdf-parser.py -o 15 -f -d dumped-data-decoded.ext big1.pdf

What's the size of this new file?

ls -l dumped-data-decoded.ext
  -rw-r--r--  1 kurtpfeifle  staff  778552 Apr 11 00:39 dumped-data-decoded.ext

Oh, look: that exact number was also already stored in the PDF object no. 15 dictionary as the value for key /Length1...

What does file think it is?

file dumped-data-decoded.ext
  dumped-data-decoded.ext: TrueType font data
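
The file utility recognizes the format from the first few bytes of the data. If you do not have it at hand (for example on Windows), the same idea can be sketched in Python. This is a rough heuristic covering only a handful of well-known signatures, not a full validity check; the function name `sniff_font_format` is made up for illustration.

```python
def sniff_font_format(data):
    """Guess a font format from the leading bytes of a dumped stream."""
    signatures = [
        (b"\x00\x01\x00\x00", "TrueType"),
        (b"true", "TrueType"),                      # Apple variant of the tag
        (b"OTTO", "OpenType (CFF outlines)"),
        (b"ttcf", "TrueType collection"),
        (b"wOFF", "WOFF"),
        (b"wOF2", "WOFF2"),
        (b"%!PS-AdobeFont", "Type 1 (PFA, ASCII)"),
        (b"%!FontType1", "Type 1 (PFA, ASCII)"),
        (b"\x80\x01", "Type 1 (PFB, binary)"),
        # Assumption: a bare CFF table usually starts with version 1.0 and a
        # header size of 4; treat this match as a hint only.
        (b"\x01\x00\x04", "possibly bare CFF"),
    ]
    for magic, label in signatures:
        if data.startswith(magic):
            return label
    return "unknown"
```

A dump that is still ASCII hex encoded starts with text digits rather than with one of these signatures, so it comes back as "unknown".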

What does otfinfo tell us about it?

otfinfo -i dumped-data-decoded.ext
  Family:              Arial
  Subfamily:           Regular
  Full name:           Arial
  PostScript name:     ArialMT
  Version:             Version 5.10
  Unique ID:           Monotype:Arial Regular:Version 5.10 (Microsoft)
  Designer:            Monotype Type Drawing Office - Robin Nicholas, Patricia Saunders 1982
  Manufacturer:        The Monotype Corporation
  Trademark:           Arial is a trademark of The Monotype Corporation.
  Copyright:           © 2011 The Monotype Corporation. All Rights Reserved.
  License Description: You may use this font to display and print content as permitted by
                       the license terms for the product in which this font is included.
                       You may only (i) embed this font in content as permitted by the 
                       embedding restrictions included in this font; and (ii) temporarily 
                       download this font to a printer or other output device to help
                       print content.
  Vendor ID:           TMC

So Bingo!, we have a winner: pdf-parser.py did indeed extract a valid font file for us. Given the size of this file (778.552 Bytes), it looks like this font had even been embedded completely in the PDF...

We could rename it to arial-regular.ttf and install it as such and happily make use of it.


Caveats:

  • In any case you need to follow the license that applies to the font. Some font licences do not allow free use and/or distribution. Pirating fonts is like pirating any software or other copyrighted material.

  • Most PDFs which are in the wild out there do not embed the full font anyway, but only subsets. Extracting a subset of a font is only useful in a very limited scope, if at all.

Please do also read the following about Pros and (more) Cons regarding font extraction efforts:

Cochlea answered 15/8, 2010 at 20:30 Comment(30)
@kizzx2: feel free to upvote or downvote any of my other [PDF] or [Ghostscript] answers :-)Cochlea
If you are on Mac and install mupdf from ports (or perhaps from binary too), the extraction too is called mupdfextract. You can run it from terminal, as long as it's in the path.Sweetbrier
@Orwellophile: thanks for the hint. I took it as an opportunity to update some of my hints about mupdf. See also this...Cochlea
I'll check them out. And just so this isn't a pointless comment: Your process worked AWESOMELY... (voted up)... it extracted and named 3 variations of the font, and then I used fontforge (also free from macports) to merge. Unfortunately my font is still missing the capital letter "X"... What are the odds :pSweetbrier
@Orwellophile: Your merge result of the 3 subsets most likely doesn't contain an "X" because none of the 3 subsets contains an "X".Cochlea
Wow. Really. I am in awe. Please, help me keep from gaping at such an awesome answer.. [gape] Thanks!Snore
I couldn't get gs to work (it said this and that file was missing...), and fontforge is a one-by-one thing, very slow. However mupdf did the trick for me. I ran this command (the package is apparently updated): mutool.exe extract c:\path\to\mypdf.pdf - it dumped everything - fonts, images, etc. right out into the folder where the pdf was located. Now, I've got all the fonts!Snore
Brilliantly detailed answer. Just needed to identify (rather than extract) fonts embedded in a PDF and so pdf-parser.py worked perfectly.Step
SumatraPDF v3.1.1 doesn't ship pdfextract.exe.Cloaca
@koppor: You are right. I updated my answer accordingly. You can download MuPDF and mutool here: mupdf.com/downloads.Cochlea
yum install mupdf; mutool extract my.pdf - Couldn't get any simpler.Inject
@JonathonReinhart: Sorry then; I mistook your comment for sarcasm and obviously was wrong. (Not sure though, if yum install mupdf does include mutool -- didn't RedHat/Fedora put it into a different package?)Cochlea
@KurtPfeifle No problem! Nope - on Fedora 22: $ dnf whatprovides '*mutool*' yields mupdf-1.7-2.fc22.x86_64.Inject
For the using pdftops option, here is a way to extract the font information from the ps file: encrypted.pcode.nl/blog/2006/06/19/extracting-fonts-from-pdfs. Basically, extract out the lines from %%BeginResource: font [name] to %%EndResource each to a new file.Imperturbation
I just stopped at stage 2 of the first method: that gave me a pfa file that was recognized as an installable font file by Ubuntu. Thanks a lot!Rust
@KurtPfeifle "Most PDFs which are in the wild out there do not embed the full font anyway, but only subsets. Extracting a subset of a font is only useful in a very limited scope, if at all." is there any documentation how to extract fonts from these subsets or how to generate such subsets?Redo
@juFo: Sorry, I don't know how to better explain the topic of "font extraction from PDFs" than this answer did. Are you sure you understood its content? There is no way to "extract fonts from these subsets". A font subset is a valid font as well -- but it does not contain glyphs for all characters, only glyphs for those characters which were actually used in the document.Cochlea
@KurtPfeifle well i've read comments where they say that bold/italic should always be added as last in embedded (subset) fonts. I'm looking for documentation that specifies these "restrictions". (or is there no such documentation on how to embed or correctly extract embedded subsets? is it just trial-error ?)Redo
@juFo: bold/italic typefaces usually are separate fonts (even though they may belong to the same font family).Cochlea
@KurtPfeifle but in pdf some fonts have the bold/italic subsets in a single file (file when extracted). I'm interested in that, but can't find documentation about this specific case.Redo
@juFo: As I said, and as it does not seem to achieve the right end-point in your mind: full bold/italic typefaces usually are separate fonts. So if they are embedded as font subsets, they are also separate then. Or maybe I completely mis-understand what you are trying to say? If in doubt, just point me to an example PDF, and I'll investigate the fonts embedded therein....Cochlea
After testing all those methods, FontForge was the only successful one.Abreast
@MGR: I can guarantee you that several of the above methods will work, for any PDF file you give me. You must have done something wrong.Cochlea
@KurtPfeifle: In my case the font objects generated using other methods were found to be invalid font files. When using pdfparser I noticed that the length value shown was not correct, it was showing another reference (like 18 0 R). I am using Windows 7 64bit.Abreast
@MGR: So? In this case pdfparser helped you understand the structure of the PDF even though it had faults. Which means, pdfparser could help you extract the invalid (or valid) font file. (Which may mean you have to fix the potentially invalid font with other means....)Cochlea
@KurtPfeifle: I didn't understand the logic. I could get the font file extracted with FontForge and it is valid. Do you mean that FontForge made the font file valid automatically?Abreast
@MGR: Yes, very likely FontForge silently repaired the font file when it opened it. Most likely also, when you saved it from font forge it asked you to confirm changes it made.Cochlea
Extremely useful info thank you. If FontForge reports several fonts such as MSCIYG+Ge'ez-1, UJFBHE+Ge'ez-1, etc, is this several subsets of the same font? Do any of these options automatically merge them? (I am trying manually with FontForge, and failing slowly, considering learning its CLI as I have a lot of PDFs and a lot of fonts to extract, but hoping not to have to..)Sundowner
@Chris: yes, these are two different subsets (which may overlap in a huge part even).There is no option to automatically merge them.Cochlea
For the gs solution, I needed to add the flag -dNOSAFER to avoid an Error: /invalidfileaccess in --file-- message. Otherwise this worked like a charm!Booth
T
31

Use online service http://www.extractpdf.com. No need to install anything.

Trant answered 22/5, 2014 at 11:39 Comment(3)
In my case, it could only extract Type 1 fonts and not TrueTypeCloaca
I have extracted fonts using this site and copied it at ~/.fonts, and the copy and paste was working!Ideate
For me, it didn't find any fonts, even though Acrobat Reader lists the fonts.Fallacious
C
21

Even though this question is 10 years old, it is still valid and as technology changes so does a valid answer.

In searching the current answers I noticed that none of them mention WOFF (Web Open Font Format) (W3C) (Wikipedia), which can be used to recreate the individual characters (glyphs) and display them in a web page accurately.

Using the free online web page by IDR Solutions, PDF to HTML5 (link), convert a PDF to a zip file. In the resulting zip will be a font directory of woff file types. Current Internet browsers support woff files if you were not aware. (reference) These can be examined at the online site FontDrop! (link).

WOFF files can be converted to/from OTF or TTF at WOFFer – WOFF font converter

Also the zip file from PDF to HTML5 will contain an HTML file for each page of the PDF that can be opened in an Internet browser and is one of the best and most accurate PDF translations I have found or seen.

Criminology answered 31/12, 2019 at 22:51 Comment(9)
Thank you! This solution works for me (as in creating a valid TTF) whereas the other ones I've tried don't. Is it because WOFF handles incomplete fonts better?Cullet
@Cullet Is it because WOFF handles incomplete fonts better? I have no idea. Your guess would be as good as mine. As I noted I am just learning about WOFF myself.Criminology
@Cullet Perhaps you should post Is it because WOFF handles incomplete fonts better? as a new SO question and others with more knowledge will see and hopefully provide a meaningful answer.Criminology
I might do that. Thanks.Cullet
FYI Adobe will disable support for authoring with Type 1 fonts in January 2023Criminology
For an example of woff2 see JetBrains MonoCriminology
This just converted the vector PDF to JPG images.Fallacious
Hi @GuyCoder how can I convert the generated HTML files into a word/text file?Paresthesia
@Paresthesia how can I convert the generated HTML files into a word/text file? That is a new question, with a good answer for converting PDF directly to words. See: GhostScript?Criminology
G
8

Eventually found the FontForge Windows installer package and opened the PDF through the installed program. Worked a treat, so happy.

Gangrel answered 20/3, 2012 at 18:30 Comment(2)
The latest page can be found here: fontforgebuilds.sourceforge.netCrusty
For me, it didn't find any fonts, even though Acrobat Reader lists the fonts.Fallacious
H
6

http://www.verypdf.com/app/pdf-font-extractor/pdf-font-extracting-tool.html IMO easiest way to extract fonts (Windows).

Hettie answered 17/2, 2014 at 10:27 Comment(0)
L
4

PDF2SVG version 6.0 from PDFTron does a reasonable job. It produces OpenType (.otf) fonts by default. Use --preserve_fontnames to preserve "the font/font-family naming scheme as obtained from the source file."

PDF2SVG is a commercial product, but you can download a free demo executable (which includes watermarks on the SVG output but doesn't otherwise restrict usage). There may be other PDFTron products that also extract fonts, but I only recently discovered PDF2SVG myself.

Luciennelucier answered 26/12, 2013 at 12:5 Comment(1)
Unfortunately --preserve_fontnames doesn't work if you have overlapping, partial fonts - it seems not to include the prefix, eg, the MSCIYG in MSCIYG+Ge'ez-1, so overwrites prior partials.Sundowner
A
3

One of the best online tools currently available to extract pdf fonts is http://www.pdfconvertonline.com/extract-pdf-fonts-online.html

Alford answered 12/5, 2016 at 14:49 Comment(1)
For me, it didn't find any fonts, even though Acrobat Reader lists the fonts.Fallacious
B
0

This is a followup to the font-forge section of @Kurt Pfeifle's answer, specific to Red Hat (and possibly other Linux distros).

  1. After opening the PDF and selecting the font you want, you will want to select "File -> Generate Fonts..." option.
  2. If there are errors in the file, you can choose to ignore them or save the file and edit them. Most of the errors can be fixed automatically if you click "Fix" enough times.
  3. Click "Element -> Font Info...", and "Fontname", "Family Name" and "Name for Humans" are all set to values you like. If not, modify them and save the file somewhere. These names will determine how your font appears on the system.
  4. Select your file name and click "Save..."

Once you have your TTF file, you can install it on your system by

  1. Copying it to folder /usr/share/fonts (as root)
  2. Running fc-cache -f /usr/share/fonts/ (as root)
Bigley answered 19/3, 2019 at 21:52 Comment(0)
