/Differences dictionary for encode parsing issue in PDF
Type1 font /Differences encoding uses strings in mapping of values for example 1 character is encoded to 'one'. It is used for numbers and special characters only.

What is the standard way to use these encodings?

How should I decode string from PDF which uses such encoding?

Link for the file: http://www.filedropper.com/open

Sharl answered 18/5, 2015 at 10:30 Comment(1)
Basically, the above document was produced with pdfLaTeX (tex.stackexchange.com/questions/33476/…), so all combinations such as fi, fl, ffl, etc. are mapped to their corresponding glyphs. How can I figure out whether a PDF uses such an encoding? - Sharl

Here's the /Differences array in your file (and honestly, you should have just posted this rather than a link to a skeevy download page):

/Differences [
    24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
    39 /quotesingle
    96 /grave
    128 /bullet/dagger/daggerdbl/ellipsis...
]

The way this works is that the font also has an encoding associated with it (for example /MacRomanEncoding or /WinAnsiEncoding). In the case of a Type 1 font, there is an encoding built into the font. Given a copy of that encoding, you apply the differences to it: starting from each number (your first is 24), you replace consecutive entries with the names that follow, so entries 24-31 inclusive become /breve, /caron, /circumflex and so on.

In Type 1 fonts, there is a dictionary called /CharStrings, which is an association of the name of a glyph with the data/code that will render it. If, for example, you get a character with code 26, you look it up in your encoding array (which should be a 256-element array for Type 1 fonts) and, with the differences applied, you get the name /circumflex. You then look that up in the CharStrings dictionary, pull out the glyph data and render it. Any character that does not exist in the encoding should be set to /.notdef, which will then render a shape representing an undefined character (usually an empty box).
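As a rough illustration of the "apply the differences" step, here is a minimal Python sketch. It assumes the /Differences array has already been parsed into a flat list of integers and name strings, and it uses an all-/.notdef table as a stand-in for the real base encoding (which would come from the font program or a predefined encoding). The function name `apply_differences` is made up for this example.

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding with a /Differences list applied.

    differences is a flat list mixing integers (start codes) and
    glyph-name strings, in the order they appear in the PDF array.
    """
    encoding = list(base_encoding)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item            # a number resets the current code
        else:
            if code is None:
                raise ValueError("glyph name before any start code")
            encoding[code] = item  # a name replaces the entry at that code
            code += 1              # following names take consecutive codes
    return encoding


# Stand-in base encoding (assumption): every code starts as /.notdef.
base = ["/.notdef"] * 256

# The complete entries of the array shown above; the elided tail is left out.
differences = [
    24, "/breve", "/caron", "/circumflex", "/dotaccent",
        "/hungarumlaut", "/ogonek", "/ring", "/tilde",
    39, "/quotesingle",
    96, "/grave",
]

encoding = apply_differences(base, differences)
print(encoding[26])
```

With this input, code 26 resolves to /circumflex, matching the lookup described above.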

Now, your problem is likely: how do I turn these glyph names into something more useful, like, say, Unicode?

If you look in Annex D of the PDF specification, you'll see a set of tables that define the character sets for standard Latin encodings. You would make a lookup table that maps Adobe standard names to Unicode. Unfortunately, the tables in Annex D are incomplete. Fortunately, Adobe has a file that defines all of that for you here. There is a link in that file which is now dead, but most likely it was meant to go here.
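Such a lookup table could look roughly like the following Python sketch. The table here is a small hand-copied subset for illustration, not the full Adobe file, and the names `AGL_SUBSET`, `glyph_name_to_unicode` and `to_plain_text` are made up for this example. The NFKC step is an optional extra, not something the PDF specification asks for: it decomposes ligature code points such as U+FB01 into plain letters, which can help when the goal is searchable text.

```python
import unicodedata

# Hand-copied subset of glyph name -> Unicode mappings (illustration only;
# a real implementation would load the complete list from Adobe's file).
AGL_SUBSET = {
    "breve": "\u02D8",
    "caron": "\u02C7",
    "circumflex": "\u02C6",
    "quotesingle": "\u0027",
    "grave": "\u0060",
    "bullet": "\u2022",
    "dagger": "\u2020",
    "daggerdbl": "\u2021",
    "ellipsis": "\u2026",
    "one": "\u0031",
    "fi": "\uFB01",
    "fl": "\uFB02",
}


def glyph_name_to_unicode(name, table=AGL_SUBSET):
    """Map a PDF glyph name such as '/fi' to a Unicode string, or None."""
    return table.get(name.lstrip("/"))


def to_plain_text(text):
    """Optionally fold compatibility characters (e.g. ligatures) to letters."""
    return unicodedata.normalize("NFKC", text)


ligature = glyph_name_to_unicode("/fi")   # U+FB01 LATIN SMALL LIGATURE FI
word = to_plain_text(ligature + "le")     # ligature folded to "f" + "i"
print(word)
```

Whether to keep the ligature code point or fold it to separate letters depends on what the extracted text is for; the glyph name itself carries no such preference.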

Dido answered 18/5, 2015 at 13:39 Comment(4)
"Here's the /Differences array in your file": actually, there are multiple Differences arrays in the file. The one you show is merely the only one not packed into an object stream. - Lanta
I am facing another problem with this. I want to keep the '/fi' name as it is, because that mapping is used to represent the word 'file' in the above PDF, but when I use Adobe's standard glyph list it is converted to the Unicode value given in that mapping. I only want to use the mapping for numeric, punctuation, and non-ASCII characters. How do I decide when to use the standard mappings provided by Adobe and when not to? - Sharl
If a standard encoding doesn't do what you want, create a /Differences with characters that you do want and track them in your code. If you want the fi ligature, put it somewhere that is convenient. - Dido
Basically, this PDF was generated by LaTeX and uses the above encoding. How can I figure out whether a PDF uses such an encoding? - Sharl

How should I decode string from PDF which uses such encoding?

As the specification explains:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods:

  • If the font dictionary contains a ToUnicode CMap, use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font:

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List to obtain the corresponding Unicode value.

  • If the font is a composite font ... (not applicable in your case)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(ISO 32000-1)

First of all, therefore, you should look for a ToUnicode map.

If there is none (as in the case of your sample document), use the Encoding (predefined or differences).

And if your code is not mapped to something proper in the encoding, then, according to the spec, there is no way to determine what the character code represents!

If the font in question is embedded, you might yet have a way out by parsing the embedded font program which may include its own mapping to Unicode.

Otherwise, though, this is where you can start guessing (or delegate to OCR).
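The priority order for a simple font can be sketched in Python as follows. This is only an outline under several assumptions: the ToUnicode CMap and the encoding are taken to be already parsed into plain dictionaries keyed by character code, the glyph list is a dictionary keyed by glyph name without the leading slash, and falling through to the encoding when the ToUnicode map lacks a code is a choice made for this example. The function name `code_to_unicode` is made up.

```python
def code_to_unicode(code, to_unicode=None, encoding=None, glyph_list=None):
    """Map a character code of a simple font to a Unicode string, or None.

    to_unicode: dict code -> str, parsed from a ToUnicode CMap (if any)
    encoding:   dict code -> glyph name such as "/fi" (base + Differences)
    glyph_list: dict glyph name without "/" -> str (e.g. Adobe Glyph List)
    """
    # 1. A ToUnicode CMap has the highest priority.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]

    # 2. Otherwise go through the encoding and the glyph list.
    if encoding is not None and glyph_list is not None:
        name = encoding.get(code)
        if name is not None:
            return glyph_list.get(name.lstrip("/"))

    # 3. Nothing worked: the caller has to guess, parse the embedded
    #    font program, or fall back to OCR.
    return None


# Tiny hypothetical inputs, just to exercise the three branches.
encoding = {2: "/fi", 49: "/one"}
glyph_list = {"fi": "\uFB01", "one": "1"}

via_encoding = code_to_unicode(49, None, encoding, glyph_list)
via_cmap = code_to_unicode(2, {2: "fi"}, encoding, glyph_list)
unmapped = code_to_unicode(200, None, encoding, glyph_list)
print(via_encoding, via_cmap, unmapped)
```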


But your assumption

It is used for numbers and special characters only.

is already wrong. If you look at your sample document, the two fonts F25 and F26 used on its first page, for example, have a Differences array like this:

0 /.notdef 1 /dotaccent /fi /fl /fraction /hungarumlaut /Lslash /lslash /ogonek /ring 10 /.notdef 11 /breve /minus 13 /.notdef 14 /Zcaron /zcaron /caron /dotlessi /dotlessj /ff /ffi /ffl 22 /.notdef 30 /grave /quotesingle /space /exclam /quotedbl /numbersign /dollar /percent /ampersand /quoteright /parenleft /parenright /asterisk /plus /comma /hyphen /period /slash /zero /one /two /three /four /five /six /seven /eight /nine /colon /semicolon /less /equal /greater /question /at /A /B /C /D /E /F /G /H /I /J /K /L /M /N /O /P /Q /R /S /T /U /V /W /X /Y /Z /bracketleft /backslash /bracketright /asciicircum /underscore /quoteleft /a /b /c /d /e /f /g /h /i /j /k /l /m /n /o /p /q /r /s /t /u /v /w /x /y /z /braceleft /bar /braceright /asciitilde 127 /.notdef 130 /quotesinglbase /florin /quotedblbase /ellipsis /dagger /daggerdbl /circumflex /perthousand /Scaron /guilsinglleft /OE 141 /.notdef 147 /quotedblleft /quotedblright /bullet /endash /emdash /tilde /trademark /scaron /guilsinglright /oe 157 /.notdef 159 /Ydieresis 160 /.notdef 161 /exclamdown /cent /sterling /currency /yen /brokenbar /section /dieresis /copyright /ordfeminine /guillemotleft /logicalnot /hyphen /registered /macron /degree /plusminus /twosuperior /threesuperior /acute /mu /paragraph /periodcentered /cedilla /onesuperior /ordmasculine /guillemotright /onequarter /onehalf /threequarters /questiondown /Agrave /Aacute /Acircumflex /Atilde /Adieresis /Aring /AE /Ccedilla /Egrave /Eacute /Ecircumflex /Edieresis /Igrave /Iacute /Icircumflex /Idieresis /Eth /Ntilde /Ograve /Oacute /Ocircumflex /Otilde /Odieresis /multiply /Oslash /Ugrave /Uacute /Ucircumflex /Udieresis /Yacute /Thorn /germandbls /agrave /aacute /acircumflex /atilde /adieresis /aring /ae /ccedilla /egrave /eacute /ecircumflex /edieresis /igrave /iacute /icircumflex /idieresis /eth /ntilde /ograve /oacute /ocircumflex /otilde /odieresis /divide /oslash /ugrave /uacute /ucircumflex /udieresis /yacute /thorn /ydieresis

which contains mappings for normal uppercase /A../Z and lowercase /a../z characters, too.
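To check what a given array covers, it can be turned into a code-to-name table. The following Python sketch works on the text between the square brackets and uses only the beginning of the array above as sample input. It is a simplification: it assumes plain names without `#xx` escapes or other delimiters, and `parse_differences` is a name made up for this example.

```python
import re


def parse_differences(text):
    """Parse the body of a /Differences array into a dict code -> name."""
    mapping = {}
    code = None
    # A token is either a name ("/" plus non-delimiter characters)
    # or a run of digits giving the next start code.
    for token in re.findall(r"/[^\s/\[\]]+|\d+", text):
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name before any start code")
            mapping[code] = token
            code += 1          # consecutive names get consecutive codes
        else:
            code = int(token)
    return mapping


# Beginning of the F25/F26 array quoted above.
sample = (
    "0 /.notdef 1 /dotaccent /fi /fl /fraction /hungarumlaut /Lslash "
    "/lslash /ogonek /ring 10 /.notdef 11 /breve /minus 13 /.notdef "
    "14 /Zcaron /zcaron /caron /dotlessi /dotlessj /ff /ffi /ffl "
    "22 /.notdef 30 /grave /quotesingle /space"
)

mapping = parse_differences(sample)
print(mapping[2], mapping[3], mapping[19], mapping[20], mapping[21])
```

In this array the ligatures sit at codes 2, 3, 19, 20 and 21, which is why they appear as low control-range bytes in the content stream strings.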


By the way,

Type1 font /Differences encoding uses strings in mapping of values for example 1 character is encoded to 'one'.

is not strictly correct: the '/' characters are part of the respective mapped value, e.g. /one, and as PDF objects these are not Strings but Names.

Lanta answered 18/5, 2015 at 13:19 Comment(0)