PHP Filter FlateDecode PDF stream returning offset characters
I have code that extracts text from a PDF using a filetotext class. It worked until last week, when something changed in the PDFs being generated. The weird thing is that the characters appear to be there and correct once I add 29 to the ord of each character.

Example response debug printout:

/F1 7.31 Tf
0 0 0 rg
1 0 0 1 195.16 597.4 Tm
($PRXQW)Tj
ET
BT

The code uses gzuncompress on the stream section of the PDF. The $PRXQW is Amount, and adding 29 (decimal) to the ord of each character gives me that. But sometimes a character does not follow this exact translation; for example, what should be a ) in the text appears as the two bytes 5C66.
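As an aside, the two bytes 5C66 are consistent with the +29 pattern: 0x5C 0x66 is the PDF literal-string escape sequence `\f` for form feed (0x0C), and 0x0C + 29 = 0x29, which is `)`. So a decoder has to undo PDF string escapes before applying the offset. Here is a minimal sketch (in Python rather than the OP's PHP, since only the logic matters); the +29 offset is specific to this particular font's ad hoc encoding, not a general rule:

```python
# PDF literal-string escape sequences (ISO 32000-1, Table 3).
PDF_ESCAPES = {b'n': 0x0A, b'r': 0x0D, b't': 0x09, b'b': 0x08,
               b'f': 0x0C, b'(': 0x28, b')': 0x29, b'\\': 0x5C}

def decode_tj_string(raw: bytes, offset: int = 29) -> str:
    """Undo PDF string escapes, then shift each code by the observed offset."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b == 0x5C and i + 1 < len(raw):        # backslash starts an escape
            nxt = raw[i + 1:i + 2]
            if nxt in PDF_ESCAPES:
                out.append(PDF_ESCAPES[nxt] + offset)
                i += 2
                continue
        out.append(b + offset)
        i += 1
    return bytes(out).decode('latin-1')

print(decode_tj_string(b'$PRXQW'))   # -> Amount
print(decode_tj_string(b'\\f'))      # -> )
```

This only explains the observed bytes; as the answers below point out, the offset itself comes from the font's encoding and cannot be assumed.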

Just wondering about this secret-decoder-ring kind of encoding coming out of PDFs now, and whether anyone has seen this kind of thing?

Quicken answered 13/8, 2015 at 23:33 Comment(2)
Some sort of subsetting or custom encoding has been introduced with the font referred to by /F1. The definition of that font in the PDF may shed some light, or look at the font settings at PDF generation.Eakins
@dwarring, thank you, this helped me move forward.Quicken
The encoding of the string argument of the Tj operation depends entirely on the PDF font used (F1 in the case at hand):

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".

(section 9.4.3 "Text-Showing Operators" in ISO 32000-1)

The OP's code seems to assume a standard encoding like MacRomanEncoding or WinAnsiEncoding, but these are merely special cases. As the quote above indicates, the encoding might just as well be some ad hoc, mixed multi-byte encoding.

The PDF specification in a later section describes how to properly extract text:

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(section 9.10.2 "Mapping Character Codes to Unicode Values" in ISO 32000-1)
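The simple-font path (the second bullet above, steps a and b) can be illustrated concretely. This is a hypothetical sketch: the tiny glyph-list excerpt and the sample /Differences array are made up for demonstration, not taken from the OP's file:

```python
# Tiny excerpt standing in for the Adobe Glyph List (glyph name -> Unicode).
AGL_EXCERPT = {'A': 'A', 'space': ' ', 'quotedblleft': '\u201C'}

def apply_differences(base_map, differences):
    """Expand a PDF /Differences array [code name name ... code name ...]
    into a code -> glyph-name dict layered over a base encoding.
    An integer sets the next code; each following name is assigned to
    consecutive codes."""
    enc = dict(base_map)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            enc[code] = item
            code += 1
    return enc

base = {65: 'A', 32: 'space'}                       # step (a): base encoding
enc = apply_differences(base, [140, 'quotedblleft'])  # overlaid /Differences
print(AGL_EXCERPT[enc[140]])                        # step (b): name -> Unicode
```

In a real extractor the base map comes from the font's declared encoding (Table D.1) and the full Adobe Glyph List replaces the excerpt.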

Thus:

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?

Yes, it is fairly common for PDFs in the wild to have text-showing-operator string arguments in an encoding entirely different from anything ASCII-like. And as the last paragraph of the second quote above hints, there are situations that do not allow text extraction at all (without OCR, that is), even though there are additional places one can look for the mapping to Unicode.

Marlysmarmaduke answered 14/8, 2015 at 7:52 Comment(4)
Thank you for the detailed explanation. After the comment on my original post I got hold of the original document that worked and discovered, via document properties, that the encoding in the Font section differed. The working file was 'Ansi' and the new files now have 'Identity-H'.Quicken
I will dig for correct mapping, or end up using what I have deciphered so far. It does look like the new files now are using two bytes per character, most likely from this 'Identity-H' encoding.Quicken
Right, Identity-H is a two byte encoding. It maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, interpreted high-order byte first.Marlysmarmaduke
All of this pointed me in the right direction. I found the mapping data in the PDF. This particular file has 5 of them, and putting them all together seems to give me what I need. My plus-29-decimal mapping is confirmed by these mapping tables. :)Quicken
What you're seeking, to decode the mystery string in the most general case, is the /Encoding field of the selected font, in your case /F1. More than likely the encoding scheme is /Identity-H, which can carry an arbitrary mapping of 16-bit character codes in PDF strings onto UTF-16 characters.

Here is an example from the PDF parser I'm writing. Each page contains a dictionary of resources, which contains a dictionary of fonts:

[&3|0] => Array [
   [/Type] => |/Page|
   [/Resources] => Array [
      [/Font] => Array [
         [/F1] => |&5|0|
         [/F2] => |&7|0|
         [/F3] => |&9|0|
         [/F4] => |&14|0|
         [/F5] => |&16|0|
      ]
   ]
   [/Contents] => |&4|0|
]

In my case, /F3 was producing unusable text, so looking at /F3:

[&9|0] => Array [
    [/Type] => |/Font|
    [/Subtype] => |/Type0|
    [/BaseFont] => |/Arial|
    [/Encoding] => |/Identity-H|
    [/DescendantFonts] => |&10|0|
    [/ToUnicode] => |&96|0|
]

Here you can see the /Encoding type is /Identity-H. The character-code-to-Unicode mapping used by /F3 is stored in the stream referenced by /ToUnicode. Here is the relevant text from the stream referenced by '&96|0' (96 0 R); the rest is boilerplate and can be ignored:

...
beginbfchar
<0003> <0020>
<000F> <002C>
<0015> <0032>
<001B> <0038>
<002C> <0049>
<003A> <0057>
endbfchar
...
beginbfrange
<0044> <0045> <0061>
<0047> <004C> <0064>
<004F> <0053> <006C>
<0055> <0059> <0072>
endbfrange
...
beginbfchar
<005C> <0079>
<00B1> <2013>
<00B6> <2019>
endbfchar
...

The 16-bit pairs between beginbfchar/endbfchar are mappings of individual characters. For example <0003> (0x0003) is mapped onto <0020> (0x0020), which is the space character.

The 16-bit triplets between beginbfrange/endbfrange are mappings of ranges of characters. For example, characters from <0055> (first) to <0059> (last) are mapped onto <0072>, <0073>, <0074>, <0075> and <0076> ('r' through 'v' in UTF-16 and ASCII).
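Putting the pieces together, a decoder can parse those bfchar/bfrange sections and then split the Identity-H string into big-endian 2-byte codes. A minimal Python sketch (handling only the simple forms shown above, with 2-byte hex codes and single destination values; real CMaps also allow array destinations):

```python
import re

def parse_tounicode(cmap_text: str) -> dict:
    """Collect code -> Unicode mappings from beginbfchar/endbfchar and
    beginbfrange/endbfrange sections of a ToUnicode CMap."""
    mapping = {}
    for block in re.findall(r'beginbfchar(.*?)endbfchar', cmap_text, re.S):
        for src, dst in re.findall(r'<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>', block):
            mapping[int(src, 16)] = chr(int(dst, 16))
    for block in re.findall(r'beginbfrange(.*?)endbfrange', cmap_text, re.S):
        for lo, hi, dst in re.findall(
                r'<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>', block):
            lo, hi, dst = int(lo, 16), int(hi, 16), int(dst, 16)
            for code in range(lo, hi + 1):           # consecutive destinations
                mapping[code] = chr(dst + (code - lo))
    return mapping

def decode_identity_h(raw: bytes, mapping: dict) -> str:
    """Split an Identity-H string into big-endian 2-byte codes, map each one;
    unmapped codes become U+FFFD."""
    return ''.join(mapping.get(int.from_bytes(raw[i:i + 2], 'big'), '\uFFFD')
                   for i in range(0, len(raw), 2))

# Demo using entries taken from the CMap excerpt above.
cmap_excerpt = '''beginbfchar
<0003> <0020>
<002C> <0049>
endbfchar
beginbfrange
<0044> <0045> <0061>
endbfrange'''
m = parse_tounicode(cmap_excerpt)
print(decode_identity_h(b'\x00\x2C\x00\x03\x00\x44\x00\x45', m))  # -> I ab
```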

Astrophysics answered 20/3, 2016 at 7:11 Comment(0)
