How to add Unicode mappings to a Type0 font in PDFBox 2.0.0?

I've been using PDFBox 2.0.0 in a Java project to convert PDFs to text.

Several of my PDFs are missing the ToUnicode entry, so the text comes out as gibberish when I extract it.

2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - No Unicode mapping for 694 (30) in font MPBAAA+F1

In the warning above, a gibberish character (code 30) was output instead of the real character.

I was able to work around it by editing the additional.txt file in PDFBox, since by trial and error I worked out that the character's code (694 in this case) represents a certain Hebrew letter (צ).

Here's a short excerpt of what I added to the file:

-694;05E6 #HexaDecimal value for the letter צ
-695;05E7
-696;05E8
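To double-check the hexadecimal values before putting them into a mapping file, a small sketch like the following prints the character and Unicode name for each code point. The class name `HexToChar` and the hard-coded list are purely illustrative and not part of PDFBox.

```java
final class HexToChar {
    /** Converts a hex code point string such as "05E6" into the character it denotes. */
    static String toText(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // The three values from the excerpt above.
        for (String hex : new String[] {"05E6", "05E7", "05E8"}) {
            int codePoint = Integer.parseInt(hex, 16);
            System.out.println(hex + " -> " + toText(hex) + " (" + Character.getName(codePoint) + ")");
        }
    }
}
```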

Later I encountered almost the same warning with a different PDF, but instead of gibberish characters I got no characters at all. A more detailed explanation of this issue can be found in the question "pdf reading via pdfbox in java".

2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold

As you can see, this warning comes from a different class (PDType0Font) than the first one (PDSimpleFont), but the code (694) is the same in both, and both refer to the same character.
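When many such warnings appear, it can help to collect the affected codes and font names from the log before building any mapping. The following is a minimal sketch that assumes the log lines look exactly like the two quoted above; the class name `UnmappedCodeParser` is hypothetical, and the regular expression is only inferred from those two lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnmappedCodeParser {
    // Matches both shapes quoted above: "for 694 (30)" and "for CID+694 (694)".
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for (?:CID\\+)?(\\S+) \\((\\d+)\\) in font (.+)$");

    /**
     * Returns {name or CID, code in parentheses, font name},
     * or null if the line is not a "No Unicode mapping" warning.
     */
    static String[] parse(String logLine) {
        Matcher m = WARNING.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3).trim()};
    }
}
```

Grouping the parsed results by font name gives the list of codes that each font's mapping has to cover.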

Is there a different file, other than additional.txt, that I should edit to point the code 694 (the Hebrew letter צ) to its correct Unicode value?

Thanks

[Screenshot: main root]

[Screenshot: drill down on the first Type0 font]

Anta answered 14/9, 2016 at 8:34 Comment(8)
Please share the PDF. - Critique
I wish I could; unfortunately it's classified. I'll try to come up with an equivalent document that has the same error. - Anta
Alternatively / additionally, you could open the file with PDFDebugger and make a screenshot of the font part. It will be below Root/Pages/Kids/[0]/Kids/[0]/Resources/Font. It would be interesting to see "Encoding" and "ToUnicode". - Critique
Hi Tilman, I've added the screenshots to the original message. Thank you for your help! - Anta
Those that have WinAnsiEncoding should be OK. But the others are missing "ToUnicode". In the PDF 32000 specification adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/…, read the section "9.10 Extraction of Text Content". - Critique
Hi Tilman, I've read the document you sent, and I understand that a font needs a ToUnicode mapping to translate its glyphs into Unicode values. The thing is, the CID codes are constant (for example, 694 is the Hebrew letter צ), but I don't know where to add these mappings, unlike with the simple fonts, where I just had to update the "additional.txt" file. Do you know by any chance how I can apply the exact same mappings to this problematic font? - Anta
I wouldn't touch the "additional.txt" file. What you'd need to do is add a ToUnicode stream, i.e. you'd have to walk down the structures until you get to the "bad" font(s) and then insert a COSStream into the font dictionary under the "ToUnicode" key. This COSStream must be filled with a ToUnicode CMap as described in the PDF specification. Not only is this a lot of work (you need to find the mapping for every code and create a proper table), but the work you do might be good for only one single PDF, unless all your PDFs come from the same creator, who uses the same font, etc. - Critique
Hi Tilman. Thank you for your detailed answer. All of my PDFs come from the same creator, and I already have the correct mapping (the one I added to "additional.txt" for my other problematic font), so most of the job is done. I wasn't able to find out how to do what you suggested. Can you please explain how to insert the COSStream into the font dictionary with the ToUnicode key, or point me to a page in the PDF specification? The font names are Tahoma and Tahoma,Bold. - Anta

Here's some code to add a ToUnicode CMap stream to a font. Obviously I can't do it with your file, so I used one of my test files, which can be found here. I had to work out each entry separately and didn't map all of them. However, the result is good enough to extract the first word in the green print ("Bedingungen").

The scenario is somewhat tailored to you:

  • Identity-H entry
  • no ToUnicode entry
  • specific font name

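    // Assumes f is the java.io.File of the PDF to patch. Needs java.io.PrintWriter,
    // org.apache.pdfbox.cos.{COSBase, COSName, COSStream},
    // org.apache.pdfbox.pdmodel.{PDDocument, PDPage, PDResources}
    // and org.apache.pdfbox.pdmodel.font.PDFont.
    try (PDDocument doc = PDDocument.load(f))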
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // get real name
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println ("/CIDInit /ProcSet findresource begin\n" +
                            "12 dict begin\n" +
                            "begincmap\n" +
                            "/CIDSystemInfo\n" +
                            "<< /Registry (Adobe)\n" +
                            "/Ordering (UCS) /Supplement 0 >> def\n" +
                            "/CMapName /Adobe-Identity-UCS def\n" +
                            "/CMapType 2 def\n" +
                            "1 begincodespacerange\n" +
                            "<0000> <FFFF>\n" +
                            "endcodespacerange\n" +
                            "10 beginbfchar\n" + // number is count of entries
                            "<0001><0020>\n" + // space
                            "<0002><0041>\n" + // A
                            "<0003><0042>\n" + // B
                            "<0004><0044>\n" + // D
                            "<0013><0065>\n" + // e
                            "<0012><0064>\n" + // d
                            "<0017><0069>\n" + // i
                            "<001B><006E>\n" + // n
                            "<0015><0067>\n" + // g
                            "<0020><0075>\n" + // u
                            "endbfchar\n" +
                            "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
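
If the code-to-Unicode mapping is already known (as in the question, where 694 is said to map to U+05E6), the CMap text does not have to be typed by hand. The following is a minimal sketch, not part of PDFBox, that generates the same structure as the literal string above from a map. The class name `ToUnicodeCMapBuilder` is hypothetical. It assumes Identity-H encoding (so the two-byte character code equals the CID) and only handles values up to U+FFFF. Entries are split into blocks of at most 100, which is the per-block limit I recall from Adobe's CMap specification; treat that limit as an assumption to verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToUnicodeCMapBuilder {
    // Assumed upper bound on entries per beginbfchar block.
    private static final int MAX_ENTRIES_PER_BLOCK = 100;

    /** Builds ToUnicode CMap text from a map of 2-byte character code to BMP code point. */
    static String build(Map<Integer, Integer> codeToUnicode) {
        // Sort by code so the output is stable and easy to inspect.
        List<Map.Entry<Integer, Integer>> entries =
                new ArrayList<>(new TreeMap<>(codeToUnicode).entrySet());
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo\n")
          .append("<< /Registry (Adobe)\n")
          .append("/Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");
        for (int start = 0; start < entries.size(); start += MAX_ENTRIES_PER_BLOCK) {
            int end = Math.min(start + MAX_ENTRIES_PER_BLOCK, entries.size());
            sb.append(end - start).append(" beginbfchar\n"); // count of entries in this block
            for (Map.Entry<Integer, Integer> e : entries.subList(start, end)) {
                requireTwoBytes(e.getKey());
                requireTwoBytes(e.getValue());
                sb.append(String.format("<%04X><%04X>\n", e.getKey(), e.getValue()));
            }
            sb.append("endbfchar\n");
        }
        sb.append("endcmap CMapName currentdict /CMap defineresource pop end end\n");
        return sb.toString();
    }

    private static void requireTwoBytes(int value) {
        if (value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("Out of two-byte range: " + value);
        }
    }
}
```

In the loop above, `pw.print(ToUnicodeCMapBuilder.build(mapping))` could then replace the hand-written `pw.println(...)` call, with `mapping` holding entries such as 694 to 0x05E6.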
    

Btw the unreleased 2.1 version of PDFDebugger has some improved features to show fonts, you can get it here:

You can use it to verify that your ToUnicode CMap makes sense. Here's what I get with my changes: [screenshot]

Critique answered 22/9, 2016 at 17:4 Comment(0)
