Regarding No Unicode mapping error while parsing pdf
I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files).

Current parsing outcome:

  1. Tika silently returns text that is missing a lot of the needed data.
  2. Using PDFBox directly gives a bunch of warnings (see below) and also strips the data it couldn't recognize.
  3. Adobe Acrobat Reader ("save as text" action) keeps the original document structure, but in place of the problematic fonts it outputs "􀀅􀀆􀀇􀀈􀀆􀀃".

All the PDFBox warnings I've seen so far, combined:

Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+51 (51) in font AUDQZE+OpenSans-Identity-H

Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font HCUDUN+DroidSerif-Identity-H

Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font AUDQZE+OpenSans-Identity-H

Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+55 (55) in font GFEIIG+OpenSans

Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font GFEIIF+DroidSerif

Aug 06, 2020 3:10:50 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font GFEIIG+OpenSans

Ideally I'd like to use Tika, since I expect to handle Word and HTML formats as well.

Question: Can I ask Tika (or PDFBox) to fall back to a different character mapping, such as ASCII, so that those problematic fonts can be decoded with an alternative to the Unicode table?

Here (in the PDFBox FAQ) I found the following:

How come I am getting gibberish (G38G43G36G51G5) when extracting text? This is because the characters in a PDF document can use a custom encoding instead of unicode or ASCII. When you see gibberish text then it probably means that a meaningless internal encoding is being used. The only way to access the text is to use OCR. This may be a future enhancement.

^ Does that mean that PDFBox already checks the ASCII charset for a font under the hood, and I don't have to worry about ASCII at all?

Dumb question: can I somehow provide the missing Unicode mapping to PDFBox, or via Tika? Say I have a limited number of known failing Unicode mappings; can I pass that information to the parser somehow?

My case: I already have a bunch of documents (so no control over their creation) that I'd like to run searches against, so I thought I'd create an index of them using Tika. I presume most of the PDF files will have the non-embedded-font issue. So I've thought about using Tika's Metadata.get(UNMAPPED_UNICODE_CHARS_PER_PAGE) property after a NO_OCR pass to decide whether a file requires OCR parsing by Tika (pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY)). OCR is about 10x slower and less precise (it also reads images such as logos and tries to make sense of them, which isn't needed and just adds noise), but I don't see any other options.
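The NO_OCR-then-OCR fallback described above could be sketched like this. This is an untested sketch: it assumes Tika 1.24+ (where the per-page unmapped-unicode metadata exists) with the tika-parsers module on the classpath and Tesseract installed for the OCR pass; parseWithFallback and hasUnmappedChars are names made up here:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.PDF;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// True if any per-page counter of unmapped unicode chars is non-zero.
static boolean hasUnmappedChars(String[] perPageCounters) {
    return Arrays.stream(perPageCounters)
            .mapToInt(Integer::parseInt)
            .anyMatch(n -> n > 0);
}

static String parseWithFallback(Path pdf) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();

    // Pass 1: plain text extraction, no OCR.
    PDFParserConfig noOcr = new PDFParserConfig();
    noOcr.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    ParseContext ctx = new ParseContext();
    ctx.set(PDFParserConfig.class, noOcr);

    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
    try (InputStream in = Files.newInputStream(pdf)) {
        parser.parse(in, handler, metadata, ctx);
    }

    // Tika records one counter per page; zero everywhere means the
    // non-OCR text is complete and the expensive OCR pass can be skipped.
    if (!hasUnmappedChars(metadata.getValues(PDF.UNMAPPED_UNICODE_CHARS_PER_PAGE))) {
        return handler.toString();
    }

    // Pass 2: OCR-only re-parse, only for documents that lost glyphs.
    PDFParserConfig ocrOnly = new PDFParserConfig();
    ocrOnly.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
    ctx.set(PDFParserConfig.class, ocrOnly);
    handler = new BodyContentHandler(-1);
    try (InputStream in = Files.newInputStream(pdf)) {
        parser.parse(in, handler, new Metadata(), ctx);
    }
    return handler.toString();
}
```

This way the 10x OCR cost is paid only for the documents whose first pass actually dropped glyphs.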

Appreciate any thoughts on that, thanks.

UPD

Thanks @mkl for pointing me towards injecting the font mapping prior to parsing. I wonder if it's possible to automate this, assuming I have the fonts (TTF files) the PDF was built with?
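Given the original TTFs, automating the injection might look roughly like this. This is an untested sketch against PDFBox/FontBox 2.x; attachToUnicode is a made-up name, and it relies on the fonts using Identity-H encoding (so CID == glyph ID). One caveat: PDResources caches fonts, so you may need to save the patched document to a buffer and reload it before PDFTextStripper picks up the new ToUnicode streams:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.fontbox.ttf.CmapSubtable;
import org.apache.fontbox.ttf.TTFParser;
import org.apache.fontbox.ttf.TrueTypeFont;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;

// Build a ToUnicode CMap for every font whose name ends with baseFontName
// (e.g. "OpenSans") from the cmap table of the matching TTF file.
static void attachToUnicode(PDDocument doc, String baseFontName, File ttfFile)
        throws IOException {
    TrueTypeFont ttf = new TTFParser().parse(ttfFile);
    CmapSubtable cmap = ttf.getUnicodeCmap(true); // unicode -> glyph id

    // Invert the cmap: glyph id (== CID for Identity-H) -> code point.
    Map<Integer, Integer> gidToUni = new HashMap<>();
    for (int uni = 0; uni <= 0xFFFF; uni++) {
        int gid = cmap.getGlyphId(uni);
        if (gid != 0) gidToUni.putIfAbsent(gid, uni);
    }

    // Emit a minimal ToUnicode CMap (one bfchar entry per glyph;
    // valid, though a real implementation would batch entries).
    StringBuilder sb = new StringBuilder()
        .append("/CIDInit /ProcSet findresource begin\n")
        .append("12 dict begin begincmap\n")
        .append("/CMapName /Adobe-Identity-UCS def /CMapType 2 def\n")
        .append("1 begincodespacerange <0000> <FFFF> endcodespacerange\n");
    for (Map.Entry<Integer, Integer> e : gidToUni.entrySet()) {
        sb.append(String.format("1 beginbfchar <%04X> <%04X> endbfchar\n",
                e.getKey(), e.getValue()));
    }
    sb.append("endcmap CMapName currentdict /CMap defineresource pop end end\n");

    // Attach the stream to every matching font dictionary.
    for (PDPage page : doc.getPages()) {
        for (COSName name : page.getResources().getFontNames()) {
            PDFont font = page.getResources().getFont(name);
            if (font != null && font.getName().endsWith(baseFontName)) {
                PDStream stream = new PDStream(doc, new ByteArrayInputStream(
                        sb.toString().getBytes(StandardCharsets.US_ASCII)));
                font.getCOSObject().setItem(COSName.TO_UNICODE, stream.getCOSObject());
            }
        }
    }
}
```

Calling this once per known font before extraction would cover the documents for which the source TTFs are available; everything else still falls through to the OCR path.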

The code snippets I use for parsing, in case they help:

Tika:

static void parseAndPrint(String fileName) throws IOException, TikaException {
    Tika tika = new Tika();
    // try-with-resources so the stream is closed even if parsing fails
    try (InputStream is = Main.class.getResourceAsStream(fileName)) {
        System.out.println(tika.parseToString(is));
    }
}

PDFBox:

static void parseAndPrint(String fileName) throws IOException {
    System.out.println("========= Start File: " + fileName);
    // try-with-resources closes both the stream and the document
    try (InputStream file = Main.class.getResourceAsStream(fileName);
         PDDocument document = PDDocument.load(file)) {
        PDFTextStripper tStripper = new PDFTextStripper();
        System.out.print(tStripper.getText(document));
    }
    System.out.println("========= End File: " + fileName);
}
Monumental answered 6/8, 2020 at 4:17 Comment(7)
If you are sure you know the mappings to Unicode for each font, it is possible to inject them into the PDF before starting the extraction. But that only makes sense if you are reasonably sure that you have the correct mappings to start with. See here for example. - Purim
Unfortunately, the PDF files already exist and were not generated on my side. - Monumental
Whoops, I misinterpreted your comment (thought you were talking about generating new PDFs, for some reason), sorry. I think I'll try that. - Monumental
Did you try the steps at cwiki.apache.org/confluence/display/TIKA/… ? - Grace
No, I didn't. Cool stuff, thanks. It's useful for determining the validity of a PDF from Tika. - Monumental
Does this help? #33414132 - Adventuresome
@AtnNn, I'll take a look. Currently I'm trying to pre-load a COSBase for the existing TTFs and then reuse it to inject by font name. - Monumental
