I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files).
Current parsing outcomes:
- Tika silently returns text that is missing a lot of the needed data.
- Using PDFBox directly gives a bunch of warnings (see below) and also strips the data it couldn't recognize.
- Adobe Acrobat Reader (the save-as-text action) keeps the original document structure, but in place of the problematic fonts it places "".
All the warnings I've seen so far from PDFBox, combined:
Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+51 (51) in font AUDQZE+OpenSans-Identity-H
Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font HCUDUN+DroidSerif-Identity-H
Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font AUDQZE+OpenSans-Identity-H
Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+55 (55) in font GFEIIG+OpenSans
Aug 06, 2020 3:10:49 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font GFEIIF+DroidSerif
Aug 06, 2020 3:10:50 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+5 (5) in font GFEIIG+OpenSans
Ideally I'd like to use Tika, as I expect Word and HTML formats as well.
Question: Could I ask Tika (or PDFBox) to use a different charset mapping, such as ASCII, so that those problematic fonts get parsed with an alternative to the Unicode table?
Here I found the following:
How come I am getting gibberish(G38G43G36G51G5) when extracting text? This is because the characters in a PDF document can use a custom encoding instead of unicode or ASCII. When you see gibberish text then it probably means that a meaningless internal encoding is being used. The only way to access the text is to use OCR. This may be a future enhancement.
^ Does that mean that PDFBox already checks the ASCII charset for a font under the hood, so I don't have to worry about ASCII at all?
Dumb question: Can I somehow provide the missing Unicode mappings to PDFBox? Or via Tika? Say I have a limited number of known failing Unicode mappings; can I somehow give that information to the parser?
My case:
I already have a bunch of documents (so no control over their creation) that I'd like to run search against, so I thought I'd build an index of them using Tika. I presume most of the PDF files will have this issue with non-embedded fonts. So potentially I could use Tika's Metadata.get(UNMAPPED_UNICODE_CHARS_PER_PAGE) property to decide whether a document requires OCR parsing by Tika (pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY)) after the NO_OCR pass. OCR is about 10 times slower and less precise (it also reads images such as logos and tries to make sense of them, which isn't needed and just adds noise), but I don't see any other options.
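Roughly the flow I have in mind, as a sketch (it assumes the PDF.UNMAPPED_UNICODE_CHARS_PER_PAGE property and the OCR_STRATEGY values are available in my Tika version, that Tesseract is installed for the OCR pass, and that parseWithOcrFallback / parseOnce / maxUnmappedPerPage are names I made up):

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.PDF;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;

static String parseWithOcrFallback(String fileName, int maxUnmappedPerPage) throws Exception {
    // First pass: plain text extraction without OCR
    Metadata metadata = new Metadata();
    String text = parseOnce(fileName, PDFParserConfig.OCR_STRATEGY.NO_OCR, metadata);

    // The property is multi-valued: one count per page
    for (String perPage : metadata.getValues(PDF.UNMAPPED_UNICODE_CHARS_PER_PAGE)) {
        if (Integer.parseInt(perPage) > maxUnmappedPerPage) {
            // Too many unmapped glyphs on some page -> re-parse with OCR only
            return parseOnce(fileName, PDFParserConfig.OCR_STRATEGY.OCR_ONLY, new Metadata());
        }
    }
    return text;
}

static String parseOnce(String fileName, PDFParserConfig.OCR_STRATEGY strategy, Metadata metadata) throws Exception {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setOcrStrategy(strategy);
    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, pdfConfig);

    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    try (InputStream is = Main.class.getResourceAsStream(fileName)) {
        new AutoDetectParser().parse(is, handler, metadata, context);
    }
    return handler.toString();
}

That would at least keep the slow OCR pass limited to the documents that actually need it.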
Appreciate any thoughts on that, thanks.
UPD
Thanks @mkl for pointing in the direction of injecting the font mapping prior to parsing. I wonder if it's possible to automate this, assuming I have the fonts (TTF files) the PDF was built with?
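For example (just a sketch using FontBox classes; gidToUnicode is a name I made up, and I'm assuming TrueTypeFont.getUnicodeCmapLookup() is available in the FontBox version that ships with my PDFBox), the TTF's own Unicode cmap could be inverted to get a glyph-ID-to-Unicode map:

import org.apache.fontbox.ttf.CmapLookup;
import org.apache.fontbox.ttf.TTFParser;
import org.apache.fontbox.ttf.TrueTypeFont;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Invert the TTF's Unicode cmap: glyph ID -> Unicode string
static Map<Integer, String> gidToUnicode(File ttfFile) throws IOException {
    Map<Integer, String> map = new HashMap<>();
    try (TrueTypeFont ttf = new TTFParser().parse(ttfFile)) {
        CmapLookup cmap = ttf.getUnicodeCmapLookup();
        // Walk the BMP; extend the range if the fonts use supplementary-plane glyphs
        for (int codePoint = 0; codePoint <= 0xFFFF; codePoint++) {
            int gid = cmap.getGlyphId(codePoint);
            if (gid > 0 && !map.containsKey(gid)) {
                map.put(gid, new String(Character.toChars(codePoint)));
            }
        }
    }
    return map;
}

Since the fonts are Identity-H, the CIDs from the warnings (e.g. CID+51) should be glyph IDs, so a map like this could feed the ToUnicode CMap that gets injected before parsing. The catch is that the embedded fonts are subsets (the AUDQZE+/GFEIIG+ prefixes), and if the producer re-numbered glyph IDs while subsetting, the glyph IDs of the original TTF won't line up.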
The code snippets I use for parsing, in case they're any help:
Tika:
static void parseAndPrint(String fileName) throws IOException, TikaException {
    Tika tika = new Tika();
    InputStream is = Main.class.getResourceAsStream(fileName);
    System.out.println(tika.parseToString(is));
}
PDFBox:
static void parseAndPrint(String fileName) throws IOException {
    System.out.println("========= Start File: " + fileName);
    PDFTextStripper tStripper = new PDFTextStripper();
    // try-with-resources closes the stream and the document even if getText() throws
    try (InputStream file = Main.class.getResourceAsStream(fileName);
         PDDocument document = PDDocument.load(file)) {
        System.out.print(tStripper.getText(document));
    }
    System.out.println("========= End File: " + fileName);
}