PDF text extraction returns wrong characters due to ToUnicode map

I am trying to extract text from a foreign-language PDF file using PDFMiner, but am being foiled by its ToUnicode map. The file behaves strangely even in normal PDF viewers.

For example, here is a screenshot from some text in the file:

[Screenshot: the correctly rendered text]

But if I select and copy the text, it looks like this:

िनरकर

You can see several characters have changed, in particular the second-to-last character.

Not surprisingly, PDFMiner extracts the incorrect text. Yet every PDF viewer manages to display the text correctly. I suspect the issue is either the ToUnicode map or something to do with conjunct characters. The desired letter should be the sequence 0x915, 0x94D, 0x937 (the conjunct क्ष: KA + VIRAMA + SSA), but PDFMiner reports only 0x915, which is a different character.

What do I need to do to get PDFMiner to extract text correctly, i.e. as in the image rather than the copy-pasted text?

Here is a link to the PDF in question.
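
For reference, a minimal sketch of the extraction step, assuming pdfminer.six's high-level API (the file name is a placeholder):

```python
from pdfminer.high_level import extract_text

# Extract all text from the PDF; with this file the conjunct comes out as
# a single U+0915 instead of the sequence U+0915 U+094D U+0937.
text = extract_text("document.pdf")
print(text)
```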

Kano asked 23/2, 2015 at 16:47

In short:

Your PDF does not contain the information required for correct text extraction without the use of OCR.

In detail:

Both the ToUnicode Map and the Unicode entries in the font program of the embedded subset of Mangal-Regular in your PDF claim that these four glyphs

[Image: four distinct glyphs, all claiming to be 0x915]

all represent the same Unicode code point, 0x915.

Thus, any text extraction program that does not look at the drawn glyphs (i.e. does not attempt OCR) will return 0x915 for any of those glyphs.
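
This is easy to verify by dumping the font's ToUnicode stream. A minimal sketch, assuming pikepdf as the inspection tool (the file name is a placeholder):

```python
import pikepdf

# Print the ToUnicode CMap of every font on every page. For this PDF the
# mapping entries should show several different glyph codes all mapped to
# the same Unicode value <0915>.
with pikepdf.open("document.pdf") as pdf:
    for page_no, page in enumerate(pdf.pages, start=1):
        fonts = page["/Resources"]["/Font"]
        for name, font in fonts.items():
            if "/ToUnicode" in font:
                cmap = font["/ToUnicode"].read_bytes()
                print(f"page {page_no}, font {name}:")
                print(cmap.decode("latin-1"))
```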

Background:

You seem to wonder why PDF viewers display the text correctly while text extraction (copy & paste or PDFMiner) does not extract it correctly.

The reason is that PDF as a format does not contain the text as such. It contains pointers (direct ones or via mappings) to glyph drawing instructions in embedded font programs. Using these pointers the PDF is drawn as you expect.

Furthermore, it can contain extra information mapping such glyph pointers to Unicode code points. This extra information is what text extraction programs use. In the case of your PDF these mappings are incorrect, and therefore the extracted text is incorrect.

Suctorial answered 24/2, 2015 at 11:33
Great answer. OCR is not an option here; however I'm willing to go to a low-level PDF tool. The font tables in this PDF are not large; I could manually create my own correct ToUnicode map. At that point, can I either (i) overwrite the ToUnicode map in this PDF, or (ii) modify the code of an extraction program (like PDFMiner) to use my manually created map instead? If either of these is feasible, what tool would you suggest to do it? – Kano
I'd propose you overwrite the ToUnicode map in this PDF using a general purpose PDF library with a low-level object access API for a programming language of your choice. Then it is merely a matter of traversing the PDF object structure, finding the ToUnicode map stream, replacing its content, and saving the result. – Suctorial
The problem we're facing is eerily similar - except it's for Lohit - Devanagari. Were you able to work something out? – Ahoy
@Ahoy I only proposed an approach and didn't work anything out in this regard, and pnj's profile says "Last seen Jul 12 '18 at 12:18", so he's not likely to react at all. Thus, you may want to make this an actual question (not a mere comment) referring to this answer but requesting more hands-on help. – Suctorial
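
Following the approach proposed in the comments above, here is a minimal sketch of overwriting the ToUnicode stream, assuming pikepdf as the general purpose PDF library. The glyph code <0003> and the single bfchar entry are hypothetical placeholders; the real codes have to be read from the existing ToUnicode stream of the embedded Mangal-Regular subset, and in practice you would merge corrections into that stream rather than replace it wholesale:

```python
import pikepdf

# Hypothetical corrected CMap: maps the (placeholder) glyph code <0003> to
# the Unicode sequence U+0915 U+094D U+0937. The actual glyph codes must be
# taken from the PDF's existing ToUnicode stream.
CORRECTED_CMAP = b"""/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0003> <0915094D0937>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
"""

with pikepdf.open("input.pdf") as pdf:
    for page in pdf.pages:
        for name, font in page["/Resources"]["/Font"].items():
            # Only touch the embedded Mangal subset (BaseFont like /XXXXXX+Mangal).
            if "Mangal" in str(font.get("/BaseFont", "")):
                font["/ToUnicode"] = pdf.make_stream(CORRECTED_CMAP)
    pdf.save("fixed.pdf")
```

PDFMiner reads the ToUnicode stream when mapping glyph codes back to text, so re-running the extraction on fixed.pdf should then return the corrected sequences.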
