Parsing a PDF (Devanagari script) using PDFMiner gives incorrect output [duplicate]
I am trying to parse a PDF file containing an Indian voters list in Hindi (Devanagari script).

The PDF displays all the text correctly, but when I dumped it to text using PDFMiner, the output characters differ from those in the original PDF.

For example, the displayed (correct) word is सामान्य,

but the extracted word is सपमपनद.

Now I want to know why this is happening and how to correctly parse this type of PDF file.

I am also including a sample PDF file:

http://164.100.180.82/Rollpdf/AC276/S24A276P001.pdf

Anglin answered 7/8, 2015 at 11:15 Comment(12)
@mkl I saw your answer in another thread, according to which the Unicode mapping/information is broken. I tried another, similar PDF and it worked fine. Is there no way other than OCR'ing this PDF?Anglin
Files in which that information is broken usually cannot easily be text-extracted. Depending on the nature of the problem, it sometimes is possible, e.g. if the Unicode information in the PDF itself is broken but not in the embedded fonts. As far as I remember, though, in the case of that other question both were broken.Acidulant
@Acidulant I didn't get the part "e.g. if the Unicode information in the PDF itself is broken but not in the embedded fonts". How do I check for this in my case? 164.100.180.82/Rollpdf/AC276/S24A276P001.pdfAnglin
And one more thing: how is my PDF reader able to display the correct characters despite the mapping being broken?Anglin
I didn't get the part... - if a non-standard font is used in a PDF, it often is embedded, at least the subset of glyphs actually used in the PDF. In such a case, the information on which glyph corresponds to which Unicode character can be present both in the native PDF format and in the embedded font data. If either is undamaged, one can use that mapping for text extraction.Acidulant
And also one thing more... - there is a mapping from character id to font glyph, and that mapping works. But the character id may initially have been chosen arbitrarily. For text extraction one needs a mapping from character id to Unicode or from font glyph to Unicode. The former is broken; the latter remains to be checked.Acidulant
Thanks a lot @Acidulant, you explained it perfectly! If it's not too much trouble, can you tell me how to check myself whether the mapping is correct?Anglin
can you give me a way so I can check the mapping myself - I would inspect the embedded fonts using FontForge and check some glyphs for which the mapping in the PDF format is wrong.Acidulant
@Acidulant I tried but couldn't achieve anything. Can you check it once you reach the office and have some spare time today?Anglin
Hey, were you able to solve this issue? I am also stuck on this problem. I understand what @Acidulant wanted to convey, but how to solve the issue is still not clear.Suez
@VirajNalawade: Has anybody found the best way to get correct output? Please share your inputs. Thanks.Confusion
Hi @Anglin, did you get the solution for this?Erection
This issue is very similar to the one discussed in this answer, and the appearance of the sample document there is also reminiscent of the document here.

Just like in the case of the document in that other question, the ToUnicode map of the Devanagari script font used in this document maps multiple completely different glyphs to identical Unicode code points. Thus, text extraction based on this mapping is bound to fail, and most text extractors rely on this information, especially in the absence of a font Encoding entry, as here.
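The failure mode can be illustrated with a small sketch (illustrative data only, not taken from the actual PDF): rendering goes through an intact character-id-to-glyph map, while extraction goes through the broken ToUnicode map, so distinct glyphs collapse onto the same character.

```python
# Illustrative sketch of the failure mode; ids and mappings are made up.
# Rendering uses cid -> glyph (intact); extraction uses cid -> Unicode (broken).
cid_to_glyph = {1: "glyph_sa", 2: "glyph_maa", 3: "glyph_nya"}  # rendering: fine
cid_to_unicode = {1: "स", 2: "प", 3: "प"}  # extraction: distinct glyphs collapse

cids = [1, 2, 3]  # character ids as they appear in the page's content stream
extracted = "".join(cid_to_unicode[c] for c in cids)
print(extracted)  # garbled text, although every glyph renders correctly
```

This is why the page looks right on screen while the extracted text is wrong: the viewer never consults the ToUnicode map, only the extractor does.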


Some text extractors can use the glyph-to-Unicode mapping contained in the embedded font program (if present). But checking this mapping in the Devanagari script font program used in this document, it turns out that it associates most glyphs with U+F020 through U+F062, named "uniF020" etc.

[Image: glyph table of the embedded font, with the encoding shown as "Compact UnicodeBmp"]

These Unicode code points are located in the Unicode Private Use Area, i.e. they have no standardized meaning, but applications may use them as they like.
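A quick check of that claim: the Basic Multilingual Plane's Private Use Area spans U+E000 through U+F8FF, so the U+F020..U+F062 range found in the font program falls entirely inside it, and Python's Unicode database classifies such characters as "Private Use".

```python
import unicodedata

# The BMP Private Use Area spans U+E000 through U+F8FF.
PUA_START, PUA_END = 0xE000, 0xF8FF
in_pua = all(PUA_START <= cp <= PUA_END for cp in range(0xF020, 0xF063))
print(in_pua)  # True

# PUA characters carry the general category 'Co' ("Other, Private Use"),
# i.e. Unicode assigns them no standardized meaning.
print(unicodedata.category(chr(0xF020)))  # 'Co'
```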

Thus, text extractors using the Unicode mapping contained in the font program wouldn't deliver immediately intelligible text either.


There is one fact, though, which can help you mostly automate text extraction from this document nonetheless: the same PDF object is referenced for the Devanagari script font on multiple pages, so on all pages referencing the same PDF object, the same original character identifier or the same font-program private-use Unicode code point refers to the same visual symbol. In your document I counted only 5 copies of the font.

Thus, if you find a text extractor which either returns the character identifiers (ignoring all ToUnicode maps) or returns the private use area Unicode code points from the font program, you can use its output and merely replace each entry according to a few maps.
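Such a replacement step might look like the sketch below. The mapping table is hypothetical; a real one would have to be built once per font copy by inspecting each glyph (e.g. in FontForge) and noting which character it depicts.

```python
# Hypothetical PUA -> Devanagari table; a real one must be built by looking at
# each glyph of the embedded font and recording the character it depicts.
pua_to_devanagari = {"\uf020": "स", "\uf021": "ा", "\uf022": "म"}

def remap(raw: str) -> str:
    """Replace private-use code points per the table; pass others through."""
    return "".join(pua_to_devanagari.get(ch, ch) for ch in raw)

print(remap("\uf020\uf021\uf022"))  # "साम" under the hypothetical table
```

Since the document uses only 5 copies of the font, 5 such tables would cover every page.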


I have not yet had use for such a text extractor, so I don't know of any in the Python context. But who knows, perhaps pdfminer or a similar package can be told (by some option) to ignore the misleading ToUnicode map and then be used as outlined above.
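One related avenue worth noting: when pdfminer cannot map a character id to Unicode at all, it emits placeholders of the form (cid:NNN) in its output. Whether it can be forced to do so here despite the (wrong) ToUnicode map is the open question above, but output of that shape is easy to post-process with a hand-built table; the cid values below are hypothetical.

```python
import re

# Hypothetical cid -> character table, built by inspecting the embedded font.
cid_map = {65: "स", 66: "ा"}

def replace_cids(text: str) -> str:
    """Substitute '(cid:NNN)' placeholders using the hand-built table;
    unknown cids become U+FFFD (replacement character)."""
    return re.sub(r"\(cid:(\d+)\)",
                  lambda m: cid_map.get(int(m.group(1)), "\ufffd"),
                  text)

print(replace_cids("(cid:65)(cid:66)"))  # "सा"
```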

Acidulant answered 10/8, 2015 at 15:8 Comment(7)
Thanks. I understood what you explained here. Can you please guide me on how to update the ToUnicode map in "pdfquery" or "pdfminer"? Could any other library help me? Thanks again. I'm having the same issue.Confusion
Unfortunately I cannot guide you, as I don't know those libraries in detail. When I process PDFs, I usually use Java and Java libraries.Acidulant
Thanks for your response. If I use Java, which library or documentation could help me with this?Confusion
With Java you can use iText or PDFBox or any general-purpose PDF library that allows direct access to the basic PDF objects. For example, you can find code to remove all ToUnicode maps here for iText and here for PDFBox.Acidulant
Could we render the PDF as images (PNG or JPEG) and then use OCR with OpenCV (Python)? Would that help?Confusion
OCR is always a last resort when confronted with broken Unicode mappings or Encodings.Acidulant
Hi @NiksJain, did you get the solution for this?Erection

© 2022 - 2024 — McMap. All rights reserved.