Parsing a PDF (Devanagari script) using PDFMiner gives incorrect output [duplicate]
I am trying to parse a PDF file containing an Indian voters list in Hindi (Devanagari script).

The PDF displays all the text correctly, but when I dumped it to text using PDFMiner, the output characters differ from those in the original PDF.

For example, the displayed (correct) word is सामान्य,

but the extracted word is सपमपनद.

Now I want to know why this is happening and how to correctly parse this type of PDF file.

I am also including a sample PDF file:

http://164.100.180.82/Rollpdf/AC276/S24A276P001.pdf

Anglin answered 7/8, 2015 at 11:15 Comment(12)
@mkl I saw your answer in another thread, according to which the Unicode mapping/information is broken. I tried another, similar PDF and it worked fine. Is there no way other than OCR'ing this PDF?Anglin
Files in which that information is broken usually cannot easily be text-extracted. Depending on the nature of the problem, it sometimes is possible, e.g. if the Unicode information in the PDF itself is broken but not in the embedded fonts. As far as I remember, though, in the case of that other question both were broken.Acidulant
@Acidulant I didn't get the part "e.g. if the Unicode information in the PDF itself is broken but not in the embedded fonts". How do I check for this in my case? 164.100.180.82/Rollpdf/AC276/S24A276P001.pdfAnglin
And one more thing: how is my PDF reader able to display the correct characters despite the mapping being broken?Anglin
I didn't get the part... - if a non-standard font is used in a PDF, it often is embedded, at least the subset of glyphs actually used in the PDF. In such a case, the information on which glyph corresponds to which Unicode character can be present both in the native PDF format and in the embedded font data. If either is undamaged, one can use that mapping for text extraction.Acidulant
And also one thing more... - there is a mapping from character id to font glyph, and that mapping works. But the character id may initially have been chosen arbitrarily. For text extraction one needs a mapping from character id to Unicode or from font glyph to Unicode. The former is broken; the latter remains to be checked.Acidulant
Thanks a lot @Acidulant, you explained it perfectly! If it's not too much trouble, can you tell me how to check myself whether the mapping is correct?Anglin
can you give me a way so I can check the mapping myself - I would inspect the embedded fonts using FontForge and check some glyphs for which the mapping in the PDF format is wrong.Acidulant
@Acidulant I tried but couldn't achieve anything. Can you check it once you reach the office and have some spare time today?Anglin
Hey, were you able to solve this issue? I am also stuck on this problem. I understand what @Acidulant wanted to convey, but how to solve the issue is still not clear.Suez
@VirajNalawade: Has anybody found the best way to get correct output? Please share your inputs. Thanks.Confusion
Hi @Anglin, did you get the solution for this?Erection
This issue is very similar to the one discussed in this answer, and the appearance of the sample document there is also reminiscent of the document here.

Just like in the case of the document in that other question, the ToUnicode map of the Devanagari script font used in this document maps multiple completely different glyphs to identical Unicode code points. Thus, text extraction based on this mapping is bound to fail, and most text extractors rely on this information, especially in the absence of a font Encoding entry, as here.
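The failure mode can be illustrated with a small sketch (illustrative data only, not taken from the actual PDF): rendering goes through an intact character-id-to-glyph map, while extraction goes through the broken ToUnicode map, so distinct glyphs collapse onto the same character.

```python
# Illustrative sketch of the failure mode; ids and mappings are made up.
# Rendering uses cid -> glyph (intact); extraction uses cid -> Unicode (broken).
cid_to_glyph = {1: "glyph_sa", 2: "glyph_maa", 3: "glyph_nya"}  # rendering: fine
cid_to_unicode = {1: "स", 2: "प", 3: "प"}  # extraction: distinct glyphs collapse

cids = [1, 2, 3]  # character ids as they appear in the page's content stream
extracted = "".join(cid_to_unicode[c] for c in cids)
print(extracted)  # garbled text, although every glyph renders correctly
```

This is why the page looks right on screen while the extracted text is wrong: the viewer never consults the ToUnicode map, only the extractor does.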


Some text extractors can use the glyph-to-Unicode mapping contained in the embedded font program (if present). But checking this mapping in the Devanagari script font program used in this document, it turns out that it associates most glyphs with U+F020 through U+F062, named "uniF020" etc.

[Image: glyph table of the embedded font, with the encoding shown as "Compact UnicodeBmp"]

These Unicode code points are located in the Unicode Private Use Area, i.e. they have no standardized meaning, but applications may use them as they like.
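A quick check of that claim: the Basic Multilingual Plane's Private Use Area spans U+E000 through U+F8FF, so the U+F020..U+F062 range found in the font program falls entirely inside it, and Python's Unicode database classifies such characters as "Private Use".

```python
import unicodedata

# The BMP Private Use Area spans U+E000 through U+F8FF.
PUA_START, PUA_END = 0xE000, 0xF8FF
in_pua = all(PUA_START <= cp <= PUA_END for cp in range(0xF020, 0xF063))
print(in_pua)  # True

# PUA characters carry the general category 'Co' ("Other, Private Use"),
# i.e. Unicode assigns them no standardized meaning.
print(unicodedata.category(chr(0xF020)))  # 'Co'
```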

Thus, text extractors using the Unicode mapping contained in the font program wouldn't deliver immediately intelligible text either.


There is one fact, though, which can help you mostly automate text extraction from this document nonetheless: the same PDF object is referenced for the Devanagari script font on multiple pages, so on all pages referencing the same PDF object, the same original character identifier or the same font-program private-use Unicode code point refers to the same visual symbol. In your document I counted only 5 copies of the font.

Thus, if you find a text extractor which either returns the character identifiers (ignoring all ToUnicode maps) or returns the private use area Unicode code points from the font program, you can use its output and merely replace each entry according to a few maps.
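Such a replacement step might look like the sketch below. The mapping table is hypothetical; a real one would have to be built once per font copy by inspecting each glyph (e.g. in FontForge) and noting which character it depicts.

```python
# Hypothetical PUA -> Devanagari table; a real one must be built by looking at
# each glyph of the embedded font and recording the character it depicts.
pua_to_devanagari = {"\uf020": "स", "\uf021": "ा", "\uf022": "म"}

def remap(raw: str) -> str:
    """Replace private-use code points per the table; pass others through."""
    return "".join(pua_to_devanagari.get(ch, ch) for ch in raw)

print(remap("\uf020\uf021\uf022"))  # "साम" under the hypothetical table
```

Since the document uses only 5 copies of the font, 5 such tables would cover every page.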


I have not yet had use for such a text extractor, so I don't know of any in the Python context. But who knows, perhaps pdfminer or a similar package can be told (by some option) to ignore the misleading ToUnicode map and then be used as outlined above.
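One related avenue worth noting: when pdfminer cannot map a character id to Unicode at all, it emits placeholders of the form (cid:NNN) in its output. Whether it can be forced to do so here despite the (wrong) ToUnicode map is the open question above, but output of that shape is easy to post-process with a hand-built table; the cid values below are hypothetical.

```python
import re

# Hypothetical cid -> character table, built by inspecting the embedded font.
cid_map = {65: "स", 66: "ा"}

def replace_cids(text: str) -> str:
    """Substitute '(cid:NNN)' placeholders using the hand-built table;
    unknown cids become U+FFFD (replacement character)."""
    return re.sub(r"\(cid:(\d+)\)",
                  lambda m: cid_map.get(int(m.group(1)), "\ufffd"),
                  text)

print(replace_cids("(cid:65)(cid:66)"))  # "सा"
```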

Acidulant answered 10/8, 2015 at 15:8 Comment(7)
Thanks. I understood what you explained here. Can you please guide me on how to update the ToUnicode map in "pdfquery" or "pdfminer"? Could any other library help me? Thanks again. I'm having the same issue.Confusion
Unfortunately I cannot guide you, as I don't know those libraries in detail. When I process PDFs, I usually use Java and Java libraries.Acidulant
Thanks for your response. If I use Java, which library or documentation could help me with this?Confusion
With Java you can use iText or PDFBox or any general-purpose PDF library that allows direct access to the basic PDF objects. For example, you can find code to remove all ToUnicode maps here for iText and here for PDFBox.Acidulant
Could we render the PDF as images (PNG or JPEG) and then use OCR with OpenCV (Python)? Would that help?Confusion
OCR is always a last resort when confronted with broken Unicode mappings or Encodings.Acidulant
Hi @NiksJain, did you get the solution for this?Erection

© 2022 - 2024 — McMap. All rights reserved.