Unable to copy exact Hindi content from PDF
I am not able to copy Hindi content from a PDF file. When I try to copy and paste that content, it changes into different Hindi characters.

Example-

Original- निर्वाचक

After paste- ननरररचक

This is how it appears.

Can anybody help me get the exact Hindi characters?

Phyllome answered 10/6, 2015 at 12:19 Comment(4)
Very often Hindi fonts are embedded with incorrect glyph-to-Unicode mappings. Applying OCR might be necessary.Unimpeachable
It's impossible to help you in any way without seeing an actual PDF document showing this problem.Astounding
Hello @SavendraSingh, I am facing exactly the same issue with a similar document. I need a favour from you: can you share how you resolved this issue? How did you read the document? Your response will be really helpful to me.Lavern
I solved this issue with OCR. I did complete voter data extraction for Karnataka.Perchloride
This issue is similar to the one discussed in this answer, and the appearance of the sample document there is also reminiscent of the document here:

In a nutshell

Your document itself provides the information that e.g. the glyphs "निर्वाचक" in the head line represent the text "ननरररचक". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.

In detail

The top line of the first page is generated by the following operations in the page content stream:

/9 239 Tf
( !"#$%&) Tj 

The first line selects the font named 9 at a size of 239 (an operation at the beginning of the page scales everything down). The second line causes glyphs to be printed. These glyphs are referenced between the brackets using the custom encoding of that font.
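
To illustrate how such string operands sit in the content stream, here is a toy sketch that pulls the literal string operand of a Tj operator out of a decompressed stream. The regex is intentionally naive: it ignores escape sequences, nested parentheses, hex strings, and TJ arrays, so it should not be taken as a real PDF parser.

```python
import re

# A minimal excerpt of the (decompressed) page content stream,
# as shown in the answer above.
content = b'/9 239 Tf\n( !"#$%&) Tj'

# Naive extraction of literal-string Tj operands.
operands = re.findall(rb"\((.*?)\)\s*Tj", content)
print(operands)  # [b' !"#$%&']
```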

The font 9 on the first page of your PDF contains a ToUnicode map. This map in particular contains the entries

<20> <20> <0928>
<21> <21> <0928>
<22> <22> <0930>
<23> <23> <0930>
<24> <24> <0930> 

i.e. the codes 0x20 (' ') and 0x21 ('!') both map to the Unicode code point 0x0928 ('न') and the codes 0x22 ('"'), 0x23 ('#'), and 0x24 ('$') all to the Unicode code point 0x0930 ('र').

Thus, the contents of ( !"#$%&), displayed as "निर्वाचक", are extracted and copied & pasted as "ननरररचक", which is completely correct according to the information in the document.
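
The effect of this map can be reproduced in a few lines of Python. The entries for 0x25 and 0x26 are not in the excerpt above; they are assumed continuations of the map, chosen so that the result matches the pasted text from the question.

```python
# The (misleading) ToUnicode map of font 9, as a byte -> character dict.
to_unicode = {
    0x20: "\u0928",  # न  (from the excerpt above)
    0x21: "\u0928",  # न
    0x22: "\u0930",  # र
    0x23: "\u0930",  # र
    0x24: "\u0930",  # र
    0x25: "\u091A",  # च  (assumed continuation of the map)
    0x26: "\u0915",  # क  (assumed continuation of the map)
}

tj_bytes = b' !"#$%&'  # the string operand of the Tj operator
extracted = "".join(to_unicode[b] for b in tj_bytes)
print(extracted)  # ननरररचक
```

This is exactly what a text extractor does: it looks up each glyph code in the ToUnicode map and emits the mapped code point, with no way of knowing that the glyphs drawn on the page look different.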

Unimpeachable answered 12/6, 2015 at 13:24 Comment(13)
I have more than 250 PDF files of the same type, and I am not able to extract the given content. OCR is also not working correctly; it misses many of the characters.Phyllome
@Unimpeachable can you then please explain how to solve this issue? I understand the problem you are raising, but how to solve it is still not clear.Metallize
@Metallize "can you then please explain how to solve this issue" - in another answer it looked like there were actually only a few font dictionaries in each PDF. One solution would be to present each glyph of each of these fonts to a user who would then provide the correct Unicode character. From this information one can then build a ToUnicode map for each font object and replace the original one.Unimpeachable
@Metallize If you have many documents and the fonts in them are subsets of the same few actual full fonts, you can automate this more and more by recognizing glyphs already mapped to Unicode by the user before and reusing their former input.Unimpeachable
@Metallize Creating this tool is a non-trivial project in its own right; the developer should know their way around PDF internals and font format internals. If you need to process very many such documents, the work may pay off, though.Unimpeachable
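
The replacement approach described in the comments above could be sketched roughly as follows. This is a hypothetical illustration that only generates the text of a corrected ToUnicode CMap from a user-supplied glyph-code-to-character mapping; embedding it as a stream and wiring it into the font dictionary would still require a PDF library, and all names here are illustrative.

```python
def build_tounicode_cmap(mapping):
    """Build a ToUnicode CMap for single-byte glyph codes.

    mapping: dict of glyph code (0..255) -> Unicode character.
    Note: the PDF spec limits bfchar blocks to 100 entries each;
    this sketch assumes a small mapping and emits a single block.
    """
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CMapName /Custom-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
        f"{len(mapping)} beginbfchar",
    ]
    for code, ch in sorted(mapping.items()):
        lines.append(f"<{code:02X}> <{ord(ch):04X}>")
    lines += [
        "endbfchar",
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)

# Corrected entries collected from a user looking at the glyphs:
print(build_tounicode_cmap({0x20: "\u0928", 0x22: "\u0930"}))
```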
I don't have many documents. In fact, I have to work on a similar document that other users (@SavendraSingh, @Rohit) have asked about. For now, a simple method would help.Metallize
@Unimpeachable : Could it be that, instead of UTF-8, some other encoding can be used with this document type?Radbun
@Radbun the font encodings used here are ad-hoc encodings (completely non-standard, single user only). The ToUnicode maps are meant to map these to Unicode, but they simply lie here.Unimpeachable
@Unimpeachable : I would like to ask if there is a reason why such mapping problems exist in PDFs. Is it human error, or is there a certain reasoning behind it? I find these in standard govt. docs too.Radbun
"I would like to ask , if there is a reason why such mapping problems are there in pdfs" - I'm not sure but I'd guess that in the case at hand the PDF generator simply is deficient. There are cases, though, in which by design the mapping has been corrupted to make text extraction difficult, see for example this answer.Unimpeachable
@Unimpeachable : How are the codes 0x20 (' ') and 0x21 ('!') mapped to the Unicode code point 0x0928 ('न')? I have got the page content stream, but I only get the font-to-Unicode mapping such as <01> <0020>, as you can see here.Radbun
@Radbun I don't really understand what your question is here.Unimpeachable
@Radbun "How are (' ') mapped to 0x20" - in the answer above I posted an excerpt of the content stream. Strictly speaking there is no space character ' ' in the content stream but there is a byte 0x20 in the content stream. Merely for visualization I said that there is the instruction ( !"#$%&) Tj, actually there is the instruction comprised of the byte sequence 0x28 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x29 0x20 0x54 0x6A. But here you immediately have the hex values mapped by the ToUnicode map.Unimpeachable
