Read PDF using iTextSharp where the PDF language is non-English
I am trying to read this PDF using iTextSharp in C#, which will convert the PDF into a Word file. It also needs to maintain table formatting and fonts in Word. When I try this with an English PDF it works perfectly, but with some Indian languages like Hindi and Marathi it does not work.

 public string ReadPdfFile(string Filename)
        {
            StringBuilder text = new StringBuilder();
            try
            {
                if (File.Exists(Filename))
                {
                    PdfReader pdfReader = new PdfReader(Filename);

                    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                    {
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                        text.Append(currentText);
                    }

                    // Close the reader only after all pages have been read;
                    // closing it inside the loop breaks every page after the first.
                    pdfReader.Close();
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
            textBox1.Text = text.ToString();
            return text.ToString();
        }
Viera answered 13/3, 2013 at 12:24 Comment(5)
Unfortunately you merely say that it is not working, not what is going wrong. That being said, when copying and pasting from your document with Acrobat Reader, I get characters which look definitively different from the original PDF content. As Acrobat Reader has a fairly good text extraction engine, I assume that the Indian-language text in your PDF does not carry all the information necessary for text extraction short of OCR. Otorhinolaryngology
@Otorhinolaryngology Thanks for the reply. The problem is that it reads the word मतद|र as मतदरर. This happens to all words in the PDF, so the actual meaning of the words is changed. What is your suggestion on the issue? Viera
I'll look into the PDF. But as even Adobe Reader does not properly extract text from the PDF, I assume that the Indian-language text in your PDF does not carry all the information necessary for text extraction short of OCR. Otorhinolaryngology
@Otorhinolaryngology So does that mean this PDF cannot be converted into a Word file? Viera
Hi @RahulRajput I am facing a similar problem. Could you please share your phone number on my twitter: twitter.com/sunderbhiya . I would love to talk to you about this in detail.Dehydrate
I inspected your file with a special focus on your sample "मतद|र" being extracted as "मतदरर" in the topmost line of the document pages.

In a nutshell:

Your document itself provides the information that, e.g., the glyphs "मतद|र" in the head line represent the text "मतदरर". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.

In detail:

The top line of the first page is generated by the following operations in the page content stream:

/9 280 Tf
(-12"!%$"234%56*5) Tj

The first line selects the font named /9 at a size of 280 (an operation at the beginning of the page scales everything by a factor of 0.05; thus, the effective size is 14 units which you observe in the file).

The second line causes glyphs to be printed. These glyphs are referenced between the parentheses using the custom encoding of that font.

When a program tries to extract the text, it has to deduce the actual characters from these glyph references using information from the font.

The font /9 on the first page of your PDF is defined using these objects:

242 0 obj<<
    /Type/Font/Name/9/BaseFont 243 0 R/FirstChar 33/LastChar 94
    /Subtype/TrueType/ToUnicode 244 0 R/FontDescriptor 247 0 R/Widths 248 0 R>>
endobj
243 0 obj/CDAC-GISTSurekh-Bold+0
endobj 
247 0 obj<<
    /Type/FontDescriptor/FontFile2 245 0 R/FontBBox 246 0 R/FontName 243 0 R
    /Flags 4/MissingWidth 946/StemV 0/StemH 0/CapHeight 500/XHeight 0
    /Ascent 1050/Descent -400/Leading 0/MaxWidth 1892/AvgWidth 946/ItalicAngle 0>>
endobj 

So there is no /Encoding element but at least there is a reference to a /ToUnicode map. Thus, a program extracting text has to rely on the given /ToUnicode mapping.

The stream referenced by /ToUnicode contains the following mappings of interest when extracting the text from (-12"!%$"234%56*5):

<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>

(Already here you can see that multiple character codes are mapped to the same unicode code point...)
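These entries can be turned into a lookup table mechanically. The following is a minimal Python sketch, not a full CMap parser; the mapping lines are copied verbatim from the /ToUnicode stream above, and each line maps a code range to a starting Unicode code point:

```python
import re

# The relevant mapping lines from the font's /ToUnicode stream.
cmap_lines = """
<21> <21> <0930>
<22> <22> <0930>
<24> <24> <091c>
<25> <25> <0020>
<2a> <2a> <0031>
<2d> <2d> <092e>
<31> <31> <0924>
<32> <32> <0926>
<33> <33> <0926>
<34> <34> <002c>
<35> <35> <0032>
<36> <36> <0030>
"""

# Expand each <lo> <hi> <dst> range into individual code -> character entries.
to_unicode = {}
for lo, hi, dst in re.findall(r"<(\w+)> <(\w+)> <(\w+)>", cmap_lines):
    for code in range(int(lo, 16), int(hi, 16) + 1):
        to_unicode[code] = chr(int(dst, 16) + code - int(lo, 16))

# Codes 0x21 and 0x22 both land on U+0930 (र): the table is many-to-one,
# so the distinction between the original glyphs is irrecoverably lost.
print(to_unicode[0x21], to_unicode[0x22])  # र र
```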

Thus, text extraction must result in:

- = 0x2d -> 0x092e = म
1 = 0x31 -> 0x0924 = त
2 = 0x32 -> 0x0926 = द
" = 0x22 -> 0x0930 = र    instead of  |
! = 0x21 -> 0x0930 = र
% = 0x25 -> 0x0020 =  
$ = 0x24 -> 0x091c = ज
" = 0x22 -> 0x0930 = र
2 = 0x32 -> 0x0926 = द
3 = 0x33 -> 0x0926 = द
4 = 0x34 -> 0x002c = ,
% = 0x25 -> 0x0020 =  
5 = 0x35 -> 0x0032 = 2
6 = 0x36 -> 0x0030 = 0
* = 0x2a -> 0x0031 = 1
5 = 0x35 -> 0x0032 = 2
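Applying those mappings programmatically reproduces the table above. Here is a small Python sketch of what any /ToUnicode-based extractor must do; the dictionary is transcribed from the font's /ToUnicode stream shown earlier:

```python
# ToUnicode mapping transcribed from font /9 of the PDF.
to_unicode = {
    0x21: "\u0930", 0x22: "\u0930", 0x24: "\u091c", 0x25: "\u0020",
    0x2a: "\u0031", 0x2d: "\u092e", 0x31: "\u0924", 0x32: "\u0926",
    0x33: "\u0926", 0x34: "\u002c", 0x35: "\u0032", 0x36: "\u0030",
}

# The glyph codes shown between the parentheses of the Tj operator.
glyph_string = '-12"!%$"234%56*5'

# A text extractor has no choice but to follow the ToUnicode table.
extracted = "".join(to_unicode[ord(c)] for c in glyph_string)
print(extracted)  # मतदरर जरदद, 2012
```

The first word comes out as "मतदरर", exactly the wrong extraction reported in the question.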

Thus, the text iTextSharp (and also Adobe Reader!) extracts from the heading on the first document page is exactly what the document, in its font information, claims is correct.

As the cause for this is the misleading mapping information in the font definition, it is not surprising that there are misinterpretations all over the document.

Otorhinolaryngology answered 22/3, 2013 at 9:27 Comment(8)
The better solution would be a proper source document. OCR works by rendering the PDF pages as bitmap graphics (e.g. using PDFBox) and applying OCR to them; I have no experience regarding which OCR software is good for the job. If you feel like accepting the dare, you might instead create code that renders only the glyphs contained in the fonts of a given PDF, OCRs them, derives correct /ToUnicode tables, and adds those tables to the fonts in the respective PDF. Otorhinolaryngology
@Otorhinolaryngology Is there any Java code to get the ToUnicode content, that is, (-12"!%$"234%56*5)? Arango
I want to get the content stream -> "/9 280 Tf (-12"!%$"234%56*5) Tj" using Java code from the PDF attached to this question. If there is a way, please guide me. Arango
Ah, I asked you to clarify because your former comment mentioned ToUnicode, but as your clarification does not mention it, it does not seem to be involved.Otorhinolaryngology
@PrasadB That being said, please make your comment a question in its own right, with a reference to this question, and also describe which Java PDF libraries you would accept working with in this context. Otorhinolaryngology
@Otorhinolaryngology I am trying to extract text from the PDF at "dropbox.com/s/ezz015t3qdqo5hk/test.pdf" with PDF libraries like PDFBox or iText, but I am getting the same problem as in the question above. I have gone through your answer and understood the cause, but I want to reproduce your analysis programmatically. Is there any way to do this? Arango
It is possible to extract the content. The problem is that due to the incorrect ToUnicode table you don't immediately know which code is which character. Furthermore, please make this request a stackoverflow question in its own right. Comments don't leave enough space for full answers.Otorhinolaryngology
@Otorhinolaryngology I have posted my question and this is the link. Please help me. Arango
As @mkl said, we'll need more information as to why things aren't working. But I can tell you a couple of things that might help you.

First, SimpleTextExtractionStrategy is very simple. If you read the docs for it you'll see that:

If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF

What that means is that although a PDF may look like it should be read from top to bottom, it may have been written in a different order. The PDF you referenced actually has the second visual line written first. See my post here for a slightly smarter text extraction strategy that tries to return text top to bottom. When I run my code against the first page of your PDF it appears to pull out each "line" correctly.
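The core idea behind such a smarter strategy can be illustrated without iTextSharp: collect each text chunk with its page coordinates, then sort by descending y (the PDF origin is at the bottom left, so y grows upward) and ascending x before joining. The chunk data below is made up for illustration; a real strategy would obtain it from the render listener:

```python
# Hypothetical text chunks as an extractor might report them,
# in the order they appear in the content stream: (x, y, text).
chunks = [
    (50, 700, "second visual line"),   # written first in the stream
    (50, 720, "first visual line"),    # but drawn higher on the page
    (200, 720, "(continued)"),
]

# Sort top-to-bottom (negate y because PDF y grows upward),
# then left-to-right within a line.
ordered = sorted(chunks, key=lambda c: (-c[1], c[0]))
result = " ".join(text for _, _, text in ordered)
print(result)  # first visual line (continued) second visual line
```

A production-quality strategy additionally has to group chunks whose y coordinates are merely close (same baseline) and decide where spaces and line breaks belong, which is what makes real extraction strategies considerably longer than this sketch.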

Second, PDFs don't have a concept of tables. They just have text and lines drawn at certain locations, and neither of these is related to the other. What that means is that you would need to calculate each and every line and build your own concept of a table; you won't find any code within iTextSharp that does this for you. I personally wouldn't even bother trying to write one.

Third, text extraction pulls only text, which has nothing to do with fonts. If you want font information, you'll have to build that logic in yourself. See my post here for a very basic start at it.

Nanji answered 13/3, 2013 at 14:11 Comment(3)
+1; a remark, though: SimpleTextExtractionStrategy, while being simple, may for some documents still be the best choice, especially in the case of multi-column text without easily recognizable column separation, as long as the text has been added to the content in reading order. One essentially has to decide on a per-document basis. Otorhinolaryngology
@Chris Haas Thanks for the reply. The problem is that it reads the word मतद|र as मतदरर. This happens to all words in the PDF, so the actual meaning of the words is changed. Viera
As @Otorhinolaryngology said, the fact that even Adobe's own programs extract the wrong text suggests that there is a deeper problem. Nanji