How to extract text with iTextSharp 4.1.6?

Asked 13/4, 2012 at 14:50 Answered 24/11, 2013 at 17:43

iTextSharp 4.1.6 is the last version licensed under the LGPL and can be used for commercial purposes without paying license fees.

I, and probably others, would like to know how to extract text with this version.

Does anyone have an idea?

Consecrate answered 13/4, 2012 at 14:50 Comment(4)

See the following link for an example: #2551296 – Someday 13/4, 2012 at 16:30

@Hans, does that solution work with 4.1.6? ITextExtractionStrategy, SimpleTextExtractionStrategy and PdfTextExtractor are unknown to me. – Harmonie 13/9, 2012 at 12:43

I tried using the code at codeproject.com/Articles/14170/… . I found it only works for some PDFs; and it throws IndexOutOfRangeExceptions in CheckToken when it is called with single-character arguments (as that sample does). – Suspensor 26/10, 2012 at 13:19

@SpoiledTechie.com No, didn't try to fix it. I just used another solution. – Consecrate 8/8, 2013 at 9:15

I had to hack this together manually as I was in the same boat as you. Hopefully this will help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter holding the path to the PDF file.

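using System.Text;
using iTextSharp.text.pdf;

// The class and method names are only an illustrative wrapper around the snippet.
public static class PdfTextHelper
{
    // Returns every literal string found in the page content streams of the PDF at fileName.
    public static string ExtractText(string fileName)
    {
        var reader = new PdfReader(fileName);
        var sb = new StringBuilder();

        try
        {
            // PDF page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                var cpage = reader.GetPageN(page);
                var content = cpage.Get(PdfName.CONTENTS);

                // Assumes /Contents is a single indirect reference;
                // this cast throws InvalidCastException if it is a PdfArray.
                var ir = (PRIndirectReference)content;
                var value = reader.GetPdfObject(ir.Number);

                if (value.IsStream())
                {
                    PRStream stream = (PRStream)value;

                    // Decode the stream (applies its filters) before tokenizing.
                    var streamBytes = PdfReader.GetStreamBytes(stream);
                    var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

                    try
                    {
                        while (tokenizer.NextToken())
                        {
                            // Keep only string tokens; operators and numbers are skipped.
                            if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                            {
                                sb.Append(tokenizer.StringValue);
                            }
                        }
                    }
                    finally
                    {
                        tokenizer.Close();
                    }
                }
            }
        }
        finally
        {
            reader.Close();
        }

        return sb.ToString();
    }
}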

Undertint answered 24/11, 2013 at 17:43 Comment(4)

This is one of the poor-man's text extraction solutions one sees so often. Actually the text extraction capabilities in iText 2.1.7/4.2.0 were much more advanced than that (in spite of having quite some deficits). Most likely they are also present in the latest iTextSharp before the license change. Give them a try! – Equivocal 24/11, 2013 at 20:10

@Equivocal -- There is no PdfTextExtractor in iTextSharp at that version, at least not in the iTextSharp-LGPL NuGet package. This was the only way I could find to do it. If you know of a better way that is actually in the DLL, I'd appreciate it! – Undertint 25/11, 2013 at 15:9

Also I found a case where "content" is not a PRIndirectReference and instead is a PdfArray of PRIndirectReferences, so that case has to be handled accordingly as well. – Undertint 25/11, 2013 at 15:11
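
One possible way to cover both cases mentioned above is to let the reader assemble the page content instead of casting /Contents yourself. The following is only a sketch: it assumes that PdfReader.GetPageContent(int) is available in your 4.1.6 build and that it returns the decoded bytes whether /Contents is a single stream or an array of streams. The names PageContentSketch, ExtractStrings and ExtractText are made up for illustration. Like the answer's code, it just concatenates string tokens, so it inherits the same limitations (no spacing, no font encoding handling).

```csharp
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper; not part of iTextSharp.
public static class PageContentSketch
{
    // Collects the string tokens found in the raw bytes of a content stream.
    public static string ExtractStrings(byte[] contentBytes)
    {
        var sb = new StringBuilder();
        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(contentBytes));
        try
        {
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                {
                    sb.Append(tokenizer.StringValue);
                }
            }
        }
        finally
        {
            tokenizer.Close();
        }
        return sb.ToString();
    }

    // Walks all pages and extracts the strings from each page's content.
    public static string ExtractText(string fileName)
    {
        var reader = new PdfReader(fileName);
        var sb = new StringBuilder();
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumption: GetPageContent handles both the single-stream
                // and the array-of-streams form of /Contents.
                byte[] contentBytes = reader.GetPageContent(page);
                sb.Append(ExtractStrings(contentBytes));
            }
        }
        finally
        {
            reader.Close();
        }
        return sb.ToString();
    }
}
```

Usage would be along the lines of string text = PageContentSketch.ExtractText(fileName);. Splitting out ExtractStrings keeps the tokenizer loop independent of how the bytes were obtained, so it can be tried on a hand-written content stream as well.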

You are right, my assumption that the text extraction capabilities of the Java version had been ported to iTextSharp before the license change was wrong. Thus, I can think of no way short of porting the parser classes from Java iText 4.2.0 to C# yourself. I have no idea how easy or hard that is. Or, of course, you could try and switch to a current version of iTextSharp as soon as AGPL or commercial licensing becomes an option for you. – Equivocal 25/11, 2013 at 15:36
