How to extract text with iTextSharp 4.1.6?

iTextSharp 4.1.6 is the last version licensed under the LGPL, so it is free to use for commercial purposes without paying license fees.

It might be interesting for others, as well as for me, to know how to extract text with this version.

Does anyone have an idea?

Consecrate answered 13/4, 2012 at 14:50 Comment(4)
See the following link for an example: #2551296 - Someday
@Hans, does that solution work with 4.1.6? ITextExtractionStrategy, SimpleTextExtractionStrategy and PdfTextExtractor are unknown to me. - Harmonie
I tried using the code at codeproject.com/Articles/14170/… . I found it only works for some PDFs, and it throws IndexOutOfRangeExceptions in CheckToken when it is called with single-character arguments (as that sample does). - Suspensor
@SpoiledTechie.com No, I didn't try to fix it. I just used another solution. - Consecrate

I had to hack this together manually, as I was in the same boat as you. Hopefully this will help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter holding the path to the PDF file.

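var reader = new PdfReader(fileName);

// Collects every string token found in the page content streams.
StringBuilder sb = new StringBuilder();

try
{
    // PDF page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Page dictionary and its /Contents entry.
        var cpage = reader.GetPageN(page);
        var content = cpage.Get(PdfName.CONTENTS);

        // Note: this cast assumes /Contents is a single indirect reference.
        // It throws InvalidCastException when the page stores an array instead.
        var ir = (PRIndirectReference)content;

        var value = reader.GetPdfObject(ir.Number);

        if (value.IsStream())
        {
            PRStream stream = (PRStream)value;

            // Decode the stream (applies its filters) into raw content bytes.
            var streamBytes = PdfReader.GetStreamBytes(stream);

            var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

            try
            {
                while (tokenizer.NextToken())
                {
                    // Keep only string operands; operators, numbers and names are skipped.
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    {
                        string str = tokenizer.StringValue;
                        sb.Append(str);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
        }
    }
}
finally
{
    reader.Close();
}

return sb.ToString();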
Undertint answered 24/11, 2013 at 17:43 Comment(4)
This is one of the poor-man's text extraction solutions one sees so often. Actually the text extraction capabilities in iText 2.1.7/4.2.0 were much more advanced than that (in spite of having quite some deficits). Most likely they are also present in the latest iTextSharp before the license change. Give them a try! - Equivocal
@Equivocal -- There is no PdfTextExtractor in iTextSharp at that version, at least not in the iTextSharp-LGPL NuGet package. This was the only way I could find to do it. If you know of a better way that is actually in the DLL, I'd appreciate it! - Undertint
Also I found a case where "content" is not a PRIndirectReference and instead is a PdfArray of PRIndirectReferences, so that case has to be handled accordingly as well. - Undertint
You are right, my assumption that the text extraction capabilities of the Java version had been ported to iTextSharp before the license change was wrong. Thus, I can think of no way short of porting the parser classes from Java iText 4.2.0 to C# yourself. I have no idea how easy or hard that is. Or, of course, you could try and switch to a current version of iTextSharp as soon as AGPL or commercial licensing becomes an option for you. - Equivocal
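
The PdfArray case raised in the comments could perhaps be sidestepped by letting the reader assemble the page content itself. The following is only a rough sketch, not the answerer's code: it assumes that PdfReader.GetPageContent(int) and the PRTokeniser(byte[]) constructor are available in the 4.1.6 DLL and that GetPageContent returns the decoded content whether /Contents is a single stream or an array of streams, which is worth verifying against your build. The class name LgplTextExtraction and the per-page line break are illustrative choices that are not in the original answer.

```csharp
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper class; the name is for illustration only.
public static class LgplTextExtraction
{
    public static string ExtractText(string fileName)
    {
        var reader = new PdfReader(fileName);
        var sb = new StringBuilder();

        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumption: GetPageContent handles both a single content
                // stream and an array of streams, and returns decoded bytes.
                byte[] contentBytes = reader.GetPageContent(page);
                if (contentBytes == null || contentBytes.Length == 0)
                {
                    continue; // page without content
                }

                var tokenizer = new PRTokeniser(contentBytes);
                try
                {
                    while (tokenizer.NextToken())
                    {
                        // Same idea as the answer: keep only string operands.
                        if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                        {
                            sb.Append(tokenizer.StringValue);
                        }
                    }
                }
                finally
                {
                    tokenizer.Close();
                }

                // Illustrative addition: separate pages with a line break.
                sb.AppendLine();
            }
        }
        finally
        {
            reader.Close();
        }

        return sb.ToString();
    }
}
```

Calling LgplTextExtraction.ExtractText(fileName) would return the concatenated string operands of all pages. Like the answer above, this ignores text positioning and font encodings, so word spacing can be lost and PDFs with custom-encoded fonts may yield unreadable output.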

© 2022 - 2024 — McMap. All rights reserved.