Is there an easy way to manually decode a FlateDecode Filter to extract text in a PDF? C#

Asked 11/9, 2014 at 23:34 Answered 29/3, 2018 at 11:54

I posted a question related to this a while back but got no responses. Since then, I've discovered that the PDF is encoded using FlateDecode, and I was wondering if there is a way to manually decode the PDF in C# (Windows Phone 8)? I'm getting output like the following:

%PDF-1.5
%????
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
5 0 obj
<<
/Filter /FlateDecode
/Length 9
>>
stream x^+

The PDF has been created using the SyncFusion PDF controls for Windows Phone 8. Unfortunately, they do not currently have a text extraction feature, and I couldn't find that feature in other WP PDF controls either.

Basically, all I want is to download the PDF from OneDrive and read the PDF contents. Curious if this is easily doable?

Triolet answered 11/9, 2014 at 23:34 Comment(2)

if there is a way to manually decode the PDF in C# - of course there is; several existing .Net PDF libraries are written purely in C#... if this is easily doable - if you use such a library, text extraction from PDFs up to a certain PDF content complexity is easy. If you try not to use such a library, chances are you have to re-invent the wheel, i.e. to program code resembling major parts of such a library. This is not easy. – Donnie 12/9, 2014 at 9:30

Thanks for your response mkl. Unfortunately, none of the .NET libraries I've found that relate to text extraction support Windows Phone 8 (they reference a library WP8 doesn't support). Seems like I may have to re-invent the wheel or wait for one of the companies to come out with an Extract Text feature. I've read Adobe's documentation (acroeng.adobe.com/wp/?page_id=321) on this but it isn't clear how to manually decode it. – Triolet 12/9, 2014 at 11:43

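using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Inflates the raw bytes of a FlateDecode stream (the bytes between "stream" and "endstream").
private static string decompress(byte[] input)
{
    // Skip the 2-byte zlib header (RFC 1950); DeflateStream expects raw deflate data (RFC 1951).
    byte[] cutinput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutinput, 0, cutinput.Length);

    using (var stream = new MemoryStream())
    using (var compressStream = new MemoryStream(cutinput))
    using (var decompressor = new DeflateStream(compressStream, CompressionMode.Decompress))
    {
        decompressor.CopyTo(stream);

        // The result is the decompressed stream content, decoded with the system default encoding.
        return Encoding.Default.GetString(stream.ToArray());
    }
}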

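According to the similar question linked below, the first 2 bytes of the stream data have to be cut off before decompressing; the function above does this. These 2 bytes are the zlib (RFC 1950) header, which DeflateStream does not expect because it reads raw deflate (RFC 1951) data. Pass all bytes of the stream as input, and make sure the byte count matches the /Length value given in the stream dictionary.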

C# decode (decompress) Deflate data of PDF File
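
Rather than always cutting 2 bytes, the header can be checked first. The following is a minimal sketch, not the linked answer's code: the class and method names (FlateSketch, HasZlibHeader, Inflate) are made up for illustration. It relies on two RFC 1950 rules: the low nibble of the first byte is 8 for deflate, and the first two bytes, read as a big-endian number, are a multiple of 31. The "x^" visible after "stream" in the question's dump is 0x78 0x5E, which satisfies both. It assumes no preset dictionary (FDICT) is in use.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class FlateSketch
{
    // True if the data starts with a plausible zlib (RFC 1950) header:
    // compression method 8 (deflate) and a header checksum divisible by 31.
    public static bool HasZlibHeader(byte[] data)
    {
        if (data == null || data.Length < 2) return false;
        int cmf = data[0];
        int flg = data[1];
        return (cmf & 0x0F) == 8 && ((cmf << 8) | flg) % 31 == 0;
    }

    // Inflates stream bytes, skipping the 2-byte zlib header only when it is present.
    // Assumption: the FDICT flag is not set, so no dictionary ID follows the header.
    public static byte[] Inflate(byte[] data)
    {
        int offset = HasZlibHeader(data) ? 2 : 0;
        using (var input = new MemoryStream(data, offset, data.Length - offset))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Returning a byte[] instead of a string leaves the choice of text encoding to the caller.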

Bellerophon answered 29/3, 2018 at 11:54 Comment(2)

+1 for mentioning that you have to trim the first 2 bytes... It works! How would we possibly know that??? – Vantassel 20/12, 2019 at 3:46

Thanks, yes in the other thread user1011394 explains it has something to do with RFC 1951 versus RFC 1950. I did some more research and found out these 2 bytes are the RFC 1950 (ZLIB) framing bytes. Maybe another way is to use a ZLIB library. – Bellerophon 21/12, 2019 at 22:22

The easiest solution is to use the DeflateStream class provided by the .NET Framework. An example can be found in a similar thread. This approach might have some pitfalls.

If this doesn't work, there are libraries (like DotNetZip) capable of deflate stream decompression. Please check this link for a performance comparison.
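
Where a newer runtime is an option, a third-party library may not be needed for this step. A minimal sketch, assuming .NET 6 or later (so not Windows Phone 8): System.IO.Compression.ZLibStream reads the zlib (RFC 1950) wrapper itself, so no bytes have to be cut off by hand. The class and method names (ZlibSketch, InflateZlib) are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

public static class ZlibSketch
{
    // Inflates zlib-wrapped data, such as the bytes of a FlateDecode stream.
    // Assumption: the target runtime is .NET 6 or later, where ZLibStream exists.
    public static byte[] InflateZlib(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var zlib = new ZLibStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            zlib.CopyTo(output);
            return output.ToArray();
        }
    }
}
```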

The last option I see that avoids reinventing the wheel is to use another PDF parsing library, either just for stream decompression or for the whole PDF processing.

Opus answered 12/9, 2014 at 14:2 Comment(0)
