Python: Inflate and Deflate implementations

I am interfacing with a server that requires data sent to it to be compressed with the Deflate algorithm (Huffman encoding + LZ77), and that also sends data which I need to inflate.

I know that Python includes Zlib, and that the C libraries in Zlib support calls to Inflate and Deflate, but these apparently are not provided by the Python Zlib module. It does provide Compress and Decompress, but when I make a call such as the following:

result_data = zlib.decompress( base64_decoded_compressed_string )

I receive the following error:

Error -3 while decompressing data: incorrect header check

Gzip does no better; when making a call such as:

result_data = gzip.GzipFile( fileobj = StringIO.StringIO( base64_decoded_compressed_string ) ).read()

I receive the error:

IOError: Not a gzipped file

This makes sense, as the data is a deflate stream and not a true gzip file.

Now I know that there is a Deflate implementation available (Pyflate), but I do not know of an Inflate implementation.

It seems that there are a few options:

  1. Find an existing implementation (ideal) of Inflate and Deflate in Python
  2. Write my own Python extension to the zlib c library that includes Inflate and Deflate
  3. Call something else that can be executed from the command line (such as a Ruby script, since Inflate/Deflate calls in zlib are fully wrapped in Ruby)
  4. ?

I am seeking a solution, but lacking a solution I will be thankful for insights, constructive opinions, and ideas.

Additional information: The result of deflating (and encoding) a string should, for the purposes I need, give the same result as the following snippet of C# code, where the input parameter is an array of UTF bytes corresponding to the data to compress:

public static string DeflateAndEncodeBase64(byte[] data)
{
    if (null == data || data.Length < 1) return null;
    string compressedBase64 = "";

    //write into a new memory stream wrapped by a deflate stream
    using (MemoryStream ms = new MemoryStream())
    {
        using (DeflateStream deflateStream = new DeflateStream(ms, CompressionMode.Compress, true))
        {
            //write byte buffer into memorystream
            deflateStream.Write(data, 0, data.Length);
            deflateStream.Close();

            //rewind memory stream and write to base 64 string
            byte[] compressedBytes = new byte[ms.Length];
            ms.Seek(0, SeekOrigin.Begin);
            ms.Read(compressedBytes, 0, (int)ms.Length);
            compressedBase64 = Convert.ToBase64String(compressedBytes);
        }
    }
    return compressedBase64;
}

Running this .NET code for the string "deflate and encode me" gives the result

7b0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8iZvl5mbV5mi1nab6cVrM8XeT/Dw==

When "deflate and encode me" is run through Python's zlib.compress() and then base64 encoded, the result is "eJxLSU3LSSxJVUjMS1FIzUvOT0lVyE0FAFXHB6k=".

It is clear that zlib.compress() is not an implementation of the same algorithm as the standard Deflate algorithm.

More Information:

The first 2 bytes of the .NET deflate data ("7b0HY..."), after b64 decoding are 0xEDBD, which does not correspond to Gzip data (0x1f8b), BZip2 (0x425A) data, or Zlib (0x789C) data.

The first 2 bytes of the Python compressed data ("eJxLS..."), after b64 decoding are 0x789C. This is a Zlib header.
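The header checks above can be sketched in a few lines of Python 3. This is only a rough heuristic, not a definitive detector: the function name sniff_format is made up for illustration, and a raw deflate stream could by coincidence begin with two bytes that pass the zlib test.

```python
import base64
import gzip
import zlib


def sniff_format(data: bytes) -> str:
    """Guess the container of a compressed byte string from its first two bytes.

    Hypothetical helper for illustration; a heuristic only.
    """
    if len(data) < 2:
        return "unknown"
    if data[:2] == b"\x1f\x8b":          # gzip magic number
        return "gzip"
    if data[:2] == b"BZ":                # bzip2 magic number (0x425A)
        return "bzip2"
    cmf, flg = data[0], data[1]
    # RFC 1950: the low nibble of CMF is 8 (deflate), the window-size nibble
    # is at most 7, and the 16-bit header value is a multiple of 31.
    if cmf & 0x0F == 8 and (cmf >> 4) <= 7 and ((cmf << 8) | flg) % 31 == 0:
        return "zlib"
    return "raw deflate or other"


# First bytes of the .NET output quoted above (0xED 0xBD ...).
dotnet_prefix = base64.b64decode("7b0HYBxJ")
print(sniff_format(dotnet_prefix))
print(sniff_format(zlib.compress(b"deflate and encode me")))
print(sniff_format(gzip.compress(b"deflate and encode me")))
```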

SOLVED

To handle the raw deflate and inflate, without header and checksum, the following things needed to happen:

On deflate/compress: strip the first two bytes (header) and the last four bytes (checksum).

On inflate/decompress: zlib.decompress() takes a second argument for the window size. If this value is negative, zlib expects a raw deflate stream with no header or checksum. Here are my current methods, including the base64 encoding/decoding, which work properly:

import zlib
import base64

def decode_base64_and_inflate( b64string ):
    decoded_data = base64.b64decode( b64string )
    return zlib.decompress( decoded_data , -15)

def deflate_and_base64_encode( string_val ):
    zlibbed_str = zlib.compress( string_val )
    compressed_string = zlibbed_str[2:-4]
    return base64.b64encode( compressed_string )
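For anyone on Python 3, where zlib and base64 work on bytes rather than str, the same two helpers might look roughly like this. It is a sketch under assumptions: the encoding parameter and its "utf-8" default are my additions, not something the server in the question is known to require.

```python
import base64
import zlib


def deflate_and_base64_encode(text: str, encoding: str = "utf-8") -> str:
    """Raw-deflate a string and return it as base64 text."""
    # Drop the 2-byte zlib header and the 4-byte Adler-32 trailer.
    raw = zlib.compress(text.encode(encoding))[2:-4]
    return base64.b64encode(raw).decode("ascii")


def decode_base64_and_inflate(b64string: str, encoding: str = "utf-8") -> str:
    """Inverse of the above: base64-decode, then inflate a raw deflate stream."""
    raw = base64.b64decode(b64string)
    # A negative window size tells zlib not to expect a header or checksum.
    return zlib.decompress(raw, -15).decode(encoding)


print(decode_base64_and_inflate(deflate_and_base64_encode("deflate and encode me")))
```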
Presbyterate answered 6/7, 2009 at 23:24 Comment(0)

This is an add-on to MizardX's answer, giving some explanation and background.

See http://www.chiramattel.com/george/blog/2007/09/09/deflatestream-block-length-does-not-match.html

According to RFC 1950, a zlib stream constructed in the default manner is composed of:

  • a 2-byte header (e.g. 0x78 0x9C)
  • a deflate stream -- see RFC 1951
  • an Adler-32 checksum of the uncompressed data (4 bytes)
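To make the three parts concrete, here is a small Python 3 sketch that cuts a zlib stream apart and checks the trailer by hand. The helper name split_zlib_stream is made up, and the slicing assumes a default stream with no preset dictionary.

```python
import struct
import zlib


def split_zlib_stream(stream: bytes):
    """Split a default zlib stream into (header, raw deflate body, Adler-32)."""
    header, body, trailer = stream[:2], stream[2:-4], stream[-4:]
    # RFC 1950 stores the Adler-32 checksum in big-endian byte order.
    (expected_adler,) = struct.unpack(">I", trailer)
    return header, body, expected_adler


data = b"deflate and encode me"
header, body, expected_adler = split_zlib_stream(zlib.compress(data))

# The body alone is a raw deflate stream, so inflate it with negative wbits.
inflated = zlib.decompress(body, -zlib.MAX_WBITS)

# The checksum covers the uncompressed data, not the compressed bytes.
checksum_ok = zlib.adler32(inflated) == expected_adler
print(header.hex(), checksum_ok)
```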

The C# DeflateStream works on (you guessed it) a deflate stream. MizardX's code is telling the zlib module that the data is a raw deflate stream.

Observations: (1) One hopes the C# "deflation" method producing a longer string happens only with short input (2) Using the raw deflate stream without the Adler-32 checksum? Bit risky, unless replaced with something better.

Updates

Error message: Block length does not match with its complement

If you are trying to inflate some compressed data with the C# DeflateStream and you get that message, then it is quite possible that you are giving it a zlib stream, not a deflate stream.

See How do you use a DeflateStream on part of a file?

Also copy/paste the error message into a Google search and you will get numerous hits (including the one up the front of this answer) saying much the same thing.

The Java Deflater ... used by "the website" ... C# DeflateStream "is pretty straightforward and has been tested against the Java implementation". Which of the following possible Java Deflater constructors is the website using?

public Deflater(int level, boolean nowrap)

Creates a new compressor using the specified compression level. If 'nowrap' is true then the ZLIB header and checksum fields will not be used in order to support the compression format used in both GZIP and PKZIP.

public Deflater(int level)

Creates a new compressor using the specified compression level. Compressed data will be generated in ZLIB format.

public Deflater()

Creates a new compressor with the default compression level. Compressed data will be generated in ZLIB format.
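As a rough Python 3 analogy for those constructors, the nowrap flag can be mapped onto the sign of the window-size argument of zlib.compressobj. This is my own sketch, not how Java implements it; the function name java_style_deflate is invented for illustration.

```python
import zlib


def java_style_deflate(data: bytes, level: int = -1, nowrap: bool = False) -> bytes:
    """Mimic the Deflater(level, nowrap) choice with Python's zlib module.

    Hypothetical helper: nowrap=True yields a raw deflate stream,
    nowrap=False yields a zlib stream with header and Adler-32 checksum.
    """
    wbits = -zlib.MAX_WBITS if nowrap else zlib.MAX_WBITS
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


raw = java_style_deflate(b"deflate and encode me", nowrap=True)
wrapped = java_style_deflate(b"deflate and encode me")

# Each form needs the matching setting on the inflate side.
print(zlib.decompress(raw, -zlib.MAX_WBITS))
print(zlib.decompress(wrapped))
```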

A one-line deflater after throwing away the 2-byte zlib header and the 4-byte checksum:

uncompressed_string.encode('zlib')[2:-4] # does not work in Python 3.x

or

zlib.compress(uncompressed_string)[2:-4]
Superfuse answered 7/7, 2009 at 4:31 Comment(3)
@John Machin: To reply to your first observation... the result is only longer in the case of shorter strings (header? padding?). When I feed in 161 bytes of data for deflation, prior to base64 encoding the result is 126 bytes. - Presbyterate
@John Machin: Great insights and information. The Java signature of deflater used is the one with two parameters, with nowrap==true. I used your one-line deflater example and it inflates well in .NET and Java, despite looking different than the value produced by deflating with the libraries in those languages. This is great. Now I am working on inflate - taking the deflated data produced by Java or .NET and adding on an adler32 checksum and the zlib header to see if I can get Python to consume it well. I'll let you know how it goes. - Presbyterate
@John Machin: Solved. See above. Thanks for your assistance. The key was passing in a negative value to the decompress method for inflate, and your clipping of the header and adler checksum on compress. - Presbyterate

You can still use the zlib module to inflate/deflate data. The gzip module uses it internally, but adds a file-header to make it into a gzip-file. Looking at the gzip.py file, something like this could work:

import zlib

def deflate(data, compresslevel=9):
    compress = zlib.compressobj(
            compresslevel,        # level: 0-9
            zlib.DEFLATED,        # method: must be DEFLATED
            -zlib.MAX_WBITS,      # window size in bits:
                                  #   -15..-9: negate, raw deflate (no header)
                                  #   9..15: normal zlib header and checksum
                                  #   25..31: subtract 16, gzip header
            zlib.DEF_MEM_LEVEL,   # mem level: 1..9 (default 8)
            0                     # strategy:
                                  #   0 = Z_DEFAULT_STRATEGY
                                  #   1 = Z_FILTERED
                                  #   2 = Z_HUFFMAN_ONLY
                                  #   3 = Z_RLE
                                  #   4 = Z_FIXED
    )
    deflated = compress.compress(data)
    deflated += compress.flush()
    return deflated

def inflate(data):
    decompress = zlib.decompressobj(
            -zlib.MAX_WBITS  # see above
    )
    inflated = decompress.decompress(data)
    inflated += decompress.flush()
    return inflated

I don't know if this corresponds exactly to whatever your server requires, but those two functions are able to round-trip any data I tried.

The parameters map directly to what is passed to the zlib library functions.

Python                           C
zlib.compressobj(...)            deflateInit(...)
compressobj.compress(...)        deflate(...)
zlib.decompressobj(...)          inflateInit(...)
decompressobj.decompress(...)    inflate(...)

The constructors create the structure and populate it with default values, and pass it along to the init-functions. The compress/decompress methods update the structure and pass it to inflate/deflate.
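Because the objects keep that state between calls, they can also be fed data piece by piece. A minimal Python 3 sketch of this, assuming raw deflate on both sides (the generator names deflate_chunks and inflate_chunks are made up):

```python
import zlib


def deflate_chunks(chunks, level=9):
    """Yield raw deflate output for an iterable of byte chunks."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    for chunk in chunks:
        out = compressor.compress(chunk)   # may be empty while zlib buffers input
        if out:
            yield out
    yield compressor.flush()               # emits whatever is still buffered


def inflate_chunks(chunks):
    """Yield inflated data for an iterable of raw deflate chunks."""
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    tail = decompressor.flush()
    if tail:
        yield tail


parts = [b"deflate ", b"and ", b"encode me"]
print(b"".join(inflate_chunks(deflate_chunks(parts))))
```

The compressed output is only complete once flush() has been called, so a consumer should not treat the stream as finished before the last chunk arrives.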

Apportion answered 7/7, 2009 at 0:12 Comment(12)
What I am looking for is access to the C-level Inflate and Deflate calls of the library that the Python Zlib module wraps. It does not appear that Decompress and Compress do the same thing, and the Python Zlib module does not expose Inflate and Deflate. - Presbyterate
This is not useful. Please note the additional information I added to my question above. The code you provide above, when run with the string "deflate and encode me", results in "S0lNy0ksSVVIzEtRSM1Lzk9JVchNBQA=", which is even shorter. The correct Deflate result should look like the (longer) .NET generated string that I note above. - Presbyterate
How does a 21-character input string result in a 212-byte deflated output? Does that include a deflate file header? - Orwin
The inflate function managed to decode your data, but the deflate function can't seem to reproduce your string. I am experimenting with the arguments, trying to find some combination that would produce what you want. - Apportion
It seems like the .NET version uses a different, but compatible, algorithm. Could you try to decode a string from the python deflate with .NET ? If it works, then there should be no problem with them encoding the same string differently. - Apportion
@Adam: 212 bytes? His base64-encoded string is 160 bytes long, which DEcodes to 118 bytes. Perhaps you ENcoded it (160 * 4 / 3 approx== 212). Deflate file header? Perhaps you meant a gzip file header -- doesn't look like one of those (gzip.org/zlib/rfc-gzip.html): doesn't start with 0x1F 0x8B (unless C# is using a non-default base64 alphabet). Would be nice if Demi provided (1) any more details available in the website spec (2) the argument docs for C# DeflateStream() - Superfuse
@John Machin: I cannot give precise details about the server, other than to say that it uses java.util.zip.Deflater and java.util.zip.Inflater. The documentation for DeflateStream is at: msdn.microsoft.com/en-us/library/… - it is pretty straightforward and has been tested against the Java implementation. - Presbyterate
@MizardX: I attempted to decode a string from Python deflate (compress) with .NET and it throws an exception: InvalidDataException ("Block length does not match with its complement.") - Presbyterate
@Demi: Please tell us what the original string was and what it became after deflation and confirm that "Python deflate (compress)" means MizardX's code; showing the code that you used to "decode ... with .NET" would probably help also. Please consider editing your question to add all of this information. - Superfuse
Positive window size (9..15) gives zlib header. Even higher (25..31) gives gzip header. Negative (-15..-9) gives no header. The only way to specify window size is by using the extra parameters of the compressobject constructor. - Apportion
@MizardX: That info about window size and headers was in the code comments in your answer; why are you repeating it now? - Superfuse
Your inflate is working perfectly for me and I'm not sure about the deflate as I'm not in need of it. Thank you! - Vollmer
