How can I decompress a gzip stream with zlib?
A

3

124

Gzip-format files (created with the gzip program, for example) use the "deflate" compression algorithm, the same algorithm that zlib uses. However, when zlib is used to inflate a gzip-compressed file, the library returns Z_DATA_ERROR.

How can I use zlib to decompress a gzip file?

Alderman answered 3/12, 2009 at 9:19 Comment(0)
A
134

To decompress a gzip format file with zlib, call inflateInit2 with the windowBits parameter as 16+MAX_WBITS, like this:

inflateInit2(&stream, 16+MAX_WBITS);

If you don't do this, inflate rejects the stream with Z_DATA_ERROR. By default, zlib creates streams with a zlib header, and on inflate it does not recognise the different gzip header unless told to. This is documented in the zlib.h header file starting with version 1.2.1, but it is not in the zlib manual. From the header file:

windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format (the zlib format will return a Z_DATA_ERROR). If a gzip stream is being decoded, strm->adler is a crc32 instead of an adler32.
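
The header difference is easy to see from Python's zlib module, which wraps the same library. The following is only a sketch: the payload and variable names are illustrative, and passing wbits as a keyword to compressobj assumes Python 3.3 or later.

```python
import zlib

payload = b"hello, gzip"  # illustrative test data

# Default wrapper: zlib format (RFC 1950).
zlib_stream = zlib.compress(payload)

# wbits = MAX_WBITS | 16 asks for the gzip wrapper (RFC 1952) instead.
gz = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)
gzip_stream = gz.compress(payload) + gz.flush()

print(zlib_stream[:2].hex())  # zlib header bytes
print(gzip_stream[:2].hex())  # gzip magic number, 1f8b

# Plain inflate expects a zlib header, so gzip input is rejected ...
try:
    zlib.decompress(gzip_stream)
except zlib.error as exc:
    print("default wbits:", exc)

# ... until the gzip wrapper is requested explicitly.
restored = zlib.decompress(gzip_stream, zlib.MAX_WBITS | 16)
print(restored == payload)
```

The same windowBits value passed to inflateInit2 in C has the same effect, since the Python module hands wbits straight to the library.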

Alderman answered 3/12, 2009 at 9:20 Comment(4)
In Python: zlib.decompress(data, 15 + 32) - Eulogy
Thanks, this was highly frustrating until I found this post. - Paige
Perhaps you can provide some guidelines for iterative decompression of a gzip stream. In one-shot gzip decompression the output buffer has a fixed size and must be large enough to hold the whole decompressed output. That size depends on how well the data compressed, which varies with its entropy. Is there a way to dynamically allocate more space for the output buffer when needed? Thanks - Geulincx
I have no idea how this would work. BUT it does work. - Wilda
S
127

python

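The zlib library supports:

  • RFC 1950 (zlib compressed format)
  • RFC 1951 (deflate compressed format)
  • RFC 1952 (gzip compressed format)

The Python zlib module supports these as well.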

choosing windowBits

zlib can compress and decompress all of these formats; the wbits argument selects which one:

  • to (de-)compress deflate format, use wbits = -zlib.MAX_WBITS
  • to (de-)compress zlib format, use wbits = zlib.MAX_WBITS
  • to (de-)compress gzip format, use wbits = zlib.MAX_WBITS | 16

See documentation in http://www.zlib.net/manual.html#Advanced (section inflateInit2)
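
These values can look like magic numbers, but they are just the base window size with flag bits added. A short sketch that prints what each expression evaluates to (the dictionary name is made up for illustration, and the auto-detect entry applies to decompression only):

```python
import zlib

# MAX_WBITS is the base-2 logarithm of the largest window (2**15 = 32 KiB).
wbits_by_format = {  # hypothetical name, for illustration only
    "deflate (raw, no header)": -zlib.MAX_WBITS,
    "zlib": zlib.MAX_WBITS,
    "gzip": zlib.MAX_WBITS | 16,
    "auto-detect zlib or gzip (decompress only)": zlib.MAX_WBITS | 32,
}

for name, wbits in wbits_by_format.items():
    print(f"{name}: wbits = {wbits}")
```

Because MAX_WBITS is 15, the bits for 16 and 32 are not already set, so MAX_WBITS | 16 and MAX_WBITS + 16 give the same number. The negative sign is a convention of the library meaning "no header or trailer".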

examples

test data:

>>> import zlib
>>> deflate_compress = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
>>> zlib_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS)
>>> gzip_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
>>> 
>>> text = b'test'  # bytes are required on Python 3, where the results below print as b'test'
>>> deflate_data = deflate_compress.compress(text) + deflate_compress.flush()
>>> zlib_data = zlib_compress.compress(text) + zlib_compress.flush()
>>> gzip_data = gzip_compress.compress(text) + gzip_compress.flush()
>>> 

obvious test for zlib:

>>> zlib.decompress(zlib_data)
'test'

test for deflate:

>>> zlib.decompress(deflate_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
>>> zlib.decompress(deflate_data, -zlib.MAX_WBITS)
'test'

test for gzip:

>>> zlib.decompress(gzip_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
>>> zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
'test'

the data is also compatible with gzip module:

>>> import gzip
>>> import io
>>> fio = io.BytesIO(gzip_data)  # StringIO.StringIO exists only on Python 2
>>> f = gzip.GzipFile(fileobj=fio)
>>> f.read()
'test'
>>> f.close()

automatic header detection (zlib or gzip)

adding 32 to windowBits will trigger header detection

>>> zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
'test'
>>> zlib.decompress(zlib_data, zlib.MAX_WBITS|32)
'test'
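
Automatic detection covers only the zlib and gzip wrappers; raw deflate data has no header to detect. One possible way to accept all three, sketched here with a hypothetical helper named decompress_any, is to fall back to raw deflate when header detection fails. This is a heuristic rather than anything zlib provides: a raw stream could in principle begin with bytes that happen to pass the header check.

```python
import zlib


def decompress_any(data):
    """Try zlib/gzip auto-detection first, then raw deflate (hypothetical helper)."""
    try:
        return zlib.decompress(data, zlib.MAX_WBITS | 32)
    except zlib.error:
        # No recognizable zlib or gzip header: assume a raw deflate stream.
        return zlib.decompress(data, -zlib.MAX_WBITS)
```

If the fallback also fails, the second zlib.error propagates to the caller, which is usually what you want for corrupt input.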

using gzip instead

For gzip-format data you can use the gzip module directly, but remember that under the hood the gzip module uses zlib.

import gzip

# gzip.open decompresses transparently, so read() returns the uncompressed bytes
with gzip.open('abc.gz', 'rb') as fh:
    data = fh.read()
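
Reading the whole file at once can be avoided. Below is a minimal sketch of chunked decompression with zlib.decompressobj, assuming a hypothetical helper named iter_decompress and an arbitrary 64 KiB chunk size. It handles a single zlib or gzip stream and does not deal with several gzip members concatenated in one file.

```python
import zlib


def iter_decompress(fileobj, chunk_size=64 * 1024):
    """Yield decompressed blocks from a binary file object (hypothetical helper)."""
    # MAX_WBITS | 32 auto-detects a zlib or gzip header.
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        out = d.decompress(chunk)
        if out:
            yield out
    # Emit anything still buffered inside the decompressor.
    tail = d.flush()
    if tail:
        yield tail
```

Typical use would be to open the file in binary mode and loop over iter_decompress(fh), writing each block out as it arrives. To cap how much output a single call produces, the decompress method also accepts a max_length argument and leaves the unprocessed input in unconsumed_tail. The file object returned by gzip.open can likewise be read in pieces with fh.read(n).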
Stomach answered 10/3, 2014 at 21:6 Comment(7)
Why isn't this piece of gold in the docs, in exactly this format? - Graffito
Please feel free to send a pull request / patch against cpython using any of this answer. - Stomach
Great answer for strings. Any idea how to do this for a stream without reading the entire file into memory? - Unconditioned
Thank you. I could solve the decompression problem in my source code with your answer. - Hammons
incredible, this is a gold nugget.. however i can't help but feel these are tantamount to 'magic numbers'? where in the documentation is this mentioned? i looked, but must have really not checked hard enough.. also, the notation i don't fully follow. What does the | mean, is that optional? and why is deflate negative.. is MAX_WBITS a constant.. 🙁 - Parton
@m1nkeh: In Python, as in most (if not all) languages, | is the bitwise-OR operator, an operator like +, -, * and so on. In practice, it is used to "set the bits" in a number. So zlib.MAX_WBITS | 16 means: In the number zlib.MAX_WBITS, "turn on" the bits that are set in 16 (which, being a power of 2, is a single bit). - Nones
The inflate/deflate algorithms may be compatible between gzip and zlib, but they are separate implementations and seem to be completely independent of each other. For example, the gzip (1.9) library never includes "zlib.h" and the zlib (1.3.1) library never includes "gzip.h", either. - Pecker
W
6

The zlib and gzip formats are structured differently. zlib follows RFC 1950 and gzip follows RFC 1952, so they have different headers (and trailers), but the compressed data between them has the same structure in both: a deflate stream as specified in RFC 1951.
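
The relationship between the three RFCs can be checked directly from Python. This sketch assumes the minimal wrappers that zlib itself writes: a 2-byte zlib header with a 4-byte Adler-32 trailer, and a 10-byte gzip header with no optional fields followed by an 8-byte CRC-32 and length trailer. Gzip files from other tools may carry extra header fields such as a file name, so the fixed offsets here are not general.

```python
import struct
import zlib

payload = b"same deflate payload inside both wrappers"  # illustrative data

zlib_stream = zlib.compress(payload, 9)
gz = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
gzip_stream = gz.compress(payload) + gz.flush()

# Strip each wrapper by hand; what remains is a raw deflate stream (RFC 1951).
zlib_body = zlib_stream[2:-4]
gzip_body = gzip_stream[10:-8]
from_zlib = zlib.decompress(zlib_body, -zlib.MAX_WBITS)
from_gzip = zlib.decompress(gzip_body, -zlib.MAX_WBITS)

# The trailers differ: big-endian Adler-32 for zlib,
# little-endian CRC-32 plus uncompressed length for gzip.
(adler,) = struct.unpack(">I", zlib_stream[-4:])
crc, size = struct.unpack("<II", gzip_stream[-8:])
print(adler == zlib.adler32(payload), crc == zlib.crc32(payload), size == len(payload))
```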

Wynnie answered 2/5, 2013 at 16:36 Comment(0)
