Is this a bug in this gzip inflate method?

When searching for how to inflate gzip-compressed data on iOS, the following method appears in a number of results:

- (NSData *)gzipInflate
{
    if ([self length] == 0) return self;

    unsigned full_length = [self length];
    unsigned half_length = [self length] / 2;

    NSMutableData *decompressed = [NSMutableData dataWithLength: full_length + half_length];
    BOOL done = NO;
    int status;

    z_stream strm;
    strm.next_in = (Bytef *)[self bytes];
    strm.avail_in = [self length];
    strm.total_out = 0;
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;

    if (inflateInit2(&strm, (15+32)) != Z_OK) return nil;
    while (!done)
    {
        // Make sure we have enough room and reset the lengths.
        if (strm.total_out >= [decompressed length])
            [decompressed increaseLengthBy: half_length];
        strm.next_out = [decompressed mutableBytes] + strm.total_out;
        strm.avail_out = [decompressed length] - strm.total_out;

        // Inflate another chunk.
        status = inflate (&strm, Z_SYNC_FLUSH);
        if (status == Z_STREAM_END) done = YES;
        else if (status != Z_OK) break;
    }
    if (inflateEnd (&strm) != Z_OK) return nil;

    // Set real length.
    if (done)
    {
        [decompressed setLength: strm.total_out];
        return [NSData dataWithData: decompressed];
    }
    else return nil;
}

But I've come across some data (compressed on a Linux machine with Python's gzip module) that this method fails to inflate on iOS. Here's what's happening:

In the last iteration of the while loop, inflate() returns Z_BUF_ERROR and the loop exits. But inflateEnd(), which is called after the loop, returns Z_OK. The code then assumes that, since inflate() never returned Z_STREAM_END, the inflation failed, and it returns nil.

According to the zlib FAQ (http://www.zlib.net/zlib_faq.html#faq05), Z_BUF_ERROR is not a fatal error, and my tests with a limited set of examples show that the data is successfully inflated if inflateEnd() returns Z_OK, even though the last call to inflate() did not return Z_OK. It seems as if inflateEnd() finished inflating the last chunk of data.

I don't know much about compression and how gzip works, so I'm hesitant to make changes to this code without fully understanding what it does. I'm hoping someone with more knowledge about the topic can shed some light on this potential logic flaw in the code above, and suggest a way to fix it.

Another method that Google turns up, which seems to suffer from the same problem, can be found here: https://github.com/nicklockwood/GZIP/blob/master/GZIP/NSData%2BGZIP.m

Edit:

So, it is a bug! Now, how do we fix it? Below is my attempt. Code review, anyone?

- (NSData *)gzipInflate
{
    if ([self length] == 0) return self;

    unsigned full_length = [self length];
    unsigned half_length = [self length] / 2;

    NSMutableData *decompressed = [NSMutableData dataWithLength: full_length + half_length];
    int status;

    z_stream strm;
    strm.next_in = (Bytef *)[self bytes];
    strm.avail_in = [self length];
    strm.total_out = 0;
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;

    if (inflateInit2(&strm, (15+32)) != Z_OK) return nil;

    do
    {
        // Make sure we have enough room and reset the lengths.
        if (strm.total_out >= [decompressed length])
            [decompressed increaseLengthBy: half_length];
        strm.next_out = [decompressed mutableBytes] + strm.total_out;
        strm.avail_out = [decompressed length] - strm.total_out;

        // Inflate another chunk.
        status = inflate (&strm, Z_SYNC_FLUSH);

        switch (status) {
            case Z_NEED_DICT:
                status = Z_DATA_ERROR;     /* and fall through */
            case Z_DATA_ERROR:
            case Z_MEM_ERROR:
            case Z_STREAM_ERROR:
                (void)inflateEnd(&strm);
                return nil;
        }
    } while (status != Z_STREAM_END);

    (void)inflateEnd (&strm);

    // Set real length.
    if (status == Z_STREAM_END)
    {
        [decompressed setLength: strm.total_out];
        return [NSData dataWithData: decompressed];
    }
    else return nil;
}

Edit 2:

Here's a sample Xcode project that illustrates the issue I'm running into. The deflate happens on the server side, and the data is base64-encoded and then URL-encoded before being transported via HTTP. I've embedded the URL-encoded base64 string in ViewController.m. The URL-decode and base64-decode methods, as well as your gzipInflate method, are in NSDataExtension.m

https://dl.dropboxusercontent.com/u/38893107/gzip/GZIPTEST.zip

Here's the binary file as compressed by Python's gzip library:

https://dl.dropboxusercontent.com/u/38893107/gzip/binary.zip

This is the URL-encoded base64 string that gets transported over HTTP: https://dl.dropboxusercontent.com/u/38893107/gzip/urlEncodedBase64.txt

Bryophyte answered 23/7, 2013 at 20:42 Comment(4)
The attempt goes into an infinite loop if the gzip stream is not complete. - Cowgirl
By the way, "binary.zip" is not a zip file. It is a gzip file. The name should be "binary.gz". - Cowgirl
The URL decodes to binary.zip (which should be called binary.gz), and the code I provided in my answer properly decompresses that to a 221213 byte text file. I did not look at your code to see what's wrong -- that's your job. - Cowgirl
Thanks Mark, you helped more than I would have expected. - Bryophyte

Yes, it's a bug.

It is in fact correct that if inflate() does not return Z_STREAM_END, then you have not completed inflation. inflateEnd() returning Z_OK doesn't really mean much -- just that it was given a valid state and was able to free the memory.

So inflate() must eventually return Z_STREAM_END before you can declare success. However, Z_BUF_ERROR is not a reason to give up. In that case you simply call inflate() again with more input or more output space, and keep going until you get Z_STREAM_END.

From the documentation in zlib.h:

/* ...
Z_BUF_ERROR if no progress is possible or if there was not enough room in the
output buffer when Z_FINISH is used.  Note that Z_BUF_ERROR is not fatal, and
inflate() can be called again with more input and more output space to
continue decompressing.
... */

Update:

Since there is buggy code floating around out there, below is the proper code to implement the desired method. This code handles incomplete gzip streams, concatenated gzip streams, and very large gzip streams. For very large gzip streams, the unsigned lengths in the z_stream are not large enough when compiled as a 64-bit executable. NSUInteger is 64 bits, whereas unsigned is 32 bits. In that case, you have to loop on the input to feed it to inflate().

This example simply returns nil on any error. The nature of the error is noted in a comment after each "return nil;" statement, in case more sophisticated error handling is desired.

- (NSData *) gzipInflate
{
    z_stream strm;

    // Initialize input
    strm.next_in = (Bytef *)[self bytes];
    NSUInteger left = [self length];        // input left to decompress
    if (left == 0)
        return nil;                         // incomplete gzip stream

    // Create starting space for output (guess double the input size, will grow
    // if needed -- in an extreme case, could end up needing more than 1000
    // times the input size)
    NSUInteger space = left << 1;
    if (space < left)
        space = NSUIntegerMax;
    NSMutableData *decompressed = [NSMutableData dataWithLength: space];
    space = [decompressed length];

    // Initialize output
    strm.next_out = (Bytef *)[decompressed mutableBytes];
    NSUInteger have = 0;                    // output generated so far

    // Set up for gzip decoding
    strm.avail_in = 0;
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    int status = inflateInit2(&strm, (15+16));
    if (status != Z_OK)
        return nil;                         // out of memory

    // Decompress all of self
    do {
        // Allow for concatenated gzip streams (per RFC 1952)
        if (status == Z_STREAM_END)
            (void)inflateReset(&strm);

        // Provide input for inflate
        if (strm.avail_in == 0) {
            strm.avail_in = left > UINT_MAX ? UINT_MAX : (unsigned)left;
            left -= strm.avail_in;
        }

        // Decompress the available input
        do {
            // Allocate more output space if none left
            if (space == have) {
                // Double space, handle overflow
                space <<= 1;
                if (space < have) {
                    space = NSUIntegerMax;
                    if (space == have) {
                        // space was already maxed out!
                        (void)inflateEnd(&strm);
                        return nil;         // output exceeds integer size
                    }
                }

                // Increase space
                [decompressed setLength: space];
                space = [decompressed length];

                // Update output pointer (might have moved)
                strm.next_out = (Bytef *)[decompressed mutableBytes] + have;
            }

            // Provide output space for inflate
            strm.avail_out = space - have > UINT_MAX ? UINT_MAX :
                             (unsigned)(space - have);
            have += strm.avail_out;

            // Inflate and update the decompressed size
            status = inflate (&strm, Z_SYNC_FLUSH);
            have -= strm.avail_out;

            // Bail out if any errors
            if (status != Z_OK && status != Z_BUF_ERROR &&
                status != Z_STREAM_END) {
                (void)inflateEnd(&strm);
                return nil;                 // invalid gzip stream
            }

            // Repeat until all output is generated from provided input (note
            // that even if strm.avail_in is zero, there may still be pending
            // output -- we're not done until the output buffer isn't filled)
        } while (strm.avail_out == 0);

        // Continue until all input consumed
    } while (left || strm.avail_in);

    // Free the memory allocated by inflateInit2()
    (void)inflateEnd(&strm);

    // Verify that the input is a valid gzip stream
    if (status != Z_STREAM_END)
        return nil;                         // incomplete gzip stream

    // Set the actual length and return the decompressed data
    [decompressed setLength: have];
    return decompressed;
}
Cowgirl answered 23/7, 2013 at 22:38 Comment(9)
Thanks Mark. Nothing like a response from the zlib author himself! I've attempted to fix the bug (see edited question above if interested) by making the loop go until Z_STREAM_END is returned. But in the annotated example @Joachim linked, the inner loop is conditioned on strm.avail_out == 0, and I don't understand the reason behind that. - Bryophyte
Your fix doesn't always work. In particular, it will go into an infinite loop if fed an incomplete gzip stream. Also that approach in general will work only if the compressed and uncompressed lengths are small enough to fit in an unsigned type. And only if the total lengths will fit in an unsigned long type. There are more robust ways to write that method that do not depend on those assumptions. I will add correct code to my answer. - Cowgirl
Looping on strm.avail_out == 0, or equivalently, waiting for strm.avail_out != 0, waits for all of the uncompressed data that can be generated from the provided compressed input. That's not done until inflate() doesn't fill the output buffer. A little compressed data can sometimes generate a lot of uncompressed data, so you need a loop to pull all that out. - Cowgirl
Thanks so much for taking the time to rewrite this code. You have no idea how many websites reference that function as a way to inflate gzip on iOS. - Bryophyte
I tried your method with the data samples I have and I've found a few that don't get inflated properly. Basically what happens is that after a few iterations of the inner loop, strm.avail_out ends up being > 0, so the loop exits but the status variable is Z_OK. At the same time left and strm.avail_in are both equal to 0, so the outer loop exits and the check for Z_STREAM_END that follows the loops returns nil. - Bryophyte
That means that the gzip stream was incomplete, as noted in the comment where it is returning nil. - Cowgirl
It's deflated with Python's gzip module on the server, then base64 encoded and transported over HTTP, so maybe something gets messed up on the way. But the interesting thing is that one more iteration will make it get to Z_STREAM_END and I seem to get all the data out. - Bryophyte
Can you provide an example gzip stream that is not inflated properly? - Cowgirl
Sure, see the Edit 2 to the question above. Thanks for your interest in this. You've gone above and beyond already! - Bryophyte

Yes, looks like a bug. According to this annotated example from the zlib site, Z_BUF_ERROR is just an indication that there is no more output unless inflate() is provided with more input, not in itself a reason to abort the inflate loop abnormally.

In fact, the linked sample seems to handle Z_BUF_ERROR exactly like Z_OK.

Canonicate answered 23/7, 2013 at 21:8 Comment(1)
@Bryophyte As long as you pass all input data in at once, I can't see a problem. The outer loop in the sample is for "streaming", i.e. if avail_out is 0, it tries to refill the input buffer and tries again until Z_STREAM_END is returned. Since you passed all data in at once and have nothing to refill with, I can't see that you have any option but to retry until you get a hard error or Z_STREAM_END. - Canonicate
