GZipStream doesn't detect corrupt data (even CRC32 passes)?
I'm using GZipStream to compress / decompress data. I chose this over DeflateStream since the documentation states that GZipStream also adds a CRC to detect corrupt data, which is another feature I wanted. My "positive" unit tests are working well in that I can compress some data, save the compressed byte array and then successfully decompress it again. The .NET GZipStream compress and decompress problem post helped me realize that I needed to close the GZipStream before accessing the compressed or decompressed data.

Next, I continued to write a "negative" unit test to be sure corrupt data could be detected. I had previously used the example for the GZipStream class from MSDN to compress a file, open the compressed file with a text editor, change a byte to corrupt it (as if opening it with a text editor wasn't bad enough!), save it and then decompress it to be sure that I got an InvalidDataException as expected.

When I wrote the unit test, I picked an arbitrary byte to corrupt (e.g., compressedDataBytes[50] = 0x99) and got an InvalidDataException. So far so good. I was curious, so I chose another byte, but to my surprise I did not get an exception. This may be okay (e.g., I coincidentally hit an unused byte in a block of data), so long as the data could still be recovered successfully. However, I didn't get the correct data back either!

To be sure "it wasn't me", I took the cleaned-up code from the bottom of .NET GZipStream compress and decompress problem and modified it to sequentially corrupt each byte of the compressed data until it failed to decompress properly. Here are the changes (note that I'm using the Visual Studio 2010 test framework):

// successful compress / decompress example code from:
//    https://mcmap.net/q/430166/-net-gzipstream-compress-and-decompress
[TestMethod]
public void Test_zipping_with_memorystream_and_corrupting_compressed_data()
{
   const string sample = "This is a compression test of microsoft .net gzip compression method and decompression methods";
   var encoding = new ASCIIEncoding();
   var data = encoding.GetBytes(sample);
   string sampleOut = null;
   byte[] cmpData;

   // Compress 
   using (var cmpStream = new MemoryStream())
   {
      using (var hgs = new GZipStream(cmpStream, CompressionMode.Compress))
      {
         hgs.Write(data, 0, data.Length);
      }
      cmpData = cmpStream.ToArray();
   }

   int corruptBytesNotDetected = 0;

   // corrupt data byte by byte
   for (var byteToCorrupt = 0; byteToCorrupt < cmpData.Length; byteToCorrupt++)
   {
      // corrupt the data
      cmpData[byteToCorrupt]++;

      using (var decomStream = new MemoryStream(cmpData))
      {
         using (var hgs = new GZipStream(decomStream, CompressionMode.Decompress))
         {
            using (var reader = new StreamReader(hgs))
            {
               try
               {
                  sampleOut = reader.ReadToEnd();

                  // if we get here, the corrupt data was not detected by GZipStream
                  // ... okay so long as the correct data is extracted
                  corruptBytesNotDetected++;

                  var message = string.Format("ByteCorrupted = {0}, CorruptBytesNotDetected = {1}",
                     byteToCorrupt, corruptBytesNotDetected);

                  Assert.IsNotNull(sampleOut, message);
                  Assert.AreEqual(sample, sampleOut, message);
               }
               catch(InvalidDataException)
               {
                  // data was corrupted, so we expect to get here
               }
            }
         }
      }

      // restore the data
      cmpData[byteToCorrupt]--;
   }
}

When I run this test, I get:

Assert.AreEqual failed. Expected:<This is a compression test of microsoft .net gzip compression method and decompression methods>. Actual:<>. ByteCorrupted = 11, CorruptBytesNotDetected = 8

So, this means there were actually 7 cases where corrupting the data made no difference (the string was successfully recovered), but corrupting byte 11 neither threw an exception nor recovered the data.

Am I missing something or doing something wrong? Can anyone see why the corrupt compressed data is not being detected?

Uniflorous asked 26/2, 2012 at 19:51. Comments:
Does GZip have a CRC? (There might be many "corrupt bytes" that can't be detected while decompressing, as they might still result in a "valid" or "not currently yet invalid" stream.) - Constellation
The docs say it does have a CRC. Nonetheless, some bytes can be corrupted and the data is still recovered successfully (which is okay; those are the 7 cases noted above). However, if it doesn't throw an exception and it doesn't recover the data properly, then I think this is a problem (which is where the iterations in the test stop). - Uniflorous
Check the exceptions that are thrown: it's not always InvalidDataException. - Constellation
However, I can't explain why the CRC32 doesn't cause an exception to be thrown in all cases (it should be far more reliable than it appears here). It detects CRC errors in only a measly 60-80% of the invalid cases "over byte 11" that don't cause a Huffman lookup exception or other internal decoding error. I was under the impression that CRC32 should give detection rates of well over 99.999%, which does not seem to be the case here :-( - Constellation
I'm only catching InvalidDataException, so any other exception would cause the test to fail for that reason. However, the test only fails because no exception was thrown and the decompressed data is not as expected. Thanks for the idea though. I still feel like I'm missing something, but the test seems pretty straightforward. - Uniflorous
I was wrong, so I updated/removed the old comments. The test stops because an IndexOutOfRangeException is thrown in certain cases (not an InvalidDataException) when the Huffman table lookups fail (bad Microsoft, leaking implementation details!). - Constellation
In the meantime, I've "wrapped" the compressed data result in a simple CRC16 checksum and it catches 100% of the corrupted bytes (same test as above), as I expected. This leads me to think the "CRC" applied by GZipStream doesn't cover all the data. Even so, I wouldn't expect it to "pass" and yet still give bad data. - Uniflorous
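The checksum-wrapping approach mentioned in that comment can be sketched as follows. This is a Python sketch, not the original C# code; `wrap_with_checksum` and `unwrap_and_verify` are hypothetical names, and `zlib.crc32` stands in for the CRC16 the comment describes:

```python
import gzip
import struct
import zlib

def wrap_with_checksum(payload: bytes) -> bytes:
    """Compress the payload, then prepend a CRC32 of the *compressed*
    bytes so corruption is caught before decompression is attempted."""
    compressed = gzip.compress(payload)
    return struct.pack("<I", zlib.crc32(compressed)) + compressed

def unwrap_and_verify(blob: bytes) -> bytes:
    """Verify the outer checksum, then decompress."""
    stored_crc, = struct.unpack("<I", blob[:4])
    compressed = blob[4:]
    if zlib.crc32(compressed) != stored_crc:
        raise ValueError("checksum mismatch: compressed data is corrupt")
    return gzip.decompress(compressed)
```

Because the outer checksum covers every compressed byte (header and trailer included), any single-byte corruption of the stored blob is reported before decompression ever runs, which matches the "catches 100% of the corrupted bytes" result above.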
Per the specification it should cover all of the uncompressed data (CRC32 + length) :( Why the catch rates are so low is surprising and undesirable. It would be interesting to see what other (better) .NET gzip implementations do; it might just be (another case of) GZipStream failing. (I tested on .NET 3.5 SP1.) - Constellation
That's a good idea, putting other classes to the test. I did try DeflateStream and got similar results. I may try some other solutions tomorrow. Thanks for your thoughts and ideas. - Uniflorous
I think this problem "runs deeper". I have added a sister question: #9472326 - Constellation

There is a 10-byte header in the gzip format, for which the last seven bytes can be changed without resulting in a decompression error. So the seven cases you noted with no corruption are expected.
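This can be checked directly with Python's gzip module, used here only as a convenient stand-in for GZipStream: bytes 0-2 of the header (magic number and compression method) are validated, while bytes 3-9 (FLG, the 4-byte MTIME, XFL, and OS) can each be incremented without upsetting decompression:

```python
import gzip

data = b"This is a compression test"
blob = gzip.compress(data, mtime=0)  # mtime=0 keeps the header deterministic

# Incrementing any of bytes 3-9 (FLG, MTIME x4, XFL, OS) is harmless:
# the original data still comes back cleanly.
for i in range(3, 10):
    corrupted = bytearray(blob)
    corrupted[i] = (corrupted[i] + 1) & 0xFF
    assert gzip.decompress(bytes(corrupted)) == data

# Byte 0 is half of the gzip magic number; corrupting it is detected.
corrupted = bytearray(blob)
corrupted[0] ^= 0xFF
try:
    gzip.decompress(bytes(corrupted))
except gzip.BadGzipFile:
    print("magic-number corruption detected")
```

That accounts exactly for the seven "successful" corruptions the question observed at the start of the stream.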

It should be vanishingly rare to not detect an error with a corruption anywhere else in the stream. Most of the time the decompressor will detect an error in the format of the compressed data, never even getting to the point of checking the crc. If it does get to the point of checking a crc, that check should fail very nearly all the time with a corrupted input stream. ("Nearly all the time" means a probability of about 1 - 2^-32.)

I just tried it (in C with zlib) using your sample string, which produces an 84-byte gzip stream. Incrementing each of the 84 bytes leaving the remainder the same, as you did, resulted in: two incorrect header checks, one invalid compression method, seven successes, one invalid block type, four invalid distances set, seven invalid code lengths set, four missing end-of-block, 11 invalid bit length repeat, three invalid bit length repeat, two invalid bit length repeat, two unexpected end of stream, 36 incorrect data check (that's the actual CRC error), and four incorrect length check (another check in the gzip format for the correct uncompressed data length). In no cases was a corrupted compressed stream not detected.
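The same byte-by-byte sweep can be reproduced against a known-good zlib-based decompressor. Here is a Python sketch (Python's gzip module wraps zlib, so it should behave like the C test described above; exact error categories depend on the compressed stream):

```python
import gzip
import zlib

sample = (b"This is a compression test of microsoft .net gzip "
          b"compression method and decompression methods")
blob = bytearray(gzip.compress(sample, mtime=0))

harmless = 0              # corruption changed nothing observable
undetected_and_wrong = 0  # the failure mode the question reports
for i in range(len(blob)):
    blob[i] = (blob[i] + 1) & 0xFF  # corrupt one byte
    try:
        if gzip.decompress(bytes(blob)) == sample:
            harmless += 1           # one of the ignorable header bytes
        else:
            undetected_and_wrong += 1
    except (gzip.BadGzipFile, EOFError, zlib.error, OSError):
        pass                        # corruption detected, as expected
    blob[i] = (blob[i] - 1) & 0xFF  # restore

print(harmless, undetected_and_wrong)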

So there must be a bug somewhere, either in your code or in the class.

Update:

It appears that there is a bug in the class.

Remarkably (or maybe not remarkably), Microsoft has concluded that they won't fix this bug!

Connotative answered 27/2, 2012 at 5:17. Comments:
The .NET GZipStream class produced 178 bytes of compressed data. Changing 45 of the bytes did not throw an exception, and in 38 of those cases the data was not recovered either. It's strongly pointing to a bug in the framework class, especially since you've tested the approach with another, completely different implementation. Nonetheless, I'll review the code with someone else tomorrow. Thanks for taking the time to try this out. - Uniflorous
@Mark Adler the issue is that the CRC32 check does not seem to be working within expected reliability bounds :( Try the posted example (but catch Exception, not InvalidDataException) and you'll see that only sometimes is the CRC determined to be invalid, which is odd. (Sometimes the internal Huffman table usage generates an IndexOutOfRangeException, which is why I suggest catching Exception for testing, but that's another issue that can cause the reader to fail before getting to the CRC check.) - Constellation
I do not have a .NET platform, so I can't try this directly. - Connotative
A 94-byte input to gzip will never produce a 178-byte output. The most you could get for incompressible data would be 117 bytes out (the original 94, plus five bytes to mark the stored data, plus 18 bytes for the header and trailer). I have no idea what this GZipStream class is producing, since it does not seem to be producing what gzip produces. - Connotative
I fully understand that the poster is not getting a report back of a CRC-32 mismatch or an invalid format. That is not a characteristic of the gzip format. The gzip format provides the information needed to detect the corruptions the poster is testing very nearly all of the time. Therefore there is a bug in the class or the application using it, assuming that the GZipStream class actually generates the gzip format. - Connotative
@Mark Adler That, sadly, is not true of the standard .NET BCL implementation; GZipStream is notoriously bad about not doing this "obvious optimization" (as well as having inflated / "non-ideal" dictionaries, etc.) :( After running my own tests, I would not rule out "a bad implementation"; an implementation of gzip that only sometimes detects the corruption via the CRC32 is odd. (And it does sometimes, say 60-80% of the time, detect the corruption.) - Constellation
(So, yes, I entirely agree with your response, but it does leave out the why and where bits.) - Constellation
I reviewed the test with someone else and we couldn't see anything fundamentally wrong. The StreamReader was a bit suspicious, so I got rid of it and used a simple byte array to avoid further string conversions, but the results were similar. I agree we should practically "never" miss corrupt data with a CRC (we use it often in many protocols). I'll have to apply my own checksum, or use another library. I'm not too concerned with the performance on small data sets, since the application will use much larger ones, but I appreciate knowing the performance relative to other implementations. - Uniflorous
Since it pretty much seems like a bug at this point, I'll report this on Microsoft Connect too. - Uniflorous
@AndrewTeare I know this is old, but I just hit the same issue. Is connect.microsoft.com/VisualStudio/feedback/details/726860/… your Connect case? - Hale
Yes, that's the issue I reported. I should have put a link here too; thanks for adding it. Too bad it "won't be fixed". I ended up using GZipStream, but calculated a checksum and saved it with the data to detect corruption. - Uniflorous
@AndrewTeare The link is broken; MS has made a mess of its issue tracker. I would like to read MS's answer, if you can help me find it or repost it here, especially since Mark Adler cited this issue in #17213464. - Maclay
