Is this a bug in the Java GZIPInputStream class?

I noticed that some of my gzip decoding code seemed to be failing to detect corrupted data. I think I have traced the problem to the Java GZIPInputStream class. In particular, it seems that when you read the entire stream with a single 'read' call, corrupted data doesn't trigger an IOException. If you read the same corrupted data in two or more calls, it does trigger an exception.

I wanted to see what the community here thought before I consider filing a bug report.

EDIT: I have modified my example because the previous one did not illustrate what I perceive to be the issue as clearly. In this new example, a 10-byte buffer is gzipped, one byte of the gzipped buffer is modified, and then it is ungzipped. The call to 'GZIPInputStream.read' returns 10 as the number of bytes read, which is what you would expect for a 10-byte buffer. Nevertheless, the unzipped buffer differs from the original (due to the corruption). No exception is thrown. I did note that calling 'available' after the read returns '1' instead of the '0' it would return if EOF had been reached.

Here is the source:

  @Test public void gzip() {
    try {
      int length = 10;
      byte[] bytes = new byte[]{12, 19, 111, 14, -76, 34, 60, -43, -91, 101};
      System.out.println(Arrays.toString(bytes));

      //Gzip the byte array
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      GZIPOutputStream gos = new GZIPOutputStream(baos);
      gos.write(bytes);
      gos.finish();
      byte[] zipped = baos.toByteArray();

      //Alter one byte of the gzipped array.  
      //This should be detected by gzip crc-32 checksum
      zipped[15] = (byte)(0);

      //Unzip the modified array
      ByteArrayInputStream bais = new ByteArrayInputStream(zipped);
      GZIPInputStream gis = new GZIPInputStream(bais);
      byte[] unzipped = new byte[length];
      int numRead = gis.read(unzipped);
      System.out.println("NumRead: " + numRead);
      System.out.println("Available: " + gis.available());

      //The unzipped array is now [12, 19, 111, 14, -80, 0, 0, 0, 10, -118].
      //No IOException was thrown.
      System.out.println(Arrays.toString(unzipped));

      //Assert that the input and unzipped arrays are equal (they aren't).
      //JUnit expects the expected value first, then the actual value.
      org.junit.Assert.assertArrayEquals(bytes, unzipped);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
Parimutuel answered 11/3, 2011 at 18:3 Comment(1)
+1 good question; well written, with self-contained, concise, RUNNABLE example. This is why you got such a quick answer :-)Paulapauldron

I decided to run the test.

Here is what you missed: gis.read(unzipped) returns 1, so it has read only a single byte. That is not an error, because the end of the stream has not been reached yet.

The next read() throws "Corrupt GZIP trailer".

So it's all good! (and there are no bugs, at least in GZIPInputStream)
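
To make the point concrete, here is a minimal sketch (not the original test) that keeps calling read until it returns -1 or throws. It assumes the corruption is in the stored CRC-32, which is the first four bytes of the 8-byte gzip trailer; the class GzipTrailerCheck and its methods gzip, drain and Result are hypothetical names made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class for illustration; not part of the JDK
// and not part of the code in the question.
final class GzipTrailerCheck {

    /** Outcome of draining a gzip stream. */
    static final class Result {
        int bytesDelivered;          // payload bytes handed out by read()
        boolean corruptionDetected;  // true if opening or reading threw an IOException
    }

    /** Gzips a byte array in memory. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(baos)) {
            gos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    /** Calls read() repeatedly until it returns -1 or throws. */
    static Result drain(byte[] gz) {
        Result r = new Result();
        byte[] buf = new byte[64];
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            int n;
            while ((n = gis.read(buf)) != -1) {
                r.bytesDelivered += n;   // data is handed out before the trailer is checked
            }
        } catch (IOException e) {
            r.corruptionDetected = true; // the failure surfaces on a later read()
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] zipped = gzip(new byte[]{12, 19, 111, 14, -76, 34, 60, -43, -91, 101});
        zipped[zipped.length - 8] ^= 0x01; // flip one bit of the stored CRC-32
        Result r = drain(zipped);
        System.out.println("bytes delivered before failure: " + r.bytesDelivered);
        System.out.println("corruption detected: " + r.corruptionDetected);
    }
}
```

The design point is simply that a caller who stops after the first successful read never gives the stream a chance to compare the checksum; the loop has to run until -1.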

Battles answered 11/3, 2011 at 18:3 Comment(8)
+1 Beat me to it :-) At least I had fun poking around in the JDK :-)Paulapauldron
Actually, I was over 99% sure what was happening as soon as I saw the code. byte[]{0,0,-1,-1} is the marker for Z_SYNC_FLUSH, so I thought he might have hit it.Battles
I haven't ever used GZIPInputStream, so I had to trace into it. I think the key is that the corruption was in the trailer and not the data?Paulapauldron
Why does it stop reading instead of continuing to the end and throwing an exception the first time?Parimutuel
@Jacob, I will leave the test to you, but here is a hint: Z_SYNC_FLUSH is used to flush the stream. I am not sure that is the cause. I could debug it to find the exact reason, but I am not interested enough to do so. If you want to debug it yourself, look for jzlib and use it instead, so you can step through everything in pure Java.Battles
It seems to me that GZIPInputStream is not obeying the contract of the 'read' method: "This method blocks until input data is available, end of file is detected, or an exception is thrown." From download.oracle.com/javase/6/docs/api/java/io/…Parimutuel
"This method blocks until input data is available": so when there is SOME data, the method is not required to return all of it. So it's fine, IMO. See the GZIPInputStream documentation itself: the method will block until some input can be decompressed.Battles
I changed my example to one where 'read' returns the same number of bytes as would be expected.Parimutuel
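
Following the discussion of the 'read' contract above, one way to read a payload of known length while still getting the checksum verified might look like the sketch below. It assumes the caller knows the expected uncompressed length up front; GzipReadFully, readExactly and decodesCleanly are hypothetical names chosen for this illustration, not JDK APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class for illustration only.
final class GzipReadFully {

    /** Gzips a byte array in memory. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(baos)) {
            gos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    /**
     * Reads exactly 'length' uncompressed bytes, then performs one more
     * read that must report end of stream. That last read is what gives
     * the stream the chance to verify the gzip trailer.
     */
    static byte[] readExactly(byte[] gz, int length) throws IOException {
        byte[] out = new byte[length];
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            int off = 0;
            while (off < length) {
                // read() may return fewer bytes than requested, so loop.
                int n = gis.read(out, off, length - off);
                if (n == -1) {
                    throw new EOFException("stream ended after " + off + " bytes");
                }
                off += n;
            }
            if (gis.read() != -1) {
                throw new IOException("more than " + length + " bytes in stream");
            }
        }
        return out;
    }

    /** True only if the stream decodes without error to exactly 'expected'. */
    static boolean decodesCleanly(byte[] gz, byte[] expected) {
        try {
            return Arrays.equals(expected, readExactly(gz, expected.length));
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = {12, 19, 111, 14, -76, 34, 60, -43, -91, 101};
        byte[] zipped = gzip(data);
        System.out.println("intact:  " + decodesCleanly(zipped, data));
        zipped[zipped.length - 5] ^= 0x10; // damage the last byte of the stored CRC-32
        System.out.println("damaged: " + decodesCleanly(zipped, data));
    }
}
```

The extra single-byte read after the buffer is full is the important part: a buffer that happens to be exactly the payload size is filled before the stream has looked at the trailer.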
