Get the file size of a very large .gz file on a 64-bit platform

According to the gzip specification, the uncompressed file size is stored in the last 4 bytes of a .gz file.

I have created 2 files with

dd if=/dev/urandom of=500M bs=1024 count=500000
dd if=/dev/urandom of=5G bs=1024 count=5000000

I gzipped them

gzip 500M 5G

I checked the last 4 bytes with

tail -c4 500M.gz | od -I      (returns 512000000 as expected)
tail -c4 5G.gz | od -I        (returns 825032704, which is not what I expected)

It seems that crossing the invisible 32-bit barrier makes the value written to ISIZE meaningless, which is more annoying than if they had used an error bit instead.

Does anyone know a way to get the uncompressed size from a .gz file without extracting it?

thanks

specification: http://www.gzip.org/zlib/rfc-gzip.html

Edit: if anyone wants to try it out, you could use /dev/zero instead of /dev/urandom.

Debut answered 27/12, 2009 at 9:18 Comment(1)
dd seek=10G if=/dev/zero of=out.dat count=0 is handier on most filesystems. - Juvenal
S
8

There isn't one.

The only way to get the exact size of a compressed stream is to actually go and decompress it (even if you write everything to /dev/null and just count the bytes).
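
If you are doing this from code rather than the shell, here is a minimal sketch of that count-only pass in Python. It assumes the standard gzip module; the function name uncompressed_size and the 1 MiB chunk size are my own illustrative choices. The data is still fully decompressed, but only one chunk is held in memory at a time and nothing is written to disk.

```python
import gzip


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the exact uncompressed size by streaming through the file.

    chunk_size is an arbitrary illustrative value (1 MiB per read).
    """
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # decompress the next chunk
            if not chunk:               # empty bytes object means end of data
                break
            total += len(chunk)         # count it, then let it be discarded
    return total
```

Calling uncompressed_size("5G.gz") should give the true size even above 4 GiB, at the cost of a full decompression pass.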

It's worth noting that ISIZE is defined as

ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.

in the gzip RFC, so it isn't actually breaking at the 32-bit barrier; what you're seeing is expected behavior.
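
To make the modulo concrete, here is a small Python sketch that reads the trailer the same way tail -c4 does and reproduces the number from the question. The helper name read_isize and the temporary demo.gz file are hypothetical, used only for illustration. Note that it assumes a single-member .gz file with no trailing data; for concatenated gzip members the last 4 bytes describe only the final member.

```python
import gzip
import os
import struct
import tempfile


def read_isize(path):
    """Return the ISIZE trailer field: uncompressed size modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)                    # ISIZE is the last 4 bytes
        (isize,) = struct.unpack("<I", f.read(4))  # little-endian unsigned 32-bit
    return isize


# The 5G file from the question: 5,000,000 blocks of 1024 bytes.
true_size = 5_000_000 * 1024
print(true_size % 2**32)  # 825032704, the "unexpected" value from the question

# Self-check on a small temporary file (demo.gz is a made-up name).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.gz")
    payload = b"x" * 100_000
    with gzip.open(demo, "wb") as f:
        f.write(payload)
    print(read_isize(demo) == len(payload) % 2**32)  # True
```

So the trailer is exact only when the original is known to be smaller than 2^32 bytes; otherwise it is just the low 32 bits of the real size.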

Seibert answered 27/12, 2009 at 9:26 Comment(0)
C
4

I haven't tried this with a file of the size you mentioned, but I often find the uncompressed size of a .gz file with

zcat file.gz | wc -c

when I don't want to leave the uncompressed file lying around, or bother to compress it again.

Obviously, the data is still decompressed, but it is piped to wc instead of being written to disk.

It's worth a try, anyway.

EDIT: When I tried creating a 5G file with data from /dev/random, it produced a file named 5G with a size of 5120000000 bytes, although my file manager reported this as 4.8G.

Then I compressed it with gzip 5G; the resulting 5G.gz was about the same size (random data barely compresses).

Then zcat 5G.gz | wc -c reported the same size as the original file: 5120000000 bytes. So my suggestion seemed to have worked for this trial, anyway.

Thanks for waiting

Cohberg answered 27/12, 2009 at 9:24 Comment(4)
Yes, thanks, but my question was more along the lines of: how do I get the uncompressed file size without actually decompressing? For files whose size fits in 32 bits, you can just read the last 4 bytes. This is not possible for larger files, and, as you have done, the only way is to decompress. - Debut
But my method performed a decompression which didn't affect the original compressed file, and didn't create an extra uncompressed file. There would be no cleaning up afterward. And I think it's worth noting that the answer you accepted said that decompression was the only way to get the exact size. It makes sense that the only way to find out what's in the box is to open it. - Cohberg
Yes, it didn't affect the original file, but my concern was not leaving the file untouched; it was speed. If I want to allocate an array for the entire data, I need to know the size up front. That requires one decompression to get the size, followed by another for the actual data copy. This is not necessary if the file is smaller than 4 GiB (2^32 bytes), since ISIZE is then exact. Standard gunzip can also decompress to stdout: gunzip -c file | wc -c. But thanks for your input :) - Debut
All comments aside: if all else fails, this is a practical solution. - Betake
B
0

gzip does have a -l option:

       -l --list
          For each compressed file, list the following fields:

              compressed size: size of the compressed file
              uncompressed size: size of the uncompressed file
              ratio: compression ratio (0.0% if unknown)
              uncompressed_name: name of the uncompressed file

          The uncompressed size is given as -1 for files not in gzip format, such as compressed .Z files. To
          get the uncompressed size for such a file, you can use:

              zcat file.Z | wc -c

          In combination with the --verbose option, the following fields are also displayed:

              method: compression method
              crc: the 32-bit CRC of the uncompressed data
              date & time: time stamp for the uncompressed file

          The compression methods currently supported are deflate, compress, lzh (SCO compress -H) and pack.
          The crc is given as ffffffff for a file not in gzip format.

          With --name, the uncompressed name, date and time are those stored within the compress file if
          present.

          With --verbose, the size totals and compression ratio for all files is also displayed, unless some
          sizes are unknown. With --quiet, the title and totals lines are not displayed.
Betake answered 17/10, 2013 at 20:15 Comment(1)
This solution works only for a disk file, not a stream (the original question did not specify a stream, so in that respect it's a viable answer). Unfortunately, for file sizes larger than 2^32-1 bytes, the uncompressed size is shown modulo 2^32 and so is unreliable. - Greenness
