Is partial gz decompression possible?

Asked 15/5, 2014 at 10:50 Answered 10/1, 2018 at 14:46

I work with images that are stored as .gz files (my image processing software can read .gz files directly, which saves disk space and read time), and I need to check the header of each file.

The header is just a small struct of a fixed size at the start of each image, and for images that are not compressed, checking it is very fast. For reading the compressed images, I have no choice but to decompress the whole file and then check this header, which of course slows down my program.

Would it be possible to read the first segment of a .gz file (say a couple of K), decompress this segment and read the original contents? My understanding of gz is that after some bookkeeping at the start, the compressed data is stored sequentially -- is that correct?

so instead of
1. open big file F
2. decompress big file F
3. read 500-byte header
4. re-compress big file F

do
1. open big file F
2. read first 5 K from F as stream A
3. decompress A as stream B
4. read 500-byte header from B

I am using libz.so but solutions in other languages are appreciated!

Wiburg answered 15/5, 2014 at 10:50 Comment(0)

You can use gzip -cd file.gz | dd ibs=1024 count=10 to uncompress just the first 10 KiB, for example.

gzip -cd decompresses to the standard output.

Pipe | this into the dd utility.

The dd utility copies the standard input to the standard output. So dd ibs=1024 sets the input block size to 1024 bytes instead of the default 512.

And count=10 copies only 10 input blocks, thus halting the gzip decompression.

You'll want to do gzip -cd file.gz | dd count=1 using the standard 512 block size and just ignore the extra 12 bytes.

A comment highlights that you can use gzip -cd file.gz | head -c $((1024*10)) or, in this specific case, gzip -cd file.gz | head -c 512. The comment's claim that the original dd command relies on gzip decompressing in 1024-byte blocks doesn't seem to be true. For example, dd ibs=2 count=10 decompresses the first 20 bytes.
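For readers who would rather do this in a program than in a pipeline, here is a minimal Python sketch of the same idea as head -c. It is an illustration rather than part of the original answer: the helper name read_header and the 500-byte demo header are assumptions, and the demo file is a throwaway created only so the snippet runs on its own.

```python
import gzip
import os
import tempfile

def read_header(path, size=512):
    """Return the first `size` decompressed bytes of a gzip file.

    gzip.open decompresses lazily, so only as much of the file is
    read and inflated as is needed to produce `size` bytes.
    """
    with gzip.open(path, "rb") as f:
        return f.read(size)

# Demo data (hypothetical): a 500-byte header followed by a large payload.
header = b"H" * 500
payload = os.urandom(1 << 20)

# Create a temporary .gz file to read back from.
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    demo_path = tmp.name
with gzip.open(demo_path, "wb") as f:
    f.write(header + payload)

first = read_header(demo_path, 500)
os.remove(demo_path)
```

The read stops after the requested number of bytes, so the remainder of the file is never decompressed.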

Blandishment answered 6/2, 2015 at 10:5 Comment(1)

Note that using dd this way depends on gzip writing in multiples of 1024 bytes, because dd is block-oriented (number of read system calls), not byte-oriented. Use head -c $((1024*10)) which is easier and more efficient. See the related How to partially extract zipped huge plain text file? – Caspar 10/1, 2018 at 14:30

Yes, it is possible.

But don't reinvent the wheel: the HDF5 format supports several compression algorithms (gzip among them), and you can address individual pieces of the data. It works on Linux and Windows, and there are wrappers for many languages. It also supports reading and decompressing in parallel, which is very useful if you use high compression levels.

Here is a comparison of read speed using different compression algorithms from Python through PyTables:

[Plot: read speed by compression algorithm, PyTables]

Apetalous answered 15/5, 2014 at 10:59 Comment(2)

Thanks for the info and for confirming! My question is a bit more basic though: firstly, I need to use the data and other software (.gz only) that I am given. Also, I could not see on the HDF5 page where partial decompression is applied or provided. That is the only thing I need; HDF5 looks like a very complex product. – Wiburg 15/5, 2014 at 12:28

The function that reads from a dataset is H5Dread, which lives in src/H5Dio.c. You can read the source and see how they do it. Beyond that, I am sorry, I cannot help you. – Apetalous 15/5, 2014 at 12:37

A Deflate stream can have multiple blocks back to back. But you can always decompress just the number of bytes you want, even if it's part of a larger block. The zlib function gzread takes a length arg, and there are various other ways to decompress a specific amount of plaintext bytes, regardless of how long the full stream is. See the zlib manual for a list of functions and how to use them.
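As a rough illustration of decompressing only a fixed amount of plaintext, here is a sketch using Python's zlib module, which wraps the same library. It is not the gzread call itself: the helper name read_header_streaming, the 1024-byte chunk size, and the demo data are assumptions. The max_length argument of decompress plays the role of the length argument mentioned above.

```python
import gzip
import io
import os
import zlib

def read_header_streaming(fileobj, size=500, chunk=1024):
    """Feed compressed chunks to zlib until `size` plaintext bytes are out."""
    # MAX_WBITS | 16 tells zlib to expect a gzip wrapper (header + trailer).
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    out = b""
    while len(out) < size and not d.eof:
        data = fileobj.read(chunk)
        if not data:
            break  # compressed input ended before `size` bytes were produced
        # max_length caps how much plaintext this call may return.
        out += d.decompress(data, size - len(out))
    return out

# Demo data (hypothetical): a 500-byte header followed by a large payload.
header = bytes(range(250)) * 2
blob = gzip.compress(header + os.urandom(1 << 20))

f = io.BytesIO(blob)
out = read_header_streaming(f)
consumed = f.tell()  # compressed bytes actually read from the "file"
```

How many compressed bytes are needed depends on the data, which is why the sketch loops over small chunks instead of assuming a fixed prefix such as 5 K is always enough.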

It's not clear if you want to modify just the headers. (You mention recompressing the whole file, but option B doesn't recompress anything). If so, write headers in a separate Deflate block so you can replace that block without recompressing the rest of the image. Use Z_FULL_FLUSH when you call the zlib deflate function to write the headers. You probably don't need to record the compressed length of the headers anywhere; I think it can be computed when reading them to figure out which bytes to replace.
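A sketch of the Z_FULL_FLUSH idea, again through Python's zlib wrapper instead of the C API. The 500-byte header layout, the variable names, and the payload are made up for illustration. It only shows that the header ends up in its own byte-aligned segment that can be decompressed independently; it does not perform an in-place replacement.

```python
import os
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # ask zlib for a gzip wrapper

# Demo data (hypothetical): a 500-byte header and an image payload.
header = b"IMG1" + bytes(496)
payload = os.urandom(64 * 1024)

c = zlib.compressobj(wbits=GZIP_WBITS)
# Z_FULL_FLUSH ends the current Deflate block on a byte boundary and
# resets the compression state, so later data never refers back to the header.
head_part = c.compress(header) + c.flush(zlib.Z_FULL_FLUSH)
body_part = c.compress(payload) + c.flush()  # default flush mode is Z_FINISH
blob = head_part + body_part

# The header segment alone is enough to recover the header.
d = zlib.decompressobj(wbits=GZIP_WBITS)
recovered = d.decompress(head_part)

# The concatenation is still one ordinary gzip stream.
whole = zlib.decompress(blob, wbits=GZIP_WBITS)
```

Here len(head_part) is the compressed length of the header segment. One caveat if you do swap that segment: the gzip trailer stores a CRC-32 and length of the whole uncompressed stream, so those would presumably need updating as well.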

If you aren't modifying anything, recompressing the whole file makes no sense. You can seek and restart decompression from the start after finding headers you like...

Caspar answered 10/1, 2018 at 14:46 Comment(0)
