Why would GNU parallel chunking improve gzip's compressed size?

File under: "Unexpected Efficiency Dept."

The first 90 million numbers take up about 761MB, as output by:

 seq 90000000

According to man parallel, it can speed up gzip's compression of big files by chopping the input into chunks and using different CPU cores to compress them. So even though gzip is single-threaded, this technique effectively makes it multi-threaded:

seq 90000000  | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

Took 46 seconds, on an Intel Core i3-2330M (4) @ 2.2GHz.

Pipe that to plain old gzip:

seq 90000000  | gzip -9 > bigfile2.gz

Took 80 seconds, on the same CPU. Now the surprise:

ls -log bigfile*.gz

Output:

-rw-rw-r-- 1 200016306 Jul  3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul  3 17:30 bigfile2.gz

Over 300K larger? That didn't look right. First I checked with zdiff whether the files had the same contents -- yes, the same. I'd have supposed any compressor would do better with a continuous data stream than with a chunked one. Why isn't bigfile2.gz smaller than bigfile.gz?

Corrigan asked 4/7, 2016 at 7:7 Comment(4)
Interestingly on my iMac, bigfile2.gz comes out smaller and the elapsed time is almost identical for parallel and standard invocation.Deceit
@MarkSetchell For some reason the Mac OS X seq does not produce the same output. You can try jot instead.Omophagia
It may be relevant to note that pigz comes out smaller and faster than parallel+gzip (198345773 here, against 200381681 from gzip, and 52s user and 6½s real, against 36½s user and real).Tetroxide
parallel --pipe is inefficient. Use parallel --pipepart if possible (it is not in this case, because you read from a pipe, but if you had a file, --pipepart would be faster).Hebdomadary

The reason is that for this particular, rather unusual input, smaller deflate blocks are better than larger ones. By default gzip uses larger deflate blocks, since that works best for normal input data. The parallel command forces a few smaller deflate blocks by breaking up the input every 1 MB, resulting in a small gain, though most of the blocks are still the same size.
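
As a rough illustration of what that pipeline emits, the sketch below compresses stdin in fixed 1 MB pieces, each as its own gzip member written back to back, which is essentially what parallel --pipe plus gzip produces, minus the parallelism (parallel's exact split points may differ slightly). The file name chunkgz.c and the fixed chunk size are illustrative assumptions. Concatenated gzip members form one valid gzip file, which is why zdiff reported identical contents, and every chunk boundary forces the compressor to end a deflate block there.

/* chunkgz.c -- hypothetical sketch: compress stdin as independent gzip
 * members, one per 1 MiB chunk, written back to back; roughly what
 * "parallel --pipe --recend '' -k gzip -9" produces without the parallelism.
 * Build: cc -O2 -o chunkgz chunkgz.c -lz */
#include <stdio.h>
#include <zlib.h>

#define CHUNK (1024 * 1024)     /* assumed chunk size; parallel's default --block is about 1 MB */

int main(void)
{
    static unsigned char in[CHUNK];
    static unsigned char out[CHUNK + CHUNK / 2];   /* generous: one deflate() call always fits */
    z_stream strm = {0};
    size_t got;

    /* windowBits 15 + 16 selects a gzip wrapper; level 9 matches gzip -9 */
    if (deflateInit2(&strm, 9, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    while ((got = fread(in, 1, CHUNK, stdin)) > 0) {
        strm.next_in = in;
        strm.avail_in = got;
        strm.next_out = out;
        strm.avail_out = sizeof(out);
        deflate(&strm, Z_FINISH);                  /* emit one complete gzip member */
        fwrite(out, 1, sizeof(out) - strm.avail_out, stdout);
        deflateReset(&strm);                       /* start the next member from scratch */
    }
    deflateEnd(&strm);
    return 0;
}

Whether the same small gain shows up with zlib can then be checked directly against a single unchunked stream at the same settings, with no parallelism involved.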

You can do much better by setting a smaller deflate block size for all of the blocks, using zlib's memLevel parameter in deflateInit2(). Here I compress the same output in a single thread each time, using memLevel values from 9 down to 2, where a smaller memLevel means a smaller deflate block size (note that zlib does a little better than your gzip at the default level):

  • 9 - 199688429
  • 8 - 198554111 (default)
  • 7 - 191582070
  • 6 - 184880482
  • 5 - 181295029
  • 4 - 180137425 (optimum for this input)
  • 3 - 181176610
  • 2 - 185759115
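
For reference, here is a minimal sketch of how such a test can be run with zlib. It is not the program used to produce the numbers above; the name gzmem, its command line, and the compression level passed to deflateInit2() are illustrative assumptions (the level used for the table is not stated, and per the comments below the default level actually beats -9 on this input).

/* gzmem.c -- hypothetical sketch: gzip-compress stdin with a chosen memLevel.
 * Build: cc -O2 -o gzmem gzmem.c -lz
 * Use:   seq 90000000 | ./gzmem 4 | wc -c */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define CHUNK 262144

int main(int argc, char **argv)
{
    int memLevel = argc > 1 ? atoi(argv[1]) : 8;   /* valid range 1..9; zlib's default is 8 */
    static unsigned char in[CHUNK], out[CHUNK];
    z_stream strm = {0};
    int flush;

    /* windowBits 15 + 16 selects a gzip wrapper; the compression level is an assumption */
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, memLevel,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    do {
        strm.avail_in = fread(in, 1, CHUNK, stdin);
        if (ferror(stdin)) {
            deflateEnd(&strm);
            return 1;
        }
        strm.next_in = in;
        flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
        do {                                       /* run deflate() until this chunk is consumed */
            strm.avail_out = CHUNK;
            strm.next_out = out;
            deflate(&strm, flush);
            fwrite(out, 1, CHUNK - strm.avail_out, stdout);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);

    deflateEnd(&strm);
    return 0;
}

Running it once per memLevel should reproduce the shape of the table, even if the exact byte counts depend on the zlib version and the compression level chosen.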

The optimum memLevel for this data turns out to be 4, for which the compressed data is about 18 MB (9%) smaller than for the default memLevel of 8. For memLevel 8, the deflate block size is 16383 symbols, whereas for memLevel 4 it is 1023 symbols. One symbol is either a literal byte or a match.

The improvement comes from the extremely regular nature of the input, which results in a regular sequence of match and literal commands. The smaller the block size, the fewer such distinct commands appear in a block, and so the fewer bits it takes to code each of them. This is still true for memLevel 3, but by then the overhead of the code description at the beginning of each deflate block cancels out the improvement from fewer distinct codes.

zopfli is a deflate compressor that optimizes the block sizes and the commands selected, and it managed to compress this input to 100,656,812 bytes. It took three and a half hours, though! zopfli can be invoked through pigz using compression level 11 (pigz -11).

Omophagia answered 4/7, 2016 at 15:10 Comment(7)
Just to be clear, the zlib memLevel 2-9 options are not the same as gzip's compression speed -# (1-9) options, correct?Corrigan
Correct. The 1-9 is a compression level, which controls how hard the compressor searches for matching strings. In fact, for this input the default level of 6 compresses better than 9! But that's a story for another time.Omophagia
Something about this type of data makes 1023 symbols better. Would a finer-grained setting (say 1013 symbols, etc.) compress to some smaller optimum? Also, is the 1023 peculiar to the size of the data set, that is, would 1023 symbols remain optimal if there were 9 million numbers, or 900 million? Answer: testing some values smaller than 90 million (9 million, 900K, 90K), parallel generally seems to do a little better than gzip. 900 million also gives parallel the minor win.Corrigan
You could do better with a smaller block size, if fewer distinct commands were used. I am imagining constructing a deflate stream by hand for this data: it would have very small blocks with one number to introduce each new sequence of 1000 numbers, and then a block with just matches for the other 999. See my note on zopfli, which optimized this. I'll check later what block sizes it used.Omophagia
Turns out parallel has a --block <size> option, which sets the chunk size. Testing on a list of 90000 (half a meg of data), the best block size for compression is about 1024 bytes, but the overhead of parallel's splitting and whatnot makes it take 40x longer.Corrigan
Actually, this is very interesting. It is probably possible to recompress 9*9 times and see which gives the best result...Powerdive
@ShimonDoodkin Go for it.Omophagia

I think it is the frequency of dictionary building that is different. This is the balance between speed and compression efficiency, like gzip vs. lzma.

I guess it is more frequent in the split case. So the numbers of the dictionary are more similar to the following.

There is a 20-minute lecture on YouTube, Raul Fraile: How GZIP compression works | JSConf EU 2014.

Powerdive answered 4/7, 2016 at 9:6 Comment(1)
Re: "The following." It is not too clear what noun-object the following signifies. Sorry, but the Raul Fraile lecture, delivered with a thick Spain accent in a timid soft monotone by a self confessed non-expert in compression, is too slow for my American ears accustomed to fast talkers -- it would be better to just quote the part you think is relevant, or link to only the most relevant segment of the video.Corrigan

The effect is likely due to compression block size. Compressing the same input stream with a range of settings like this:

for i in {1..9}; do seq 90000000 | gzip -$i >$i.gz; done

gives file sizes that reach a minimum at gzip -5:

-rw-r--r-- 1 203473375 Jul  4 16:39 1.gz
-rw-r--r-- 1 201160853 Jul  4 16:40 2.gz
-rw-r--r-- 1 200181562 Jul  4 16:40 3.gz
-rw-r--r-- 1 204266147 Jul  4 16:40 4.gz
-rw-r--r-- 1 199144028 Jul  4 16:40 5.gz
-rw-r--r-- 1 199688429 Jul  4 16:40 6.gz
-rw-r--r-- 1 199689546 Jul  4 16:41 7.gz
-rw-r--r-- 1 200376213 Jul  4 16:41 8.gz
-rw-r--r-- 1 200381681 Jul  4 16:42 9.gz

That's not far off gzip's default of -6.

Tetroxide answered 4/7, 2016 at 15:54 Comment(4)
No, that is not the effect here. The compression level is not being changed. Furthermore, the compression level does not change the block size. You are seeing yet another effect, which is the higher compression level finding longer matches, but that improvement being countered by a larger number of distinct lengths and distances, requiring more bits per match to code.Omophagia
I thought that the gzip program changed block size when it set compression level, but I now sit corrected. Thanks @Mark for correcting me!Tetroxide
Trivia: wasting 15 minutes of CPU making a comparative parallel vs plain gzip table, time for f in {1..9} ; do echo $f" " $(seq 90000000 | gzip -$f | wc -c) " " $(seq 90000000 | parallel --pipe --recend '' -k gzip -$f | wc -c) ; done, reveals that plain gzip is a little smaller for -1 through -3, and larger thereafter. parallel reaches its minimum with gzip -5 at 198735045 bytes.Corrigan
More trivia: Adding pigz to that loop, $(seq 90000000 | pigz -$f | wc -c), shows its sweet spot is also -5, at 197271587 bytes. pigz is smallest every time, except for -2 where it comes in 2nd place after gzip.Corrigan
