File under: "Unexpected Efficiency Dept."
The first 90 million numbers take up about 761MB, as output by:
seq 90000000
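As a quick sanity check of that size without writing anything to disk (the byte count is plain arithmetic: most of the 90 million numbers are 8 digits plus a newline):
seq 90000000 | wc -c    # 798888897 bytes, about 761 MiB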
According to man parallel, it can speed up gzip's compression of big files by chopping the input into chunks and using different CPUs to compress the chunks. So even though gzip is single-threaded, this technique effectively makes it multi-threaded:
seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
Took 46 seconds, on an Intel Core i3-2330M (4) @ 2.2GHz.
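Note what that pipeline actually produces: each chunk is compressed by its own gzip process, and -k concatenates the results in input order, so bigfile.gz is a sequence of independent gzip members rather than one long stream. That is still a valid .gz file, because gzip and zcat decompress concatenated members back to back. A tiny illustration (throwaway file name):
printf 'hello ' | gzip -9 > parts.gz     # first member
printf 'world\n' | gzip -9 >> parts.gz   # second member, appended
zcat parts.gz                            # prints: hello world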
Pipe that to plain old gzip:
seq 90000000 | gzip -9 > bigfile2.gz
Took 80 seconds, on the same CPU. Now the surprise:
ls -log bigfile*.gz
Output:
-rw-rw-r-- 1 200016306 Jul 3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul 3 17:30 bigfile2.gz
300K larger? That didn't look right. First I checked with zdiff whether the files had the same contents -- yes, they are the same. I'd have supposed any compressor would do better with a continuous data stream than a chunked one. Why isn't bigfile2.gz smaller than bigfile.gz?
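(For reference, the content check itself is cheap; either of these should do, the second assuming bash process substitution:)
zdiff bigfile.gz bigfile2.gz && echo identical
cmp <(zcat bigfile.gz) <(zcat bigfile2.gz) && echo identical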
bigfile2.gz comes out smaller here, and the elapsed time is almost identical for the parallel and the standard invocation. – Deceit

seq does not produce the same output. You can try jot instead. – Omophagia

pigz comes out smaller and faster than parallel + gzip (198345773 here, against 200381681 from gzip, and 52s user and 6½s real, against 36½s user and real). – Tetroxide

parallel --pipe is inefficient. Use parallel --pipepart if possible (it is not possible in this case, because you read from a pipe, but if you had a file, --pipepart would be faster). – Hebdomadary