I need to choose a compression algorithm
Asked Answered
I need to choose a compression algorithm to compress some data. I don't know in advance what type of data I'll be compressing (think of it as a general-purpose tool, kinda like the WinRAR program).

I've heard of the following algorithms but I don't know which one I should use. Can anyone post a short list of pros and cons? For my application the first priority is decompression speed; the second priority is space saved. Compression (not decompression) speed is irrelevant.

  • Deflate
  • Implode
  • Plain Huffman
  • bzip2
  • lzma
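
For what it's worth, here's a minimal self-test sketch in Python for the three of these that have stdlib codecs (zlib for Deflate, bz2 and lzma); Implode and plain Huffman have no built-in module, and sample.bin is a hypothetical stand-in for whatever data you actually have:

import bz2
import lzma
import time
import zlib

data = open("sample.bin", "rb").read()  # hypothetical stand-in for real data

candidates = [
    ("deflate", lambda d: zlib.compress(d, 9), zlib.decompress),
    ("bzip2",   lambda d: bz2.compress(d, 9),  bz2.decompress),
    ("lzma",    lambda d: lzma.compress(d),    lzma.decompress),
]

for name, compress, decompress in candidates:
    packed = compress(data)
    start = time.perf_counter()
    decompress(packed)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(packed)}/{len(data)} bytes, decompressed in {elapsed:.4f}s")
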
Toler answered 7/3, 2010 at 18:52 Comment(2)
Some languages have built-in support for some (maybe all) of these, so you could do some quick testing. I guess this is hard if you don't know the type of data you're going to be compressing, but hopefully you have some idea, or some way of generating data close to what you'll actually be using.Racquelracquet
What is 'some data'? Any hint?Hamelin
D
5

If you need high decompression speed then you should be using LZO. Its compression speed and ratio are decent, but it's hard to beat its decompression speed.
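
A quick round-trip sketch using the third-party python-lzo binding (pip install python-lzo; the lzo.compress / lzo.decompress calls below are that package's API, and the input file name is hypothetical):

import time
import lzo  # third-party python-lzo binding

data = open("sample.bin", "rb").read()  # hypothetical input file

packed = lzo.compress(data, 9)  # level 9 = best ratio python-lzo offers

start = time.perf_counter()
restored = lzo.decompress(packed)
elapsed = time.perf_counter() - start

assert restored == data
print(f"ratio: {len(packed) / len(data):.2%}, decompressed in {elapsed:.4f}s")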

Dailey answered 7/3, 2010 at 18:55 Comment(1)
wow thanks! Never heard of it before but it seems exactly what I'm looking for.Toler
H
12

I ran a few benchmarks compressing a .tar that contained a mix of high entropy data and text. These are the results:

Name  - Compression rate* - Decompression Time
7zip  - 87.8%             - 0.703s
bzip2 - 80.3%             - 1.661s
gzip  - 72.9%             - 0.347s
lzo   - 70.0%             - 0.111s

*Higher is better
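
For anyone who wants to reproduce rough numbers like these, here's a Python sketch that times each tool's decompression through its CLI. It assumes the gzip, bzip2, lzop and 7z binaries are installed, and the archive names are hypothetical:

import subprocess
import time

# Decompress to stdout and discard the output, so only decompression is timed.
commands = {
    "7zip":  ["7z", "x", "-so", "sample.tar.7z"],
    "bzip2": ["bzip2", "-dc", "sample.tar.bz2"],
    "gzip":  ["gzip", "-dc", "sample.tar.gz"],
    "lzo":   ["lzop", "-dc", "sample.tar.lzo"],
}

for name, cmd in commands.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    print(f"{name}: {time.perf_counter() - start:.3f}s")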

From this I came to the conclusion that the compression rate of an algorithm depends on its name; the first in alphabetical order will be the one with the best compression rate, and so on.

Therefore I decided to rename lzo to 1lzo. Now I have the best algorithm ever.


EDIT: it's worth noting that, of all of them, lzo is unfortunately the only one with a very restrictive license (GPL) :(

Harte answered 7/3, 2010 at 20:5 Comment(2)
+1 for an astounding discovery! It seems that this holds true also for video compression, but in the reverse direction: x264 > XviD > DivX ;) (case sensitive sorting ;))Niels
"very restrictive license (GPL)" -- /me revives age-old BSD vs GPL argument: restrictive for you the developer but not for the end user! :)Tumbling
G
5

The Linux kernel documentation explains it well (for the algorithms the kernel includes):

  • Deflate (gzip) - Fast, worst compression
  • bzip2 - Slow, middle compression
  • lzma - Very slow compression, fast decompression (though slower than gzip's), best compression

I haven't used the others, so it is hard to say, but an algorithm's speed can depend largely on the architecture. For example, there are studies showing that compressing data stored on an HDD speeds up I/O, because the processor is so much faster than the disk that the extra work pays off. However, it depends largely on where the bottlenecks are.

Similarly, one algorithm may use memory extensively, which may or may not cause problems (12 MiB -- is that a lot or very little? On an embedded system it is a lot; on a modern x86 it is a tiny fraction of memory).
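
To make that concrete, here's a small sketch with Python's built-in lzma module: higher presets use bigger dictionaries, which improves the ratio but raises memory needs on both the compressing and decompressing side (sample.bin is a hypothetical input):

import lzma
import time

data = open("sample.bin", "rb").read()  # hypothetical input

for preset in (0, 6, 9):
    packed = lzma.compress(data, preset=preset)
    start = time.perf_counter()
    lzma.decompress(packed)
    elapsed = time.perf_counter() - start
    print(f"preset {preset}: {len(packed)} bytes, decompressed in {elapsed:.3f}s")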

Goldman answered 7/3, 2010 at 19:1 Comment(2)
Note that deflate is a data format while gzip is a file format that uses deflate to compress the files.Gooseneck
The option in the Linux kernel is called gzip (for choosing how to compress the kernel/initrd), so I included it as well.Goldman
B
2

Take a look at 7zip. It's open source and contains 7 separate compression methods. In some quick testing we did, the 7z format gave a much smaller result file than zip, and it was also faster for the sample data we used.

Since our standard compression is zip, we haven't looked at the other compression methods yet.
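
As a rough stdlib-only proxy for that comparison in Python: zlib implements Deflate (what zip uses), and the lzma module is from the same family as 7z's default LZMA2. This isn't the actual container formats, just an indication of the ratio gap (sample.bin is hypothetical):

import lzma
import zlib

data = open("sample.bin", "rb").read()  # hypothetical sample data

deflated = zlib.compress(data, 9)       # roughly what zip does internally
packed = lzma.compress(data, preset=9)  # roughly what 7z's default does

print(f"deflate: {len(deflated)} bytes")
print(f"lzma:    {len(packed)} bytes")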

Brazzaville answered 7/3, 2010 at 19:3 Comment(0)
S
1

For a comprehensive benchmark on text data you might want to check out the Large Text Compression Benchmark.

For other types of data, it might still be indicative.

Settler answered 7/3, 2010 at 18:55 Comment(2)
And what about other types of data?Gooseneck
However, I should add that it generally depends on the data. (That is why it is only indicative.)Settler
T
1

One of the fastest compression algorithms these days is LZ4, reportedly reaching RAM speed limits during decompression.

On the other hand, an algorithm generally providing the best compression ratios is LZMA2, used by xz and 7z. However, two caveats: its compression is very slow, and while its decompression is reasonably fast, it is nowhere near LZ4's.

A good balance is provided by Zstandard, which is fast but can also provide ratios competitive with LZMA.

Another popular option nowadays is Brotli, which is focused more on speed than on achieving the highest compression ratio. Support for both the Zstd and Brotli Content-Encodings has recently been added to the HTTP protocol.
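
A round-trip sketch of the three codecs mentioned above, using the third-party PyPI packages lz4, zstandard and brotli (the package names and calls below are theirs at the time of writing; adjust if your bindings differ):

import brotli      # pip install brotli
import lz4.frame   # pip install lz4
import zstandard   # pip install zstandard

data = b"example payload " * 4096

codecs = [
    ("lz4",    lz4.frame.compress,                  lz4.frame.decompress),
    ("zstd",   zstandard.ZstdCompressor().compress, zstandard.ZstdDecompressor().decompress),
    ("brotli", brotli.compress,                     brotli.decompress),
]

for name, compress, decompress in codecs:
    packed = compress(data)
    assert decompress(packed) == data
    print(f"{name}: {len(data)} -> {len(packed)} bytes")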

The winner in benchmarks used to be PAQ, but it isn't widely used and I couldn't find an actively maintained implementation of it.

Toaster answered 9/4 at 14:15 Comment(0)
