Under what conditions does Huffman encoding make a string not compressible? Is it when all the characters appear with equal frequency/probability? And if so, how can one show this is true?
In a nutshell, Huffman encoding assigns shorter codes to more probable symbols and longer codes to less probable ones. If all symbols are equally likely, there is no real advantage: whatever you save with the shorter codes is given back by the equally likely longer codes.
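A minimal sketch in Python (standard library only; the function name is my own) illustrates the effect: with equal frequencies, Huffman assigns every symbol the same code length, which is exactly what a fixed-length code would use, so nothing is saved, while skewed frequencies spread the lengths out.

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from freqs."""
    tiebreak = itertools.count()            # avoids comparing symbol lists on weight ties
    heap = [(w, next(tiebreak), [s]) for s, w in freqs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)  # two least probable subtrees
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:             # every symbol in the merged subtree
            lengths[s] += 1                 # sinks one level deeper in the tree
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths

# Four symbols, all equally likely: every code is 2 bits, the same as a
# plain fixed-length 2-bit code, so Huffman buys nothing here.
print(huffman_code_lengths({"a": 25, "b": 25, "c": 25, "d": 25}))

# Skew the frequencies and the code lengths spread out (1, 2, 3, 3 bits),
# for an average of 1.6 bits per symbol instead of 2.
print(huffman_code_lengths({"a": 60, "b": 20, "c": 10, "d": 10}))
```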
You can calculate a simple zero-order entropy for a sequence of symbols, which will tell you whether you even have a chance of significant compression with just Huffman coding. (I wish stackoverflow had TeX formatting like math.stackexchange.com does. I can't write decent equations here.)
Count how many different symbols you have and call that n, with the symbols numbered 1..n. Compute the probability of each symbol, which is how many times each symbol occurs divided by the length of the sequence, and call that p(k). Then the best you can do with zero-order coding is an average number of bits per symbol equal to: -sum(p(k)log(p(k)),k=1..n)/log(2). Then you can compare the result to log(n)/log(2) which is what the answer would be if all the probabilities were equal (1/n) to see how much the unequal probabilities could buy you. You can also compare the result to, for example, 8, if you are currently storing the symbols as a byte each (in which case n <= 256).
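As a sketch of that calculation in Python (standard library only; the function name and the placeholder text are mine, substitute your own data):

```python
import math
from collections import Counter

def zero_order_entropy(data):
    """Average bits per symbol achievable by an ideal zero-order coder."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "this is just a placeholder sequence; use your own data here"
h = zero_order_entropy(text)
n = len(set(text))
print(f"zero-order entropy:            {h:.3f} bits/symbol")
print(f"equal-probability bound log2(n): {math.log2(n):.3f} bits/symbol")
print(f"current storage:               8 bits/symbol")
```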
A Huffman code will use at least as many bits per symbol as that entropy. You also need to take into account how you will convey the Huffman code to the receiver: you will need some sort of header describing the code, which will take more bits. An arithmetic or range code can get closer to the entropy than a Huffman code, especially for very long sequences.
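A small worked check (the distribution is chosen arbitrarily for illustration) shows the gap. For probabilities 0.4, 0.3, 0.2, 0.1 the Huffman construction gives code lengths of 1, 2, 3, 3 bits (for example the codes 0, 10, 110, 111), and the average slightly exceeds the entropy:

```python
import math

# Arbitrary example distribution; Huffman code lengths for it are 1, 2, 3, 3.
probs = [0.4, 0.3, 0.2, 0.1]
lengths = [1, 2, 3, 3]

entropy = -sum(p * math.log2(p) for p in probs)            # about 1.846
huffman_avg = sum(p, * () if False else p * l for p, l in zip(probs, lengths)) if False else sum(p * l for p, l in zip(probs, lengths))  # 1.900

print(f"entropy:         {entropy:.3f} bits/symbol")
print(f"Huffman average: {huffman_avg:.3f} bits/symbol")
```

An arithmetic or range coder can, in principle, approach the 1.846 figure, which is why it is mentioned above for long sequences.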
In general, a Huffman code by itself will not produce very satisfying compression. A quick test on the 100M character English text test file enwik8 gives an entropy of about five bits per symbol, as does Huffman coding of the text. Huffman (or arithmetic or range) coding needs to be used in combination with a higher-order model of the input data. These models can be simple string matching, like LZ77 as used in deflate or LZMA, a Burrows-Wheeler transform, or prediction by partial matching. An LZ77 compressor, in this case gzip, gets less than three bits per symbol.
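You can reproduce this kind of measurement on any text you have at hand. The sketch below uses Python's zlib (a deflate implementation, i.e. LZ77 plus Huffman coding) on a placeholder file path; the exact numbers depend entirely on the data (the figures above were measured on enwik8).

```python
import zlib

# Replace this path with any text file you want to measure (the figures in
# the answer come from enwik8, the first 100 MB of English Wikipedia text).
with open("sample.txt", "rb") as f:
    data = f.read()

compressed = zlib.compress(data, 9)          # deflate: LZ77 + Huffman coding
bits_per_symbol = 8 * len(compressed) / len(data)
print(f"{len(data)} bytes -> {len(compressed)} bytes "
      f"({bits_per_symbol:.2f} bits/symbol)")
```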
I can't resist including a picture of Boltzmann's gravestone, engraved with his formula connecting entropy to probability, S = k. log W, which is essentially the formula above.
Two factors come to my mind:
- If the symbols occur with similar probabilities, then little compression will be possible.
- If you try to compress a small input (say, a short text), then the overhead of attaching the Huffman look-up table (a.k.a. the dictionary; you need it to decode the compressed file) can make the final size even bigger than the original input, as the sketch after this list illustrates.
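A quick illustration of that overhead point, using Python's zlib purely as a stand-in (its fixed cost comes from stream headers and a checksum rather than an explicit Huffman table, but the effect on tiny inputs is the same):

```python
import zlib

short = b"hello, world"
compressed = zlib.compress(short)

# For inputs this small the fixed overhead dominates, so the "compressed"
# stream comes out larger than the original.
print(len(short), "bytes in,", len(compressed), "bytes out")
```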