Condition for a single-bit code for a character in a Huffman code?

This is a question I ran into in a school setting, but it keeps bugging me, so I decided to ask it here.

In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.

My question is: what is the minimum frequency of the most frequent character that guarantees it will be encoded with a single bit?

Capillary asked 21/6, 2010 at 8:8

It turns out that the answer is 0.4: if the highest frequency p satisfies p >= 0.4, a 1-bit code for the corresponding character is guaranteed. In other words, this is a sufficient condition.

It is also true that p >= 1/3 is a necessary condition. That is, there are examples where 0.4 > p >= 1/3 and the shortest code is 1 bit, but there are no such cases when p < 1/3.

The way to reason about this is to look at how the code tree is constructed, in particular at the frequencies of the last 3 surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", IEEE Transactions on Information Theory, 1980 (unfortunately behind a paywall).
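
For anyone who wants to experiment with these thresholds, here is a minimal Python sketch (my own illustration, not taken from the paper) that computes Huffman code lengths by repeatedly merging the two lightest subtrees; the helper name huffman_code_lengths is made up:

    import heapq
    import itertools

    def huffman_code_lengths(freqs):
        """Return the Huffman code length of each symbol, given its frequency.

        When frequencies tie, several different Huffman codes are valid;
        this sketch breaks ties by insertion order.
        """
        tiebreak = itertools.count()
        # Heap entries: (subtree weight, tiebreaker, indices of symbols in subtree).
        heap = [(f, next(tiebreak), [i]) for i, f in enumerate(freqs)]
        heapq.heapify(heap)
        lengths = [0] * len(freqs)
        while len(heap) > 1:
            w1, _, s1 = heapq.heappop(heap)
            w2, _, s2 = heapq.heappop(heap)
            for i in s1 + s2:  # merging adds one bit to every symbol underneath
                lengths[i] += 1
            heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
        return lengths

    print(huffman_code_lengths([0.41, 0.2, 0.2, 0.19]))  # [1, 3, 2, 3]: the p >= 0.4 symbol gets 1 bit
    print(huffman_code_lengths([0.3, 0.3, 0.2, 0.2]))    # [2, 2, 2, 2]: all p < 1/3, no 1-bit code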

Capillary answered 24/6, 2010 at 7:22
Except for the trivial binary case: if there are only 2 symbols, Huffman always assigns 1 bit to each symbol, no matter what the frequencies are. – Sara
There's a proof of this included as an appendix in this arXiv preprint. Unfortunately, I can't follow the reasoning... I don't understand why it's necessarily true that the node labelled "u" was merged before the nodes labelled "v1" and "v2". – Primer

In general, about 50% of the incoming symbol stream would have to consist of a given symbol for Huffman to code it as a single bit. The reason is the way Huffman coding works (one symbol's encoding cannot be a prefix of another's): by encoding a symbol with a single bit, you force the first bit of every other symbol to be the opposite value (i.e. if one symbol is encoded as 0, everything else must start with 1 plus at least one more bit). Since you're giving up half of the possible encoding space at any given bit length, you need that single bit to cover at least half of the incoming symbols in order to break even.
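
To make the "half of the encoding space" point concrete, here is a small Python check (my framing, not the answerer's) using the Kraft inequality, which the code lengths of any prefix-free code must satisfy:

    # Kraft inequality: a prefix-free code with lengths l_i exists
    # if and only if sum(2**-l_i) <= 1.
    def kraft_sum(code_lengths):
        return sum(2.0 ** -l for l in code_lengths)

    print(kraft_sum([1, 2, 3, 3]))  # 1.0: a 1-bit code plus 10, 110, 111 exactly fills the space
    print(kraft_sum([1, 2, 2, 2]))  # 1.25 > 1: a 1-bit code leaves no room for three 2-bit codes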

Note that there is a special case where the symbol space consists of only 3 symbols. In that case, whichever symbol has the greatest frequency will be encoded with 1 bit (the other two become the 2nd-bit variations of whichever first-bit value isn't chosen); if 2 or more are tied for the greatest probability, either one could be the one encoded. Thus, in the 3-symbol case a symbol with, say, 34% probability could be coded as a single bit (say, 0) while the other two, with 33% or lower probabilities, are coded as 10 and 11.

So if you're considering all possibilities, then technically anything 1/3 or above could potentially be encoded as a single bit (in the 3-symbol case), as the sketch below shows.
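
A tiny sketch of that 3-symbol case, using the 34/33/33 split from the answer (the symbol names are arbitrary):

    # Three symbols; Huffman merges the two least frequent ones (b and c) first,
    # so the 0.34 symbol is left at depth 1 in the tree.
    probs = {"a": 0.34, "b": 0.33, "c": 0.33}
    codes = {"a": "0", "b": "10", "c": "11"}  # prefix-free: nothing else starts with 0
    avg = sum(probs[s] * len(codes[s]) for s in probs)
    print(avg)  # ~1.66 bits/symbol, vs 2 bits/symbol for a fixed-length code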

Expatriate answered 21/6, 2010 at 8:17
One exception is 3 symbols of equal probability: they can be encoded as 0, 10, 11. – Chest
Yeah, I was already adding that in as a note. :) It's an edge case, but potentially relevant. – Expatriate
There are other, related cases (1/3, 1/6, 1/6, 1/6, 1/6), but it's not true for all cases where one symbol has a probability of 1/3. I'd love to see an answer that shows what the distinction is. – Chest
Thanks, Amber, but I am not sure your reasoning is correct. For example, for 4 characters with frequencies (0.41, 0.2, 0.2, 0.19), I believe the encoding will be (0, 10, 110, 111). This gives better compression than 4 two-bit characters: for 100 characters, 198 bits are required instead of 200 (see the check after these comments). However, I am still not sure how to generalize this idea. – Capillary
The reasoning is meant for general use cases, where the number of symbols is much greater than 4 and thus most symbols are encoded with more than 2-3 bits. When the number of symbols is low, the savings from going from 2 bits to 1 bit for a character is relatively large even at lower frequencies, whereas with a larger symbol set the savings is smaller unless the character occurs with a very high frequency. However, there is definitely a lower bound at 1/3 (except for the trivial case where the number of symbols is less than 3). – Expatriate
@Chest - I believe this paper gives some reasoning as to why some but not all cases down to p = 1/3 give single-bit symbols, but it's going way over my head, so I can't summarize. – Primer
I think your answer is wrong. As @daphshez's answer correctly asserts, if there is some character with frequency greater than 2/5, then you are guaranteed that some character will be encoded with a single bit. Furthermore, if all of the characters have frequency less than 1/3, then you are guaranteed that no character is encoded with a single bit. If neither condition is met, it is non-trivial to determine whether some character will be encoded with a single bit without building the Huffman tree. – Manic
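
A quick check of the 198-vs-200 arithmetic from Capillary's comment above:

    counts = [41, 20, 20, 19]  # occurrences per 100 characters, from the comment
    lengths = [1, 2, 3, 3]     # code lengths for (0, 10, 110, 111)
    print(sum(c * l for c, l in zip(counts, lengths)))  # 198 bits, vs 200 at 2 bits each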
