Is there an easy generalization of Huffman coding trees for situations where the resulting alphabet is not binary? For instance, if I wanted to compress some text by writing it out in ternary, I could still build up a prefix-free coding system for each character I as writing out. Would the straightforward generalization of the Huffman construction (using a k-ary tree rather than a binary tree) still work correctly and efficiently? Or does this construction lead to a highly inefficient coding scheme?
The algorithm still works and it's still simple — in fact Wikipedia has a brief reference to n-ary Huffman coding citing the original Huffman paper as a source.
It does occur to me, though, that just as Huffman is slightly suboptimal because it allocates an integer number of bits to each symbol (unlike e.g. Arithmetic coding), ternary Huffman should be a little bit more suboptimal because it has to allocate an integer number of trits. Not a show-stopper, especially for only 3, but it does indicate that n-ary Huffman will fall further behind other coding algorithms as you increase n.
As an empirical test, I constructed binary and trinary Huffman trees for the distribution of Scrabble tiles.
The entropy of the distribution shows you can't get better than 4.37 bits per letter.
The binary Huffman tree uses on average 4.41 bits per letter.
The trinary Huffman tree uses on average 2.81 trits per letter, which has the same information density as 4.45 bits per letter.
© 2022 - 2024 — McMap. All rights reserved.