Huffman trees for non-binary alphabets?

Is there an easy generalization of Huffman coding trees for situations where the resulting alphabet is not binary? For instance, if I wanted to compress some text by writing it out in ternary, I could still build up a prefix-free coding system for each character as I write it out. Would the straightforward generalization of the Huffman construction (using a k-ary tree rather than a binary tree) still work correctly and efficiently? Or does this construction lead to a highly inefficient coding scheme?

Velites answered 27/3, 2011 at 21:18 Comment(8)
The obvious approach is to try it out on some data with 3-ary and 4-ary trees, and compare the compression to standard Huffman encoding and to the entropy of the data. I actually kinda expect it to be a better approximation of the entropy than standard Huffman, but that's just a guess.Storytelling
Probably in that case the internal nodes of the tree will have 3 children instead of 2 and everything else will stay the same.Milzie
To whoever downvoted - can you please explain what I can do to improve this question?Velites
It was not me, but I suspect that people get frustrated when someone asks a question to which a simple Google query would return plenty of answers, and which has been asked many times before on Stack Overflow.Milzie
@Alexei: certainly I get frustrated when people downvote questions for being duplicates but don't vote to close them as duplicates.Phototransistor
Closing it as a duplicate makes sense, but how do I do that?Milzie
@Alexei: Unfortunately in your case, step 1 is "get 3000 reputation". Step 2 is "find a question that it's a dupe of", so if you provide a link to that, then others who have the rep can vote this question closed.Phototransistor
Funnily enough, three years after this conversation started, the first result I get for "non-binary Huffman" in a Google search is this very answer on Stack Overflow. Quite often downvoters react in a knee-jerk way, and that's frustrating.Aware

The algorithm still works and it's still simple — in fact Wikipedia has a brief reference to n-ary Huffman coding citing the original Huffman paper as a source.

It does occur to me, though, that just as Huffman is slightly suboptimal because it allocates an integer number of bits to each symbol (unlike e.g. Arithmetic coding), ternary Huffman should be a little bit more suboptimal because it has to allocate an integer number of trits. Not a show-stopper, especially for only 3, but it does indicate that n-ary Huffman will fall further behind other coding algorithms as you increase n.
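
For concreteness, here is a minimal Python sketch of the k-ary construction (not from the original answer; the name kary_huffman and its interface are made up for illustration). The one wrinkle beyond the binary case is that the number of symbols must satisfy (n - 1) mod (k - 1) = 0, so zero-frequency dummy symbols are padded in before merging; that pushes the single incomplete merge to the bottom of the tree instead of wasting the shortest codewords near the root.

```python
import heapq
from itertools import count

def kary_huffman(freqs, k=3):
    """Build a k-ary Huffman code for a {symbol: frequency} mapping.

    Returns {symbol: codeword}, where each codeword is a string over the
    digits '0' .. str(k - 1).
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    # Pad with zero-frequency dummy symbols so that (n - 1) % (k - 1) == 0.
    # This puts the one incomplete merge at the deepest level, where it only
    # wastes codewords on the dummies.
    n = len(freqs)
    pad = (-(n - 1)) % (k - 1) if n > 1 else 0
    tiebreak = count()  # unique tiebreaker so the heap never compares subtrees

    # Heap entries are (total frequency, tiebreaker, subtree); a subtree is a
    # leaf symbol, None for a dummy, or a list of child subtrees.
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heap += [(0, next(tiebreak), None) for _ in range(pad)]
    heapq.heapify(heap)

    # Repeatedly merge the k lowest-frequency subtrees into a new node.
    while len(heap) > 1:
        parts = [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
        total = sum(f for f, _, _ in parts)
        heapq.heappush(heap, (total, next(tiebreak), [t for _, _, t in parts]))

    # Walk the finished tree, appending one k-ary digit per edge.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, list):
            for digit, child in enumerate(tree):
                walk(child, prefix + str(digit))
        elif tree is not None:              # skip the dummy leaves
            codes[tree] = prefix or "0"     # lone-symbol edge case
    walk(heap[0][2], "")
    return codes
```

For example, kary_huffman({'a': 5, 'b': 2, 'c': 1, 'd': 1, 'e': 1}, k=3) gives 'a' and 'b' one-trit codewords and 'c', 'd', 'e' two-trit codewords, which is what you would expect from the frequencies.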

Seymore answered 27/3, 2011 at 21:27 Comment(2)
I don't quite understand why you say, "... n-ary Huffman will fall further behind other coding algorithms as you increase n". If you could elaborate on why, that'd be great!Absorb
@AnishRamaswamy Huffman coding is slightly suboptimal (when compared to e.g. range or arithmetic coding) because every codeword is an integer number of symbols (bits). That number of bits is never less than the ideal entropy of that codeword, but sometimes more. For ternary or higher n, the amount of information represented by each symbol goes up as n goes up, so the waste from "rounding up" to a whole number of symbols would also tend to increase.Seymore
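
To make the rounding-up point concrete, here is a tiny numerical illustration (the probability 0.9 is made up for the example): a symbol that ideally needs only a fraction of a bit still costs a whole output symbol, and a whole output symbol is worth more bits as n grows.

```python
import math

p = 0.9                               # assumed probability, for illustration
ideal = -math.log2(p)                 # ~0.152 bits of information
waste_binary = 1.0 - ideal            # one whole bit spent  -> ~0.85 bits wasted
waste_ternary = math.log2(3) - ideal  # one whole trit spent -> ~1.43 bits wasted
print(f"ideal {ideal:.3f} bits, binary waste {waste_binary:.2f} bits, "
      f"ternary waste {waste_ternary:.2f} bits")
```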

As an empirical test, I constructed binary and trinary Huffman trees for the distribution of Scrabble tiles.

The entropy of the distribution shows you can't get better than 4.37 bits per letter.

The binary Huffman tree uses on average 4.41 bits per letter.

The trinary Huffman tree uses on average 2.81 trits per letter, which has the same information density as 4.45 bits per letter.
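
For anyone who wants to reproduce the experiment, here is a rough sketch reusing the kary_huffman helper from the sketch in the answer above. This answer does not list the tile counts it used, so the standard English Scrabble distribution is assumed below, with the two blanks counted as a 27th symbol; that assumption gives an entropy very close to the 4.37 bits per letter quoted, though the Huffman averages may differ slightly if a different distribution was used.

```python
import math

# Assumed input: standard English Scrabble tile counts (100 tiles), with the
# two blanks counted as a 27th symbol '_'.
SCRABBLE_TILES = {
    'E': 12, 'A': 9, 'I': 9, 'O': 8, 'N': 6, 'R': 6, 'T': 6,
    'D': 4, 'L': 4, 'S': 4, 'U': 4, 'G': 3,
    'B': 2, 'C': 2, 'F': 2, 'H': 2, 'M': 2, 'P': 2,
    'V': 2, 'W': 2, 'Y': 2, '_': 2,
    'J': 1, 'K': 1, 'Q': 1, 'X': 1, 'Z': 1,
}

def entropy_bits(freqs):
    """Shannon entropy, in bits per symbol, of a {symbol: count} distribution."""
    total = sum(freqs.values())
    return sum(c / total * math.log2(total / c) for c in freqs.values())

def avg_code_length(freqs, codes):
    """Frequency-weighted mean codeword length, in code symbols."""
    total = sum(freqs.values())
    return sum(freqs[s] * len(codes[s]) for s in freqs) / total

# kary_huffman: see the sketch in the answer above.
h = entropy_bits(SCRABBLE_TILES)
bits = avg_code_length(SCRABBLE_TILES, kary_huffman(SCRABBLE_TILES, k=2))
trits = avg_code_length(SCRABBLE_TILES, kary_huffman(SCRABBLE_TILES, k=3))
print(f"entropy:         {h:.2f} bits/letter")
print(f"binary Huffman:  {bits:.2f} bits/letter")
print(f"ternary Huffman: {trits:.2f} trits/letter"
      f" = {trits * math.log2(3):.2f} bits/letter")
```

The trit-to-bit conversion is just a change of base: one trit carries log2(3) ≈ 1.585 bits, which is how 2.81 trits per letter becomes roughly 4.45 bits per letter.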

Storytelling answered 28/3, 2011 at 0:37 Comment(0)
