I'm thinking of using a Huffman code to compress text, but with symbols of variable length (strings). For example (using an underscore as a space):
huffman-code | symbol
------------------------------------
00 | _
01 | E
100 | THE
101 | A
1100 | UP
1101 | DOWN
11100 | .
11101 |
1111...
(etc...)
How can I construct the frequency table? Obviously there are some overlapping issues, the sequence _TH
would appear neary as often as THE
, but would be useless in the table (both _
and THE
have short huffman code).
Does such an algorithm exists? Does it have a special name? What would be the tricks to generate the frequency table? Do I need to tokenize the input? I did not found anything in the litterature / web. (All this make me think also of radix trees).
I was thinking of using an iterative process:
- Generate an huffman tree for all symbols of length 1 to N
- Remove from the tree all symbols with N>1 and below a certain count threshold
- Regenerate a second huffman tree, but this time tokenizing the input with the previous one (probably using a radix tree for lookup)
- Repeat to 1 until we converge (or for a few times)
But I can't figure out how can I prevent the problem of overlaps (_TH
vs THE
) with this.