Prefix Codes
As you pointed out, a Prefix Code is one where no code is a prefix of any other code.
This is a very general definition. A Huffman encoding is a restricted form of Prefix Code.
A common usage for Huffman coding is to minimize (optimize) the total bit count needed to encode a "message".
A "message" is typically a sequence of symbols, and it is encoded by mapping each symbol occurrence to
a specific prefix code and writing out that prefix code in its place. Any set of prefix codes could be used
to do this, but a Huffman encoding will result in the shortest possible encoded message, measured in bits.
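A quick sketch in Python (the symbols and codes here are made up for illustration) shows how simple the encoding step is:

```python
# Hypothetical prefix code set: no code is a prefix of any other.
codes = {'a': '0', 'b': '10', 'c': '11'}

def encode(message, codes):
    """Encode a message by replacing each symbol with its prefix code."""
    return ''.join(codes[symbol] for symbol in message)

print(encode("abca", codes))  # -> 010110
```

Because no code is a prefix of another, the output can be decoded unambiguously by reading bits until a complete code is matched.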
For example, the ASCII character set can be considered a mapping of symbols to a set of fixed-length 8 bit prefix codes.
This could even be considered a Huffman encoding, provided that the encoded message contained each possible
symbol the same number of times.
The interesting stuff starts when the message to be encoded contains symbol frequencies that are unequal. At this
point one can reduce the total bit length of the message by using prefix codes of different lengths. Use short
prefix codes for more frequent symbols and longer prefix codes for less frequent symbols.
From your example there are 8 symbols to encode. Symbols mapped to prefix codes '11' and '10' would be the most
frequent symbols in the message. Likewise, symbols mapped to '0111', '0110', '0101' and '0100' would be the least frequent.
The higher the frequency, the shorter the prefix code.
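A small Python comparison (made-up message and codes) shows how frequency-aware code lengths pay off:

```python
# Made-up message in which 'a' is far more frequent than 'b' or 'c'.
message = "aaaabc"

# Fixed-length codes: every symbol costs 2 bits.
fixed = {'a': '00', 'b': '01', 'c': '10'}

# Variable-length prefix codes: the frequent symbol gets the short code.
variable = {'a': '0', 'b': '10', 'c': '11'}

def encoded_bits(message, codes):
    """Total number of bits needed to encode the message with this code set."""
    return sum(len(codes[symbol]) for symbol in message)

print(encoded_bits(message, fixed))     # -> 12
print(encoded_bits(message, variable))  # -> 8
```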
The "trick" in creating a Huffman coding is to build the set of Prefix Codes such that, after mapping
each symbol in the message to its associated prefix code, the message contains as few bits as possible.
I find it useful to view prefix codes as a binary tree where each leaf node maps to a symbol.
For example, the binary tree corresponding to the prefix codes given in your question (10, 11, 000, 001,
0100, 0101, 0110, 0111) would be:
        +-- (11)
     +--+
     |  +-- (10)
     |
     |        +-- (0111)
  --+      +--+
     |     |  +-- (0110)
     |  +--+
     |  |  |  +-- (0101)
     |  |  +--+
     +--+     +-- (0100)
        |
        |  +-- (001)
        +--+
           +-- (000)
To get the values in brackets you just assign a '1' when the top edge is followed or a '0' if the bottom
edge is followed.
How to build such a tree?
Start with data structures to represent a binary tree and a list.
The binary tree will contain two types of node. 1) A leaf node representing
a symbol and its frequency and 2) an internal node representing the cumulative frequency
of all the nodes below it (it also needs two pointers, one for the left branch and one for the right branch).
The list contains an ordered set of nodes from the binary tree. Nodes in the list are ordered
by the frequency value of the node they point to: the lowest frequencies occur at the front of the list
and frequencies increase toward the end. A linked list of pointers to tree nodes might be a useful
implementation - but any ordered list structure will do.
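A minimal Python sketch of these two structures might look like this (the names are mine, not from any particular library):

```python
class Node:
    """Tree node: a leaf carries a symbol; an internal node carries only a
    cumulative frequency plus left/right child pointers."""
    def __init__(self, freq, symbol=None, left=None, right=None):
        self.freq = freq
        self.symbol = symbol
        self.left = left
        self.right = right

def insert_ordered(nodes, node):
    """Insert a node into the list, keeping ascending frequency order."""
    i = 0
    while i < len(nodes) and nodes[i].freq <= node.freq:
        i += 1
    nodes.insert(i, node)
```

A plain Python list is used here for simplicity; any ordered structure with cheap front removal and ordered insertion would do.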
The algorithm below employs two lists: a "reference" and a "working" list. As nodes are
processed from the "reference" list new nodes are created and inserted into the "working" list such that
the "working" list remains ordered by node frequency.
Use these data structures and the following algorithm to create a Huffman encoding.
0. Initialize the "reference" list by creating a leaf node for each symbol
then add it into this list such that nodes with the lowest frequency
occur at the front of the list and those with the highest frequency
occur at the back (basically a priority queue).
1. Initialize the "working" list to empty.
2. Repeat until the "reference" list contains 1 node
   2.1 Set MaxFrequency to the sum of the first 2 node frequencies
   2.2 Repeat until the "reference" list is empty
         If ("reference" list contains 1 node) OR
            (sum of the next two node frequencies > MaxFrequency)
            Move the remaining nodes to the "working" list
            Set the "reference" list to empty
         Else
            Create a new internal node
            Connect the first "reference" node as the left child
            Connect the second "reference" node as the right child
            Set the new node frequency to the sum of the frequencies of the children
            Insert the new node into the "working" list
            Remove the first and second nodes from the "reference" list
   2.3 Copy the "working" list to the "reference" list
   2.4 Set the "working" list to empty
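The whole procedure might be sketched in Python like this (a direct transcription of the steps above, not a battle-tested implementation; the frequencies at the end are the classic textbook example, chosen only for demonstration):

```python
class Node:
    """Leaf (symbol set) or internal node (symbol None) with a frequency."""
    def __init__(self, freq, symbol=None, left=None, right=None):
        self.freq, self.symbol = freq, symbol
        self.left, self.right = left, right

def insert_ordered(nodes, node):
    # Keep the list sorted by ascending frequency.
    i = 0
    while i < len(nodes) and nodes[i].freq <= node.freq:
        i += 1
    nodes.insert(i, node)

def build_tree(frequencies):
    # One leaf per symbol, ordered by frequency (basically a priority queue).
    reference = []
    for symbol, freq in frequencies.items():
        insert_ordered(reference, Node(freq, symbol=symbol))
    # Repeat until a single node (the root) remains.
    while len(reference) > 1:
        working = []
        max_frequency = reference[0].freq + reference[1].freq
        while reference:
            if (len(reference) == 1 or
                    reference[0].freq + reference[1].freq > max_frequency):
                for node in reference:        # move remaining nodes over
                    insert_ordered(working, node)
                reference = []
            else:
                left, right = reference[0], reference[1]
                parent = Node(left.freq + right.freq, left=left, right=right)
                insert_ordered(working, parent)
                reference = reference[2:]     # drop the two merged nodes
        reference = working                   # copy "working" back
    return reference[0]

def weighted_length(node, depth=0):
    """Sum of frequency * code length over all leaves."""
    if node.symbol is not None:
        return node.freq * depth
    return (weighted_length(node.left, depth + 1) +
            weighted_length(node.right, depth + 1))

root = build_tree({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5})
print(root.freq)              # -> 100
print(weighted_length(root))  # -> 224
```

For these frequencies the total weighted code length (224 bits) matches the known optimal Huffman result, which is a useful sanity check.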
At the end of this process the single "reference" list item will be the root of a Huffman tree. You can enumerate
prefix codes by doing a depth first traversal of the tree. Write out a '0' for every left branch
taken and a '1' for every right branch. The code is complete when a leaf is encountered. The symbol at the
leaf is encoded by the Huffman code just generated.
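The traversal might look like this in Python (the tree here is built by hand just to demonstrate the code generation):

```python
class Node:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol = symbol      # non-None only for leaves
        self.left = left
        self.right = right

def assign_codes(node, prefix=""):
    """Depth-first traversal: '0' for a left branch, '1' for a right
    branch; a code is complete when a leaf is reached."""
    if node.symbol is not None:
        return {node.symbol: prefix}
    codes = assign_codes(node.left, prefix + "0")
    codes.update(assign_codes(node.right, prefix + "1"))
    return codes

# A small hand-built tree: 'a' at depth 1, 'b' and 'c' at depth 2.
root = Node(left=Node(symbol='a'),
            right=Node(left=Node(symbol='b'), right=Node(symbol='c')))
print(assign_codes(root))  # -> {'a': '0', 'b': '10', 'c': '11'}
```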
What is an optimum encoding?
An interesting calculation one can perform is to calculate the "bit weight" of a prefix encoding. The bit weight
is the total number of bits needed to represent the set of prefix codes.
Look at your original tree above. The weight of this tree is
(2 bits * 2) + (3 bits * 2) + (4 bits * 4) = 26 bits. You used 26 bits to represent 8 prefix values. What
is the minimal number of bits you could have used? Think about it: as a tree becomes unbalanced the length
of the path to some leaves gets longer - this adds to the weight. For example, the worst case for a 4 value
prefix tree would be:
     +-- (1 bit)
  --+
     |  +-- (2 bits)
     +--+
        |  +-- (3 bits)
        +--+
           +-- (3 bits)
giving a total weight of (1 bit * 1) + (2 bits * 1) + (3 bits * 2) = 9 bits
Balance the tree:
        +-- (2 bits)
     +--+
     |  +-- (2 bits)
  --+
     |  +-- (2 bits)
     +--+
        +-- (2 bits)
giving a total weight of (2 bits * 4) = 8 bits. Notice that for balanced trees all prefix codes end up
having the same number of bits.
Tree bit weight is just the sum of the path lengths to all leaves. You minimize the bit weight
by minimizing the total path length - and this is done by balancing the tree.
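The weight calculation is easy to sketch in Python; the first code set below is the one from your question, the second is a balanced 3-bit set for the same 8 symbols:

```python
def bit_weight(codes):
    """Total bits needed to represent a set of prefix codes."""
    return sum(len(code) for code in codes)

unbalanced = ['11', '10', '0111', '0110', '0101', '0100', '001', '000']
balanced   = ['000', '001', '010', '011', '100', '101', '110', '111']

print(bit_weight(unbalanced))  # -> 26
print(bit_weight(balanced))    # -> 24
```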
As you can see, there isn't much value in minimizing any given prefix tree, you just end up with a fixed length
symbol encoding. The value comes when you consider the bit weight of the resulting encoded message. Minimizing
that leads to Huffman encoding.
How many different encodings are there?
Prefix codes may be generated by traversing a binary tree and emitting a '0' for each lower branch followed
and a '1' for each upper branch followed until a leaf is encountered. As in:
     +-- (1)
     |
  --+
     |  +-- (01)
     +--+
        +-- (00)
Alternatively we could "flip" that rule and assign a '1' for each lower branch and a '0'
for the upper branches:
     +-- (0)
     |
  --+
     |  +-- (10)
     +--+
        +-- (11)
These generate two different sets of prefix codes. Additional sets can be generated by
going through all the possible 1/0 assignments to branches and then traversing the tree.
This will give you 2^n sets, where n is the number of internal nodes. But if you do this, you will find the same prefix codes may be generated but
in different order. For example, the previous tree would yield the following sets: {(0, 10, 11), (0, 11, 10),
(1, 01, 00), (1, 00, 01)}. Then flip the tree to:
        +-- (??)
     +--+
     |  +-- (??)
  --+
     |
     +-- (?)
and you get: {(11, 10, 0), (10, 11, 0), (01, 00, 1), (00, 01, 1)}. Put them both together for 2^3 = 8 sets. However if you want unique sets disregarding order there are only 2 sets: {(0, 10, 11), (1, 00, 01)}.
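You can check this enumeration mechanically. The Python sketch below tries every 1/0 assignment at the internal nodes of the small 3-leaf tree above (one orientation only):

```python
from itertools import product

class Node:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

def internal_nodes(node):
    # Collect internal nodes; leaves (symbol set) are skipped.
    if node.symbol is not None:
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)

def codes_for(node, labels, prefix=""):
    """labels maps each internal node to the bit assigned to its left branch."""
    if node.symbol is not None:
        return {node.symbol: prefix}
    left_bit = labels[id(node)]
    right_bit = '1' if left_bit == '0' else '0'
    result = codes_for(node.left, labels, prefix + left_bit)
    result.update(codes_for(node.right, labels, prefix + right_bit))
    return result

# The 3-leaf tree: one leaf at depth 1, two at depth 2.
root = Node(left=Node(symbol='a'),
            right=Node(left=Node(symbol='b'), right=Node(symbol='c')))

inner = internal_nodes(root)
all_sets = []
for bits in product('01', repeat=len(inner)):
    labels = {id(n): b for n, b in zip(inner, bits)}
    all_sets.append(codes_for(root, labels))

print(len(all_sets))  # -> 4 (2^2 assignments for 2 internal nodes)
unique = {frozenset(s.values()) for s in all_sets}
print(len(unique))    # -> 2 unique sets when order is disregarded
```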
Go through the same exercise for a balanced tree and there is only ever 1 set. All this
leads me to believe that the number of unique encodings is related to the balance structure of the tree
used to generate prefix codes. Unfortunately, I don't have an exact formula or calculation worked out. On a hunch I would guess the number would be 2^(number of distinct code lengths - 1). For a balanced tree that is: 2^(1 - 1) = 1; for a tree with two distinct code lengths (as in the example above): 2^(2 - 1) = 2; and for your example: 2^(3 - 1) = 4.