Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.
Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.
So for each node, starting at root:
- If leaf-node: Output 1-bit + N-bit character/byte
- If not leaf-node, output 0-bit. Then encode both child nodes (left first then right) the same way
To read, do this:
- Read bit. If 1, then read N-bit character/byte, return new node around it with no children
- If bit was 0, decode left and right child-nodes the same way, and return new node around them with those children, but no value
A leaf-node is basically any node that doesn't have children.
With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.
Pseudo-code for calculation:
Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))
The tree-size calculation takes the leaf and non-leaf nodes into account, and there's one less inline node than there are characters.
SIZE_OF_ONE_CHARACTER would be number of bits, and those two would give you the number of bits total that my approach for the tree + the encoded data will occupy.
PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.
void EncodeNode(Node node, BitWriter writer)
{
if (node.IsLeafNode)
{
writer.WriteBit(1);
writer.WriteByte(node.Value);
}
else
{
writer.WriteBit(0);
EncodeNode(node.LeftChild, writer);
EncodeNode(node.Right, writer);
}
}
To read it back in:
Node ReadNode(BitReader reader)
{
if (reader.ReadBit() == 1)
{
return new Node(reader.ReadByte(), null, null);
}
else
{
Node leftChild = ReadNode(reader);
Node rightChild = ReadNode(reader);
return new Node(0, leftChild, rightChild);
}
}
An example (simplified, use properties, etc.) Node implementation:
public class Node
{
public Byte Value;
public Node LeftChild;
public Node RightChild;
public Node(Byte value, Node leftChild, Node rightChild)
{
Value = value;
LeftChild = leftChild;
RightChild = rightChild;
}
public Boolean IsLeafNode
{
get
{
return LeftChild == null;
}
}
}
Here's a sample output from a specific example.
Input: AAAAAABCCCCCCDDEEEEE
Frequencies:
Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.
The tree could look like this:
20
----------
| 8
| -------
12 | 3
----- | -----
A C E B D
6 6 5 1 2
So the paths to each character is as follows (0 is left, 1 is right):
- A: 00
- B: 110
- C: 01
- D: 111
- E: 10
So to calculate the output size:
- A: 6 occurrences * 2 bits = 12 bits
- B: 1 occurrence * 3 bits = 3 bits
- C: 6 occurrences * 2 bits = 12 bits
- D: 2 occurrences * 3 bits = 6 bits
- E: 5 occurrences * 2 bits = 10 bits
Sum of encoded bytes is 12+3+12+6+10 = 43 bits
Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 bytes necessary to store the original 20 characters unencoded, you'll save 8 bytes.
The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 is just a single bit. The space in the stream is just to separate the tree from the encoded data and does not take up any space in the final output.
001A1C01E01B1D 0000000000001100101010101011111111010101010
For the concrete example you have in the comments, AABCDEF, you will get this:
Input: AABCDEF
Frequencies:
- A: 2
- B: 1
- C: 1
- D: 1
- E: 1
- F: 1
Tree:
7
-------------
| 4
| ---------
3 2 2
----- ----- -----
A B C D E F
2 1 1 1 1 1
Paths:
- A: 00
- B: 01
- C: 100
- D: 101
- E: 110
- F: 111
Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes
Since the original was 7 characters of 8 bits = 56, you will have too much overhead of such small pieces of data.