We have to write the nodes of a binary tree to a file. What is the most space efficient way of writing a binary tree . We can store it in array format with parent in position i
and its children in 2i
, 2i+1
. But this will waste lot of space in case of sparse binary trees.
One method which I like is to store the preorder traversal, but also include the 'null' nodes in there. Storing the 'null' nodes removes the need for also storing the inorder of the tree.
Some advantages of this method
- You can do better storage than pre/post + inorder method in most practical cases.
- Serialization just takes one traversal
- Deserialization can be done in one pass.
- The inorder traversal can be gotten in one pass without constructing the tree, which might be useful if the situation calls for it.
For example say you had a binary tree of 64 bit integers, you can store an extra bit after each node saying whether the next is a null node or not (the first node is always the root). Null nodes, you can represent by a single bit.
So if there are n nodes, the space usage would be 8n bytes + n-1 indicator bits + n+1 bits for null nodes = 66*n bits.
In the pre/post + inorder you will end up using 16n bytes= 128*n bits.
So you save a space of 62*n bits over this pre/post + inorder method.
Consider the tree
100
/ \
/ \
/ \
10 200
/ \ / \
. . 150 300
/ \ / \
. . . .
where the '.' are the null nodes.
You will serialize it as 100 10 . . 200 150 . . 300 . .
Now each (including subtrees) 'preorder traversal with null' has the property that number of null nodes = number of nodes + 1.
This allows you to create the tree, given the serialized version in one pass, as the first node is the root of the tree. Nodes that follow are the left subtree followed by right, which can be viewed to be like this:
100 (10 . .) (200 (150 . .) (300 . .))
To create the inorder traversal, you use a stack and push when you see a node and pop (onto a list) when you see a null. The resulting list is the inorder traversal (a detailed explanation for this can be found here: C++/C/Java: Anagrams - from original string to target;).
The 2i, 2i+1 (Binary Heap) method is indeed the best way if you have a (nearly) complete tree.
Otherwise you won't escape storing a ParentId (parent Index) with each node.
Think about XML. It's a kind of tree serialization. For example:
<node id="1">
<node id="2"> 1
</node> / \
<node id="3"> 2 3
<node id="4"> / \
</node> 4 5
<node id="5">
</node>
</node>
</node>
Then, why the spaces and tags ? We can omit them, step by step:
<1>
<2></>
<3>
<4></>
<5></>
</>
</>
Remove the spaces: <1><2></2><3><4></><5></></></>
.
Remove the angle brackets: 12/34/5///
Now the problem is: what if a node has a empty left subtree and non-empty right subtree? Then we can use another special charactor, '#' to represent an empty left sub-tree.
For example:
1
/ \
2
/ \
3
This tree can be serialized as: 1#23///
.
You can save the in-order
and pre/post-order
traversal of the binary tree in the file and reconstruct the tree from these traversals.
Since only 2ⁿ-1
items can be stored without any space wasted, you'll have to split the N elements into subtrees, stored sequentially, and keeping a “master index” of the values between trees to decide in which tree to search in. An obvious way to decide which will be the subtrees, is to take advantage of the binary format of a number, and here I'm talking about N
, the count of items in the tree.
Let's assume you have a sorted sequence of 20 distinct items:
['aeu', 'bfz', 'cdi', 'dfc', 'eap', 'ggk', 'gsb', 'guj', 'idm', 'ieg',
'izr', 'pba', 'plp', 'rap', 'rhp', 'tat', 'uho', 'uwb', 'wdf', 'yhp']
20₁₀ is 10100₂ and the 1-bits decide the splits.
The first entry in your master table will be item[15], 0
(15₁₀≡01111₂ one less than the largest power of 2 less than or equal to N, and 0 being the length of your output list so far aka the start index for the subtree ). The previous 15 (0–14) items will be appended to your output list, again by taking advantage of binary arithmetic: item[7₁₀≡0111₂] at 0, item[3₁₀≡0011₂] at 1, item[11₁₀≡1001₂] at 2 and so on, thus preserving the rule that item[i] < item[2*i] && item[i] > item[2*i+1]
.
After this first step:
master_table = [('tat', 0)]
output_list = ['guj', 'dfc', 'pba', 'bfz', 'ggk', 'ieg', 'rap', 'aeu', 'cdi', 'eap', 'gsb', 'idm', 'izr', 'plp', 'rhp']
If N == 2ⁿ then this python generator produces the desired order:
def heap_order(pow2):
"""Return the tree order of items
pow2 MUST be a power of two"""
bit= pow2>>1
while bit:
yield from range(bit - 1, pow2, bit*2)
bit>>= 1
Delete (or ignore) the first 16 items from your list and repeat with the next 1-bit in the list length:
The remaining input list is ['uho', 'uwb', 'wdf', 'yhp']
, and its length is 4₁₀==100₂. Largest 2ⁿ-1 is 3, so you append zero-based item 3 along with the current length of the output list to your master table, and then append the previous 3 items to your output list resulting in:
master_table = [('tat', 0), ('yhp', 15)]
# master table items: (key, subtree_base_index)
output_list = [
'guj', 'dfc', 'pba', 'bfz', 'ggk',
'ieg', 'rap', 'aeu', 'cdi', 'eap',
'gsb', 'idm', 'izr', 'plp', 'rhp',
'uwb', 'uho', 'wdf'
]
The whole result is equivalent to a binary tree where all left subtrees have either leaf nodes or tree nodes with both children.
To search for a value: scan the master table to find the key larger than your search key, and search your key in the corresponding subtree, always using index - subtree_base_index
adjustments in your 2*index
, 2*index-1
calculations. The maximum total (master table + subtree) count of comparisons needed for failed search is usually ⌈log₂N⌉, and sometimes —when the search key is larger than the first entry of the master table— less than that (no subtrees are created for 0-bits of N).
You should also store the level count of each subtree in the master table (so that you know after how many searches you should stop searching the subtree).
You can store the level-order, aka DFS output, including any null children. When reconstructing the tree from the list of values and nulls, use a queue of nodes with the knowledge that the next two items in the list are always the left and right of the current node from the queue.
This still requires storing a total of n+1 "null signifiers", but deserializing is done iteratively as opposed to recursively in the case of preorder.
POC in JS:
class Node {
constructor(val, left=null, right=null) {
this.val = val;
this.left = left;
this.right = right;
}
}
function dfs(root) {
const queue = [root];
const output = [];
while (queue.length > 0) {
const curr = queue.pop();
// if curr is null, push null, else curr.val
output.push(curr && curr.val);
if (curr !== null) {
queue.unshift(curr.left);
queue.unshift(curr.right);
}
}
return output;
}
function undfs(list) {
const root = new Node(list.shift());
const queue = [root];
while (queue.length > 0) {
const curr = queue.shift();
const [leftVal, rightVal] = list.splice(0,2);
if (leftVal) {
curr.left = new Node(leftVal);
queue.push(curr.left);
}
if (rightVal) {
curr.right = new Node(rightVal);
queue.push(curr.right);
}
}
return root;
}
const root =
new Node(100,
new Node(10),
new Node(200,
new Node(150),
new Node(300)
)
);
const dfsOutput = dfs(root);
console.log("List output:");
console.log(dfsOutput.map(a => a || "null").join(", "));
console.log("\nTree output:");
console.log(undfs(dfsOutput));
© 2022 - 2024 — McMap. All rights reserved.