Improving scalability of the modified preorder tree traversal algorithm

Asked 26/6, 2009 at 15:25 Answered 28/6, 2009 at 21:1

I've been thinking about the modified preorder tree traversal algorithm for storing trees within a flat table (such as SQL).

One property I dislike about the standard approach is that to insert a node you have to touch (on average) N/2 of the nodes (everything with left or right higher than the insert point).

The implementations I've seen rely on sequentially numbered values. This leaves no room for updates.

This seems bad for concurrency and scaling. Imagine you have a tree rooted at the world containing user groups for every account in a large system, it's extremely large, to the point you must store subsets of the tree on different servers. Touching half of all the nodes to add a node to the bottom of the tree is bad.

Here is the idea I was considering. Basically leave room for inserts by partitioning the keyspace and dividing at each level.

Here's an example with N_max = 64 (this would normally be the MAX_INT of your DB)

                     0:64
              ________|________
             /                 \
         1:31                   32:63
        /    \                 /     \
    2:14    15-30         33:47       48:62

Here, a node is added to the left half of the tree.

                     0:64  
              ________|________
             /                 \
         1:31                  32:63
      /   |   \               /     \
  2:11  11:20  21:30     33:47       48:62

The alogorithm must be extended for the insert and removal process to recursively renumber to the left/right indexes for the subtree. Since querying for immediate children of a node is complicated, I think it makes sense to also store the parent id in the table. The algorithm can then select the sub tree (using left > p.left && right < p.right), then use node.id and node.parent to work through the list, subdividing the indexes.

This is more complex than just incrementing all the indexes to make room for the insert (or decrementing for removal), but it has the potential to affect far fewer nodes (only decendenants of the parent of the inserted/removed node).

My question(s) are basically:

Has this idea been formalized or implemented?
Is this the same as nested intervals?

Adrienadriena answered 26/6, 2009 at 15:25 Comment(0)

I have heard of people doing this before, for the same reasons, yes.

Note that you do lose at a couple of small advantages of the algorithm by doing this

normally, you can tell the number of descendants of a node by ((right - left + 1) div 2). This can occasionally be useful, if e.g. you'd displaying a count in a treeview which should include the number of children to be found further down in the tree
Flowing from the above, it's easy to select out all leaf nodes -- WHERE (right = left + 1).

These are fairly minor advantages and may not be useful to you anyway, though for some usage patterns they're obviously handy.

That said, it does sound like materialized paths may be more useful to you, as suggested above.

Hufnagel answered 28/6, 2009 at 21:1 Comment(0)

I think you're better off looking at a different way of storing trees. If your tree is broad but not terribly deep (which seems likely for the case you suggested), you can store the complete list of ancestors up to the root against each node. That way, modifying a node doesn't require touching any nodes other than the node being modified.

Teary answered 27/6, 2009 at 10:40 Comment(1)

FYI, this approach is referred to as 'materialized paths'. I believe this might be the safest approach for is for the time being. It's certainly easier to code correctly (less complex). I'm still researching though :-) – Adrienadriena 28/6, 2009 at 4:18

You can split your table into two: the first is (node ID, node value), the second (node ID, child ID), which stores all the edges of the tree. Insertion and deletion then become O(tree depth) (you have to navigate to the element and fix what is below it).

The solution you propose looks like a B-tree. If you can estimate the total number of nodes in your tree, then you can choose the depth of the tree beforehand.

Bullfight answered 26/6, 2009 at 15:53 Comment(1)

This is the adjacency list model. The problem with it is that to read the entire tree also takes O(tree depth) queries. Nested sets improve on that significantly (single query). – Adrienadriena 26/6, 2009 at 18:3

Recommended topics

Hot tags