Managing hierarchies in SQL: MPTT/nested sets vs adjacency lists vs storing paths
Asked Answered
O

2

15

For a while now I've been wrestling with how best to handle hierarchies in SQL. Frustrated by the limitations of adjacency lists and the complexity of MPTT/nested sets, I began thinking about simply storing key paths instead, as a simple node_key/node_key/... string. I decided to compile the pros and cons of the three techniques:

Number of calls required to create/delete/move a node:

  • Adjacency = 1
  • MPTT = 3
  • Path = 1 (Replace old node path with new node path across all nodes that contain that path)

Number of calls required to get a tree:

  • Adjacency = [number of sub-levels]
  • MPTT = 1
  • Path = 1

Number of calls required to get path to a node / ancestry:

  • Adjacency = [number of super-levels]
  • MPTT = 1
  • Path = 0

Number of calls required to get number of subnodes:

  • Adjacency = [number of sub-levels]
  • MPTT = 0 (Can be calculated from right/left values)
  • Path = 1

Number of calls required to get depth of node:

  • Adjacency = [number of super-levels]
  • MPTT = 1
  • Path = 0

DB fields required:

  • Adjacency = 1 (parent)
  • MPTT = 3 (parent,right,left)
  • Path = 1 (path)

Conclusion

The stored path technique uses the same or less calls than the other techniques in every use case except one. By this analysis, storing paths is a clear winner. Not to mention, it's a lot simpler to implement, human readable, etc.

So the question is, shouldn't stored paths be considered a stronger technique than MPTT? Why are stored paths not a more commonly used technique, and why would you not use them over MPTT in a given instance?

Also, if you think this analysis is incomplete please let me know.

UPDATE:

Here are at least 2 things MPTT can do out of the box that a stored path solution won't:

  1. Allows calculation of subnode count for each node without any additional queries (mentioned above).
  2. Imposes an order on nodes at a given level. The other solutions are unordered.
Obla answered 19/11, 2011 at 18:18 Comment(3)
It is actually not true that you need [number of sub-levels] of calls in an adjacency model. You can do something like : SELECT * FROM adjacency a /* for($number_of_levels_required.. $prevname = $level > 1 ? 'a'+$level-1 : 'a'; */ INNER JOIN adjacency a/*$level*/ ON a/*$level*/.parent_id = /*$prevname*/.id /* endfor*/Martsen
How do you query a child & all descendants respectively, in a Path. I'd think the biggest drawback against Path is referential integrity as Bill Karwin mentions in his answer.Johannejohannes
@buffer - "How to query child & all descendants"? That's the easiest part of using paths: SELECT * FROM items WHERE path LIKE 'my/child/path%';. Yes, referential integrity suffers but the trade off is much more efficient queries.Obla
J
13

You might also consider the Closure Table design I describe in my answer to What is the most efficient/elegant way to parse a flat table into a tree?

Calls required to create/delete/move a node:

  • Closure = 1

Calls required to get a tree:

  • Closure = 1

Calls required to get path to a node / ancestry:

  • Closure = 1

Calls required to get number of subnodes:

  • Closure = 1

Calls required to get depth of node:

  • Closure = 1

DB fields required:

  • Adjancency = 1 more field / row
  • Path = 1 more field / row
  • MPTT = 2 or 3 more fields / row
  • Closure = 2 or 3 fields in extra table. This table has O(n^2) rows worst case but far fewer than that in most practical cases.

There are a couple of other considerations:

Supports unlimited depth:

  • Adjacency = yes
  • MPTT = yes
  • Path = no
  • Closure = yes

Supports referential integrity:

  • Adjacency = yes
  • MPTT = no
  • Path = no
  • Closure = yes

I also cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP, and my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.

Jaynajayne answered 19/11, 2011 at 19:18 Comment(11)
Why does path not support unlimited depth?Eggcup
Bill- Thanks, never heard about closure tables- Looks like very interesting stuff and I will investigate. One thing- why would you say Path technique doesn't support unlimited depth- are you saying its a function of the allowed string length in the field?Obla
Yes, the maximum depth of the Path design is constrained by the length of your string that you use to store the path. So for example if your node identifiers are 9-digit numbers, a VARCHAR(255) can store a path of depth 25 (assuming you use a one-character separator).Jaynajayne
@BillKarwin - He could just use a varchar(max), abstractly there is no limit (IMO), we could say the other ones are limited by the size of the disk and are not unlimited.Eggcup
But I acknowledge that the other designs also have finite limits on their depth, at least within the finite range of the data types you use for the node id's or MPTT left/right fields. And disk space, as you say. Still, the hard limit on path length is a constraint that the other designs do not have.Jaynajayne
Bill, do you advise to record only "immediate" parent-child relationships, or do you advise to record the entire closure ? The former requires recursive SQL when querying (not the easiest kind of thing to write, and maybe the majority of engines -the cheap ones- simply don't support it), and the latter implies a huge integrity enforcement problem.Madonia
The only design with no integrity fragility is Adjacency List (the "immediate parent" design). It depends on what brand of RDBMS you use. If you use Microsoft, Oracle 11, DB2, PostgreSQL or Apache Derby, you can run recursive CTE queries. If you run MySQL, SQLite, Informix, Firebird, etc. you don't have support for recursive queries. MPTT, Path Enumeration, and Closure all have some integrity enforcement issues.Jaynajayne
one of the big concern of closure tables is that it requires O(n^2) rows in a separate tableMicrostructure
@dnozay, yes, but only in the absolute worst case. Practical examples will be much less. I have a hierarchy with ~500k nodes, which needs only ~4.6m rows in the closure table. This is nowhere near 500k squared.Jaynajayne
Different example: a linked list of 500k elements would need roughly 500k * (500k + 1) / 2, that's about 125 billion rows; half of 500k squared.Microstructure
@dnozay, yes, but I said practical examples. If you are modeling linked lists of 500k elements, a transitive closure is probably not the right data structure. There are pros and cons to every solution.Jaynajayne
E
3

It problem with your conclusion is that it ignores most of the issues involved in working with trees.

By reducing the validity of a technique to the "number of calls" you effectively ignore all of the issues which well understood data structures and algorithms attempt to solve; that is, fastest execution and low memory and resource foot print.

The "number of calls" to an SQL server may seem like a good metric to use ("look ma less code"), but if the result is a program which never finishes, runs slowly, or takes up to much space, it is in fact a useless metric.

By storing the path with every node you are not creating a tree data structure. Instead you are creating a list. Any operation which a tree is designed to optimize is lost.

This might be hard to see with small date sets (and in many cases of small trees a list is better), try some examples on data sets of size 500, 1000, 10k -- You will quickly see why storing the whole path is not a good idea.

Eggcup answered 19/11, 2011 at 19:18 Comment(4)
@Hogan- what do you mean by "You will quickly see why storing the whole path is not a good idea"? What will I see? And "Any operation which a tree is designed to optimize is lost." What does that mean? These are vague statements- can you provide examples?Obla
@Obla - 3 year old post... I'll try, with a quick review: Storing the whole path creates a huge storage requirement increasing the storage size exponentially. In terms of storage a tree structure allows you to distribute the data over the nodes of a tree thus storing data only once for the permutations on a path though the tree.Eggcup
@Hogan- Assuming you mean in relation to adjacency list (both MPTT and closure techniques would require extra rows and therefore larger overall storage requirement). Still, I don't see the huge growth problem, it's just a single string field instead of an integer one. Even if you had depth-levels of 20 you'd still be able to fit all your tree paths comfortably inside a VARCHAR(255) up to 10 billion rows.Obla
@Obla - The last system I saw implementing a tree like this started to "die" at depth of about 6 and not more than 50,000 rows -- of course the programmer was not great and it was not my project so it might have just been incompetence (I was not able to analyse it in depth - pun intended). If your method is working, great.Eggcup

© 2022 - 2024 — McMap. All rights reserved.