SQL - How to store and navigate hierarchies?
Asked Answered
S

9

49

What are the ways that you use to model and retrieve hierarchical info in a database?

Sheepshearing answered 2/9, 2008 at 4:10 Comment(0)
N
16

The definitive pieces on this subject have been written by Joe Celko, and he has worked a number of them into a book called Joe Celko's Trees and Hierarchies in SQL for Smarties.

He favours a technique called directed graphs. An introduction to his work on this subject can be found here

Nook answered 7/9, 2008 at 10:33 Comment(4)
I thought people call the technique "nested sets".Unideaed
more detailed explanation of this technique is at dev.mysql.com/tech-resources/articles/hierarchical-data.htmlDigamma
The problem I see with the Nested Sets Model is that every time you add or remove a node you have to update the whole right of the tree. That does not sound very scalable to me. Are there any benchmarks? I think filesystems do not use the Nested Sets Model to prevent traversing the whole tree, so they use the Adjacency List Model instead. @Quassnoi explains it quite well: explainextended.com/2009/09/24/…Spoil
In the book Joe explains various techniques which can reduce the frequency of tree adjustments. One such example is covered in "Chapter 5 Frequent Insertion Trees". The general technique is an application of large spreads between the left / right value at each node, allowing it's number of subordinates to change without making adjustments to the rest of the tree. This is particularly useful for very fast changing, deep hierarchies.Aglitter
O
31

I like the Modified Preorder Tree Traversal Algorithm. This technique makes it very easy to query the tree.

But here is a list of links about the topic which I copied from the Zend Framework (PHP) contributors webpage (posted there by Posted by Laurent Melmoux at Jun 05, 2007 15:52).

Many of the links are language agnostic:

There is 2 main representations and algorithms to represent hierarchical structures with databases :

  • nested set also known as modified preorder tree traversal algorithm
  • adjacency list model

It's well explained here:

Here are some more links that I've collected:

adjacency list model

nested set

Graphes

Classes :

Nested Sets DB Tree Adodb

Visitation Model ADOdb

PEAR::DB_NestedSet

PEAR::Tree

nstrees

Ottie answered 19/9, 2008 at 5:46 Comment(0)
N
16

The definitive pieces on this subject have been written by Joe Celko, and he has worked a number of them into a book called Joe Celko's Trees and Hierarchies in SQL for Smarties.

He favours a technique called directed graphs. An introduction to his work on this subject can be found here

Nook answered 7/9, 2008 at 10:33 Comment(4)
I thought people call the technique "nested sets".Unideaed
more detailed explanation of this technique is at dev.mysql.com/tech-resources/articles/hierarchical-data.htmlDigamma
The problem I see with the Nested Sets Model is that every time you add or remove a node you have to update the whole right of the tree. That does not sound very scalable to me. Are there any benchmarks? I think filesystems do not use the Nested Sets Model to prevent traversing the whole tree, so they use the Adjacency List Model instead. @Quassnoi explains it quite well: explainextended.com/2009/09/24/…Spoil
In the book Joe explains various techniques which can reduce the frequency of tree adjustments. One such example is covered in "Chapter 5 Frequent Insertion Trees". The general technique is an application of large spreads between the left / right value at each node, allowing it's number of subordinates to change without making adjustments to the rest of the tree. This is particularly useful for very fast changing, deep hierarchies.Aglitter
S
11

What's the best way to represent a hierachy in a SQL database? A generic, portable technique?

Let's assume the hierachy is mostly read, but isn't completely static. Let's say it's a family tree.

Here's how not to do it:

create table person (
person_id integer autoincrement primary key,
name      varchar(255) not null,
dob       date,
mother    integer,
father    integer
);

And inserting data like this:

person_id   name      dob       mother father  
1           Pops      1900/1/1   null   null  
2           Grandma   1903/2/4   null   null  
3           Dad       1925/4/2   2      1  
4           Uncle Kev 1927/3/3   2      1
5           Cuz Dave  1953/7/8   null   4
6           Billy     1954/8/1   null   3

Instead, split your nodes and your relationships into two tables.

create table person (
person_id integer autoincrement primary key,
name      varchar(255) not null,
dob       date
);

create table ancestor (
ancestor_id   integer,
descendant_id integer,
distance      integer
);

Data is created like this:

person_id   name      dob       
1           Pops      1900/1/1  
2           Grandma   1903/2/4   
3           Dad       1925/4/2   
4           Uncle Kev 1927/3/3
5           Cuz Dave  1953/7/8   
6           Billy     1954/8/1   

ancestor_id  descendant_id  distance
1            1              0
2            2              0
3            3              0
4            4              0
5            5              0
6            6              0
1            3              1
2            3              1
1            4              1
2            4              1
1            5              2
2            5              2
4            5              1
1            6              2
2            6              2
3            6              1

you can now run arbitary queries that don't involve joining the table back on itself, which would happen if you have the heirachy relationship in the same row as the node.

Who has grandparents?

select * from person where person_id in 
    (select descendant_id from ancestor where distance=2);

All your descendants:

select * from person where person_id in 
    (select descendant_id from ancestor 
    where ancestor_id=1 and distance>0);

Who are uncles?

select decendant_id uncle from ancestor 
    where distance=1 and ancestor_id in 
    (select ancestor_id from ancestor 
        where distance=2 and not exists
        (select ancestor_id from ancestor 
        where distance=1 and ancestor_id=uncle)
    )

You avoid all the problems of joining a table to itself via subqueries, a common limitation is 16 subsuqeries.

Trouble is, maintaining the ancestor table is kind of hard - best done with a stored procedure.

Sanfo answered 2/9, 2008 at 4:13 Comment(2)
Fine for some purposes, and I agree that nodes and edges should be kept in separate tables - but tracking how every node connects to every descendent/ancestor will not scale well for large, complex graphs - what's presented is a denormalized (for 'distance' & 'vertical' connectivity) optimization that may not be appropriate for most. See Telcontar's answer.Sundew
This is my favorite way to model hierarchies until the data gets huge. It is simple and very flexible.Pater
H
10

I've got to disagree with Josh. What happens if you're using a huge hierarchical structure like a company organization. People can join/leave the company, change reporting lines, etc... Maintaining the "distance" would be a big problem and you would have to maintain two tables of data.

This query (SQL Server 2005 and above) would let you see the complete line of any person AND calculates their place in the hierarchy and it only requires a single table of user information. It can be modified to find any child relationship.

--Create table of dummy data
create table #person (
personID integer IDENTITY(1,1) NOT NULL,
name      varchar(255) not null,
dob       date,
father    integer
);

INSERT INTO #person(name,dob,father)Values('Pops','1900/1/1',NULL);  
INSERT INTO #person(name,dob,father)Values('Grandma','1903/2/4',null);
INSERT INTO #person(name,dob,father)Values('Dad','1925/4/2',1);
INSERT INTO #person(name,dob,father)Values('Uncle Kev','1927/3/3',1);
INSERT INTO #person(name,dob,father)Values('Cuz Dave','1953/7/8',4);
INSERT INTO #person(name,dob,father)Values('Billy','1954/8/1',3);

DECLARE @OldestPerson INT; 
SET @OldestPerson = 1; -- Set this value to the ID of the oldest person in the family

WITH PersonHierarchy (personID,Name,dob,father, HierarchyLevel) AS
(
   SELECT
      personID
      ,Name
      ,dob
      ,father,
      1 as HierarchyLevel
   FROM #person
   WHERE personID = @OldestPerson

   UNION ALL

   SELECT
    e.personID,
      e.Name,
      e.dob,
      e.father,
      eh.HierarchyLevel + 1 AS HierarchyLevel
   FROM #person e
      INNER JOIN PersonHierarchy eh ON
         e.father = eh.personID
)

SELECT *
FROM PersonHierarchy
ORDER BY HierarchyLevel, father;

DROP TABLE #person;
Hessian answered 2/9, 2008 at 5:20 Comment(1)
Recursive queries aren't limted to SQL Server. Postgres, Oracle, DB2, Firebird and many more can do the same thing.Goar
C
4

FYI: SQL Server 2008 introduces a new HierarchyID data type for this sort of situation. Gives you control over where in the "tree" your row sits, horizontally as well as vertically.

Cragsman answered 2/9, 2008 at 4:15 Comment(0)
F
3

Oracle: SELECT ... START WITH ... CONNECT BY

Oracle has an extension to SELECT that allows easy tree-based retrieval. Perhaps SQL Server has some similar extension?

This query will traverse a table where the nesting relationship is stored in parent and child columns.

select * from my_table
    start with parent = :TOP
    connect by prior child = parent;

http://www.adp-gmbh.ch/ora/sql/connect_by.html

Fungicide answered 2/9, 2008 at 4:18 Comment(0)
L
2

I prefer a mix of the techinques used by Josh and Mark Harrison:

Two tables, one with the data of the Person and other with the hierarchichal info (person_id, parent_id [, mother_id]) if the PK of this table is person_id, you have a simple tree with only one parent by node (which makes sense in this case, but not in other cases like accounting accounts)

This hiarchy table can be transversed by recursive procedures or if your DB supports it by sentences like SELECT... BY PRIOR (Oracle).

Other posibility is if you know the max deep of the hierarchy data you want to mantain is use a single table with a set of columns per level of hierarchy

Ligroin answered 2/9, 2008 at 6:54 Comment(0)
L
1

We had the same issue when we implemented a tree component for [fleXive] and used the nested set tree model approach mentioned by tharkun from the MySQL docs.

In addition to speed things (dramatically) up we used a spreaded approach which simply means we used the maximum Long value for the top level right bounds which allows us to insert and move nodes without recalculating all left and right values. Values for left and right are calculated by dividing the range for a node by 3 und use the inner third as bounds for the new node.

A java code example can be seen here.

Lanthanum answered 24/11, 2008 at 13:16 Comment(0)
H
0

If you're using SQL Server 2005 then this link explains how to retrieve hierarchical data.

Common Table Expressions (CTEs) can be your friends once you get comfortable using them.

Hessian answered 2/9, 2008 at 4:18 Comment(1)
CTE is not so friendly finding children of parents in the performance arena. However finding parents from child is acceptableEarwitness

© 2022 - 2024 — McMap. All rights reserved.