How can two hierarchies be efficiently merged in SQL Server?
Asked Answered
R

3

8

I have two tables with hierarchyid fields, one of which is a staging table with new data that needs to be merged into the other (that is, a set of nodes that need to be added to the main tree, some of which might already be there).

In addition to the hierarchyid column that defines the tree structure (parent/child relationships). each table has a separate column which contains a node identifier that uniquely identifies each node. That is, the way to tell if a node from the staging table is already in the main table is via the node ID, not via the hierarchyid columns.

Imperatively, the processing that needs to be performed would look something like this:

For each row, RS, in the staging table:
    If there is not already a row with the same Id as RS in the main table:
         Find the parent, PS, of the staging row
         Find the row, PM, in the main table that has the same node ID as PS
         Create a new child, RM of row PM
         Set PM's ID equal to the ID of RS

Importantly, this approach will only work if the tree in the staging table is sorted/traversed in breadth-first order - this is so that when RS is encountered, it is guaranteed that its parent PS already has a corresponding row in the main table.

So far the only way I can see to achieve this in SQL server is to use a cursor over the staging table (which is already sorted) and call a stored procedure for each row that essentially does exactly what is described above, complete with a SELECT MAX() to find the highest hierarchyid that already exists as a child of PM, so that the child can be added uniquely.

This is an awfully inefficient approach, though, and waaaay too slow for my purposes. Is there any better way?

For background, this is kind of a feasibility check I'm doing. I need to figure out whether I can perform this operation quickly inside SQL Server. If it turns out I can't I will have to do it another way, outside the database. The merging of the trees is inherent to (actually, in a sense kind of is) the problem domain, so structuring the data differently or taking a wider view and trying to somehow avoid performing this operation altogether is not an option.

Update

As requested, here's a concrete example.

Tables "staging" and "main" both have the same two columns:

   hierarchy_id of type hierarchyid
   node_id of type bigint

Initial contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4

staging:

 hierarchy_id    node_id
 /1/             1
 /1/1/           3
 /1/2/           5
 /1/1/1/         6

Desired contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4
 /1/4/           5
 /1/2/1/         6

Note that the node in the staging table with hierarchy_id /1/1/ corresponds to that with hiearchy_id /1/2/ in the target table (which is why the node_id is important - can't just copy hierarchy_id values across). Also note that the new node with node_id 6 is added as a child of the correct parent, the one with node_id 3, which is why the hierarchy_id is important - it defines the tree structure (parent/child relationships) for any new nodes. Any solution needs to take account of both aspects.

Rain answered 14/8, 2011 at 17:2 Comment(10)
Would a valid first step be to remove all rows that already have the same Node ID existing in the target table? Seems like that would eliminate the need for dupe checking during the merge processMclin
@OMG Ponies, there isn't a child table and a parent table. There are two tables, each with a hierarchyid column (msdn.microsoft.com/en-us/library/bb677290.aspx). The node ID establishes equivalence of nodes in the two tables, but where a node doesn't already exist in the target table it isn't enough just to copy its node ID across - it also needs to be added to the appropriate place in the hierarchy defined by the target table's hierarchyid column.Rain
Incidentally, I'm open to solutions that don't use hierarchyid. However, my understanding is that handling arbitrarily deep trees without them requires Common Table Expressions, which seem syntactically hideous and the sources I've read suggest they are slower than hierarchyids.Rain
@Neil N, I could remove such records from the source table, though it might complicate some of the stuff I have to do later. The dupe checking isn't really the problem, though - if I can achieve this with a MERGE statement that would be great.Rain
I think it would help if you gave us an example table structure and expected outcome. I'm having trouble understanding why you can't do this with a simple MERGE statement; are you not allowed to change the hierarchyid or would you have duplicate hierarchyids that way? I've built a self-maintaining hierarchyid/adjacency list hybrid that might be what you're looking for, but it's a little hard to tell from the question.Sumptuary
@Aaronought - I've added an example. It may well be possible to do it with a simple MERGE statement (I'm no expert in SQL). However, I can't simply merge the hierarchyid column. It wouldn't be valid to copy the hierarchyid values across, as the ones in the staging table only define the tree structure within the staging table. Essentially, for nodes that already exist, nothing needs doing. Any branch that doesn't already exist needs to be copied with its hierarchical structure intact and spliced onto the matching node in the target. Hope this makes it clearer - it's tricky to describe.Rain
Does the staging table only have "additions" of nodes or could it be used to redefine a parent for an already existing node? Like adding a new root node for the entire tree?".Ftlb
@Mikael Eriksson, the staging table will contain only new branches and nodes that already exist in the target table. Further, there will be no parentless nodes in the staging table. Every node in the staging table will have an intact path of nodes up to the root present along with it.Rain
Hmm. Ok I will try ask the question again because I am still not sure. Can there be node in the staging table that already exist in the target table but the parent node for that node is different in staging and target? That means that the existing node in the target should be updated with a new parent?Ftlb
Ah, sorry - misunderstood your question. No, that is not possible.Rain
N
3

Modeling your hierarchy in this way will lead to problems. The hierarchy_id column violates first normal form and the process for merging is going to be prone to update anomalies if you don't serialize/bottleneck access.

You should consider a table with just node_id and parent_id, see how it trivializes your merge problem

node_id   parent_id
1         NULL
2         1
3         2
4         3

node_id   parent_id
1         NULL
3         1
5         2
6         1

You would use recursive queries with this and you might be surprised how efficient the execution plans turn out. If you must have the flattened hierarchy column you can probably create an indexed view with a recursive query.

Naaman answered 14/8, 2011 at 18:33 Comment(2)
Interesting thought. My merges are serialized so I'm not too worried on that score, but that does look like it would make the merging much easier. I guess it comes at the cost of making the querying trickier and slower. I'll give it a try, thanks!Rain
That's certainly much faster, though not fast enough for my use case. At least with this approach I think I'm not artificially hobbling SQL server. If a straight merge like this is too slow, I suspect I have to look elsewhere.Rain
M
3

We have been working on a product that required a similar solution. After a lot of research on this and other approaches, we concluded that hierarchyID method is not for us.

So as a direct answer to your question: There is no better way to do this using this approach.

Take a look at Nested Set Models and at Adjacency List Model.

Both of these are far more elegant and efficient solutions to this particular design challenge.

Edit: As an after-thought, in-case you are not married to SQL - this problem can be much better solved using a non-relational database. We could not go that way since no one has enough expertise in designing non-relational databases, but in case SQL is optional then you can use your current approach in a much nicer, more efficient way in MongoDB for example.

Marita answered 14/8, 2011 at 18:9 Comment(4)
Thanks - I am leaning this way. I have a prototype using a non-relational database (based on HDF5), but as it's a conservative corporate environment I want to make sure I've given the more traditional/supported technologies a fair crack of the whip.Rain
Thanks for the links, btw. The nested set model looks problematic for me as it will require renumbering the target tree each time a new set of nodes is merged in. For the adjacency list model to allow arbitrarily deep hierarchies (which I have), I believe I would need Common Table Expressions, which look hideous. The documentation I've seen on hierarchy IDs claims they are faster than Common Table Expressions.Rain
Hey,i hear you man. Conservative clients kill my time. Why is it problematic to renumber when merging? Also, see this link on joe's adjacency lists. The create of the schema has written a book, an excerpt from which describes the fixes that you can apply to the simple adjacency schema to make it more durable. Also, I see no need for CTEs here, am I missing something?Marita
@Tom: If you have to stick to this, I have a working solution which I am not very proud of. I kept a single table. No auto-increment PK. No Parent ID, no child ID. It was simple. I store nodeID as a string which comes from a regular ordered list. So I had nodeID = 1, nodeID = 1.1, nodeID = 1.1.1, nodeID = 1.1.2, etc. This was done using what I found mind-numbing (At that time) queries. Maybe you can work it :)Marita
N
3

Modeling your hierarchy in this way will lead to problems. The hierarchy_id column violates first normal form and the process for merging is going to be prone to update anomalies if you don't serialize/bottleneck access.

You should consider a table with just node_id and parent_id, see how it trivializes your merge problem

node_id   parent_id
1         NULL
2         1
3         2
4         3

node_id   parent_id
1         NULL
3         1
5         2
6         1

You would use recursive queries with this and you might be surprised how efficient the execution plans turn out. If you must have the flattened hierarchy column you can probably create an indexed view with a recursive query.

Naaman answered 14/8, 2011 at 18:33 Comment(2)
Interesting thought. My merges are serialized so I'm not too worried on that score, but that does look like it would make the merging much easier. I guess it comes at the cost of making the querying trickier and slower. I'll give it a try, thanks!Rain
That's certainly much faster, though not fast enough for my use case. At least with this approach I think I'm not artificially hobbling SQL server. If a straight merge like this is too slow, I suspect I have to look elsewhere.Rain
F
0

Here is a solution that moves the rows from source @S to target @T one level at a time. To simplify a bit I added a root node just to always have a parent present that is used when creating the new HierarcyID.

I have never used HierarchyID so there might absolutely be more efficient ways to do this but it should at least be more efficient than doing it one row at a time.

-- Target table
declare @T table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @T values
('/',             0), -- Needed for simplicity
('/1/',           1),
('/1/1/',         2),
('/1/2/',         3),
('/1/3/',         4)

-- Source table
declare @S table 
(
  hierarchy_id hierarchyid primary key,
  node_id bigint
)

insert into @S values
('/',               0),
('/1/',             1),
('/1/1/',           3),
('/1/2/',           5),
('/1/1/1/',         6)

declare @lvl int = 1

-- Move rows from @S to @T for each level
while exists(select *
             from @S
             where hierarchy_id.GetLevel() = @lvl)
begin

  insert into @T
  select T.hierarchy_id.GetDescendant(C.MaxID, null),
         S.node_id
  from (select S1.node_id, S2.node_id as ParentID
        from @S as S1
          inner join @S as S2
            on S1.hierarchy_id.GetAncestor(1) = S2.hierarchy_id
        where S1.hierarchy_id.GetLevel() = @lvl and
              S1.node_id not in (select node_id
                                 from @T)
       ) as S
    inner join @T as T
      on S.ParentID = T.node_id
    outer apply (select max(hierarchy_id) as MaxID
                 from @T as T2
                 where T.hierarchy_id = T2.hierarchy_id.GetAncestor(1)) as C       

    set @lvl = @lvl + 1
end

select *, hierarchy_id.ToString()
from @T
where hierarchy_id <> hierarchyid::GetRoot()  

Result:

hierarchy_id  node_id  (No column name)
------------  -------  ----------------
0x58          1        /1/
0x5AC0        2        /1/1/
0x5B40        3        /1/2/
0x5B56        6        /1/2/1/
0x5BC0        4        /1/3/
0x5C20        5        /1/4/
Ftlb answered 15/8, 2011 at 4:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.