Calculate the percentage of the root owned by its parents
Asked Answered
G

2

13

In simplified terms, I'm trying to calculate the percentage of the root of a tree owned by its parents, further up the tree. How can I do this in SQL alone?

Here's my (sample) schema. Please note that though the hierarchy itself is quite simple there is an additional holding_id, which means that a single parent can "own" different parts of their child.

create table hierarchy_test ( 
       id number -- "root" ID
     , parent_id number -- Parent of ID
     , holding_id number -- The ID can be split into multiple parts
     , percent_owned number (3, 2)
     , primary key (id, parent_id, holding_id) 
        );

And some sample data:

insert all 
 into hierarchy_test values (1, 2, 1, 1) 
 into hierarchy_test values (2, 3, 1, 0.25)
 into hierarchy_test values (2, 4, 1, 0.25)
 into hierarchy_test values (2, 5, 1, 0.1)
 into hierarchy_test values (2, 4, 2, 0.4)
 into hierarchy_test values (4, 5, 1, 1)
 into hierarchy_test values (5, 6, 1, 0.3)
 into hierarchy_test values (5, 7, 1, 0.2)
 into hierarchy_test values (5, 8, 1, 0.5)
select * from dual;

SQL Fiddle

The following query returns the calculation I would like to make. Due to the nature of SYS_CONNECT_BY_PATH it can't, to my knowledge, perform the calculation itself.

 select a.*, level as lvl
      , '1' || sys_connect_by_path(percent_owned, ' * ') as calc
   from hierarchy_test a
  start with id = 1
connect by nocycle prior parent_id = id

There are cyclical relationships in the data, just not in this example.

At the moment I'm going to use a pretty simple function to turn the string in the calc column into a number

create or replace function some_sum ( P_Sum in varchar2 ) return number is
   l_result number;
begin  
   execute immediate 'select ' || P_Sum || ' from dual'
     into l_result;
     
   return l_result;   
end;
/

This seems to be a ridiculous way of going about it and I would rather avoid the additional time that will be taken parsing the dynamic SQL1.

Theoretically, I think, I should be able to use the MODEL clause to calculate this. My problem is caused by the non-uniqueness of the tree. One of my attempts at using the MODEL clause to do this is:

select *
  from ( select a.*, level as lvl
              , '1' || sys_connect_by_path(percent_owned, ' * ') as calc
           from hierarchy_test a
          start with id = 1
        connect by nocycle prior parent_id = id
                 )
 model
 dimension by (lvl ll, id ii)
 measures (percent_owned, parent_id )
 rules upsert all ( 
   percent_owned[any, any]
   order by ll, ii  = percent_owned[cv(ll), cv(ii)] * nvl( percent_owned[cv(ll) - 1, parent_id[cv(ll), cv(ii)]], 1)
               )

This, understandably, fails with the following:

ORA-32638: Non unique addressing in MODEL dimensions

Using UNIQUE SINGLE REFERENCE fails for a similar reason, namely that the ORDER BY clause is not unique.

tl;dr

Is there a simple way to calculate the percentage of the root of a tree owned by its parents using only SQL? If I'm on the right track with MODEL where am I going wrong?

1. I'd also like to avoid the PL/SQL SQL context-switch. I realise that this is a tiny amount of time but this is going to be difficult enough to do quickly without adding an additional few minutes a day.

Gameness answered 10/12, 2012 at 18:25 Comment(9)
Have you considered using recursive subquery factoring instead of connect-by?Sibella
@jonearles: I just wanted to ask the same :-) I actually tried to get a solution using that approach but it didn't work out: sqlfiddle.com/#!4/c3d5d/15Ashia
I have considered it @jonearles but this is a small portion of the code. I will also need to work out what is at the top of the tree and whether the data is cyclical, which are all extremely easy with connect-by. The other problem is speed, it's significantly slower. I may end up with no choice though...Gameness
I would love to play with this because it seems very interesting but I am confused. If no one answers for a while, could you draw a picture of the tree with your sample data?Repellent
Sure @gloomy.penguin, here you go: i.sstatic.net/r9Ziu.png. It's not very complexGameness
This has a few suggestions in it for aggregation on a hiearchical table in oracle: #12221547 and there are a few links in it to other SO questions you might want to check out, too. I'm not having much luck playing with it...Repellent
I guess what we are talking about here is a need of an EVAL() function.Jimmy
@DanielHilgarth I think you almost had it. You just have to change t.id = a.parent_id to t.parent_ID = a.id.Sibella
@jonearles: Indeed. Thanks for checking.Ashia
G
4

This deserves an answer; though beware we're operating under a few special circumstances.

The first thing to mention is that the best possible way to do this is with recursive sub-query factoring/a recursive CTE as per Daniel Hilgarth and jonearles in the comments:

with temp (id, parent_id, percent_owned, calc) as (
  select a.id, a.parent_id, a.percent_owned, percent_owned as calc
    from hierarchy_test a
   where id = 1
   union all
  select a.id, a.parent_id, a.percent_owned, a.percent_owned * t.calc as calc
    from temp t
    join hierarchy_test a
      on t.parent_id = a.id
         )
select * 
  from temp

Their SQL Fiddle..

Unfortunately the complexity of the query and the size of the data we#re working with is such that this turned out to be impossible. There was no way to do it without full-scanning some overly large tables each time.

This doesn't necessarily mean that we're back to CONNECT BY. There is the opportunity to calculate hierarchies in bulk. Unfortunately, this turned out to be impossible as well; an hour in the database crashed. Thrice. We were using up almost 100GB of UNDO and the server just couldn't cope.

These are the special circumstances; we have to calculate hundreds of thousands of hierarchies in a few hours, at most. The average one is about 1.5 levels deep with maybe 5-10 leaves and 8-12 nodes in total. However, the outliers have 90k nodes, 27 levels and multiple cyclic relationships. The outliers aren't anywhere near rare enough.

So, CONNECT BY. Benchmarking Annjawn's solution against the PL/SQL EXECUTE IMMEDIATE suggested in the question indicated that for an above average tree XMLQuery() was up to 4 times slower. Excellent, had the answer; no other option; leave it at that.

Not.

Because we're calculating so many hierarchies with so many nodes we ended up getting overly long waits from library cache pin locks caused by the constant hard-parsing of hundreds of thousands of mathematical functions in EXECUTE IMMEDIATE.

No obvious response to that, so going back too Annjawn's solution it ends up 3 times quicker! The library cache pin locks completely disappear and we're back on the straight and narrow.

Not.

Unfortunately, there seems to be an Oracle bug in 11.2 that appears when you combine CONNECT BY, XMLQuery() and DBMS_SCHEDULER. In certain circumstances, normally in the larger hierarchies, it leaks huge amounts of memory. Lost the database and the server finding that one out. A report's been raised with Oracle and we're testing in 12c; though the memory leaks exhibit less, they still appear so 12c is out.

The solution? Wrap the XMLQuery() in a PL/SQL function. Memory leak solved, unfortunately this caused a large amount of contention for this function, and we started getting multi-hour Library cache: mutex x waits.. Querying x$kglob confirmed it was XMLTYPE that was getting hammered.

Andrey Nikolaev recommends either altering the system; rather not do that when everything else works fine, or using the DBMS_POOL.MARKHOT procedure to tell Oracle that you'll be accessing this object a lot. To the casual eye, this may have solved the issue, however, after about 10 minutes, and going through what seemed to be every lock that Oracle has, we ended up with 5 processes contending for CPU. Apparently there wasn't enough (54GB and 24 cores on the test box)...

We then started getting Cursor pin: s waits. Burleson recommends more hidden parameter finangling, Jonathan Lewis suggests that it's down to SGA resizing. As the DB was using automated SGA sizing we tried gradually increasing the shared pool, up to 30GB and only got back out old friend the Library cache: mutex x wait.

So, what's the solution? Who knows is the honest answer, but a Java stored procedure is working brilliantly so far, no memory leaks, no waits and significantly quicker than everything else.

I'm sure there's more out there... and I'd really like to get the MODEL clause to work if anyone has any ideas?

P.S. I can't claim credit for all of this; it's the work of about 3 people to get us to this stage...

Gameness answered 29/7, 2013 at 20:44 Comment(0)
J
6

In 11g, Probably something like-

SELECT a.*, LEVEL AS lvl
      ,XMLQuery( substr( sys_connect_by_path( percent_owned, '*' ), 2 ) RETURNING CONTENT).getnumberval() AS calc
   FROM hierarchy_test a
  START WITH id = 1
CONNECT BY nocycle PRIOR parent_id = id;

SQL Fiddle.

Or, as per your '1'|| trick-

SELECT a.*, LEVEL AS lvl
      , XMLQuery( ('1'|| sys_connect_by_path( percent_owned, '*' )) RETURNING CONTENT).getnumberval() AS calc
   FROM hierarchy_test a
  START WITH id = 1
CONNECT BY nocycle PRIOR parent_id = id;

Unfortunately in 10g, XMLQuery cannot accept functions and always expects a string literal for evaluation for example-

select XMLQuery('1*0.25' RETURNING CONTENT).getnumberval() as val 
  from dual;

works and returns 0.25, but

select XMLQuery(substr('*1*0.25',2) RETURNING CONTENT).getnumberval() as val
   from dual;

gives ORA-19102: XQuery string literal expected.

The query might get slower as the number of levels on a tree increases with an additional overhead of internal tree creation by XMLQuery itself. The most optimum method to achieve the result would still be a PL/SQL Function which by the way would work both in 10g and 11g.

Jimmy answered 10/12, 2012 at 20:41 Comment(3)
Thank you very much, this is absolutely beautiful and I'll definitely remember it for the future. Unfortunately it's also a lot slower then using the PL/SQL. I've benchmarked it at 4 times slower on a large tree.Gameness
Yes, its not for the purpose of evaluating an expression as eval but it does the work. It is slower since it needs to create a tree on itself with each connect by result, so the more levels you have the slower it could get. I also felt like PLSQL is a better way to go but since you were looking a simpler SQL method to do it, I could only think of this. :)Jimmy
It turns out you were right and we would be using this now. But... we encountered a few small issues.Gameness
G
4

This deserves an answer; though beware we're operating under a few special circumstances.

The first thing to mention is that the best possible way to do this is with recursive sub-query factoring/a recursive CTE as per Daniel Hilgarth and jonearles in the comments:

with temp (id, parent_id, percent_owned, calc) as (
  select a.id, a.parent_id, a.percent_owned, percent_owned as calc
    from hierarchy_test a
   where id = 1
   union all
  select a.id, a.parent_id, a.percent_owned, a.percent_owned * t.calc as calc
    from temp t
    join hierarchy_test a
      on t.parent_id = a.id
         )
select * 
  from temp

Their SQL Fiddle..

Unfortunately the complexity of the query and the size of the data we#re working with is such that this turned out to be impossible. There was no way to do it without full-scanning some overly large tables each time.

This doesn't necessarily mean that we're back to CONNECT BY. There is the opportunity to calculate hierarchies in bulk. Unfortunately, this turned out to be impossible as well; an hour in the database crashed. Thrice. We were using up almost 100GB of UNDO and the server just couldn't cope.

These are the special circumstances; we have to calculate hundreds of thousands of hierarchies in a few hours, at most. The average one is about 1.5 levels deep with maybe 5-10 leaves and 8-12 nodes in total. However, the outliers have 90k nodes, 27 levels and multiple cyclic relationships. The outliers aren't anywhere near rare enough.

So, CONNECT BY. Benchmarking Annjawn's solution against the PL/SQL EXECUTE IMMEDIATE suggested in the question indicated that for an above average tree XMLQuery() was up to 4 times slower. Excellent, had the answer; no other option; leave it at that.

Not.

Because we're calculating so many hierarchies with so many nodes we ended up getting overly long waits from library cache pin locks caused by the constant hard-parsing of hundreds of thousands of mathematical functions in EXECUTE IMMEDIATE.

No obvious response to that, so going back too Annjawn's solution it ends up 3 times quicker! The library cache pin locks completely disappear and we're back on the straight and narrow.

Not.

Unfortunately, there seems to be an Oracle bug in 11.2 that appears when you combine CONNECT BY, XMLQuery() and DBMS_SCHEDULER. In certain circumstances, normally in the larger hierarchies, it leaks huge amounts of memory. Lost the database and the server finding that one out. A report's been raised with Oracle and we're testing in 12c; though the memory leaks exhibit less, they still appear so 12c is out.

The solution? Wrap the XMLQuery() in a PL/SQL function. Memory leak solved, unfortunately this caused a large amount of contention for this function, and we started getting multi-hour Library cache: mutex x waits.. Querying x$kglob confirmed it was XMLTYPE that was getting hammered.

Andrey Nikolaev recommends either altering the system; rather not do that when everything else works fine, or using the DBMS_POOL.MARKHOT procedure to tell Oracle that you'll be accessing this object a lot. To the casual eye, this may have solved the issue, however, after about 10 minutes, and going through what seemed to be every lock that Oracle has, we ended up with 5 processes contending for CPU. Apparently there wasn't enough (54GB and 24 cores on the test box)...

We then started getting Cursor pin: s waits. Burleson recommends more hidden parameter finangling, Jonathan Lewis suggests that it's down to SGA resizing. As the DB was using automated SGA sizing we tried gradually increasing the shared pool, up to 30GB and only got back out old friend the Library cache: mutex x wait.

So, what's the solution? Who knows is the honest answer, but a Java stored procedure is working brilliantly so far, no memory leaks, no waits and significantly quicker than everything else.

I'm sure there's more out there... and I'd really like to get the MODEL clause to work if anyone has any ideas?

P.S. I can't claim credit for all of this; it's the work of about 3 people to get us to this stage...

Gameness answered 29/7, 2013 at 20:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.