Visiting a directed graph as if it were an undirected one, using a recursive query
Asked Answered
H

1

12

I need your help about the visit of a directed graph stored in a database.

Consider the following directed graph

1->2 
2->1,3 
3->1

A table stores those relations:

create database test;
\c test;

create table ownership (
    parent bigint,
    child  bigint,
    primary key (parent, child)
);

insert into ownership (parent, child) values (1, 2);
insert into ownership (parent, child) values (2, 1);
insert into ownership (parent, child) values (2, 3);
insert into ownership (parent, child) values (3, 1);

I'd like to extract all the semi-connected edges (i.e. the connected edges ignoring the direction) of the graph reachable from a node. I.e., if I start from parent=1, I'd like to have the following output

1,2
2,1
2,3
3,1

I'm using postgresql.

I've modified the example on Postgres' manual which explains recursive queries, and I've adapted the join condition to go "up" and "down" (doing so I ignore the directions). My query is the following one:

\c test

WITH RECURSIVE graph(parent, child, path, depth, cycle) AS (
SELECT o.parent, o.child, ARRAY[ROW(o.parent, o.child)], 0, false
    from ownership o
    where o.parent = 1
UNION ALL
SELECT 
    o.parent, o.child,
    path||ROW(o.parent, o.child), 
    depth+1, 
    ROW(o.parent, o.child) = ANY(path)
    from 
        ownership o, graph g
    where 
        (g.parent = o.child or g.child = o.parent) 
        and not cycle

)
select  g.parent, g.child, g.path, g.cycle
from
    graph g

its output follows:

 parent | child |               path                | cycle 
--------+-------+-----------------------------------+-------
      1 |     2 | {"(1,2)"}                         | f
      2 |     1 | {"(1,2)","(2,1)"}                 | f
      2 |     3 | {"(1,2)","(2,3)"}                 | f
      3 |     1 | {"(1,2)","(3,1)"}                 | f
      1 |     2 | {"(1,2)","(2,1)","(1,2)"}         | t
      1 |     2 | {"(1,2)","(2,3)","(1,2)"}         | t
      3 |     1 | {"(1,2)","(2,3)","(3,1)"}         | f
      1 |     2 | {"(1,2)","(3,1)","(1,2)"}         | t
      2 |     3 | {"(1,2)","(3,1)","(2,3)"}         | f
      1 |     2 | {"(1,2)","(2,3)","(3,1)","(1,2)"} | t
      2 |     3 | {"(1,2)","(2,3)","(3,1)","(2,3)"} | t
      1 |     2 | {"(1,2)","(3,1)","(2,3)","(1,2)"} | t
      3 |     1 | {"(1,2)","(3,1)","(2,3)","(3,1)"} | t
(13 rows)

I have a problem: the query extracts the same edges many times, as they are reached through different paths, and I'd like to avoid this. If I modify the outer query into

select  distinct g.parent, g.child from graph

I have the desired result, but inefficiencies remain in the WITH query, as unneeded joins are done. So, is there a solution to extract the reachable edges of a graph in a db, starting from a given one, without using distinct?

I also have another problem (this one is solved, look at the bottom): as you can see from the output, cycles stop only when a node is reached for the second time. I.e. I have (1,2) (2,3) (1,2). I'd like to stop the cycle before cycling over that last node again, i.e. having (1,2) (2,3). I've tried to modify the where condition as follows

where
    (g.parent = o.child or g.child = o.parent) 
    and (ROW(o.parent, o.child) <> any(path))
    and not cycle

to avoid visiting already visited edges, but it doesn't work and I cannot understand why ((ROW(o.parent, o.child) <> any(path)) should avoid cycling before going on the cycled edge again but doesn't work). How can I do to stop cycles one step before the node that closes the cycle?

Edit: as danihp suggested, to solve the second problem I used

where
    (g.parent = o.child or g.child = o.parent) 
    and not (ROW(o.parent, o.child) = any(path))
    and not cycle

and now the output contains no cycles. Rows went from 13 to 6, but I still have duplicates, so the main (the first) problem of extracting all the edges without duplicates and without distinct is still alive. Current output with and not ROW

 parent | child |           path            | cycle 
--------+-------+---------------------------+-------
      1 |     2 | {"(1,2)"}                 | f
      2 |     1 | {"(1,2)","(2,1)"}         | f
      2 |     3 | {"(1,2)","(2,3)"}         | f
      3 |     1 | {"(1,2)","(3,1)"}         | f
      3 |     1 | {"(1,2)","(2,3)","(3,1)"} | f
      2 |     3 | {"(1,2)","(3,1)","(2,3)"} | f
(6 rows)

Edit #2:: following what Erwin Brandstetter suggested, I modified my query, but if I'm not wrong, the proposed query gives MORE rows than mine (ROW comparison is still there as it seems more clear to me, even I understood that string comparison will be more efficient). Using the new query, I obtain 20 rows, while mine gives 6 rows

WITH RECURSIVE graph(parent, child, path, depth) AS (
SELECT o.parent, o.child, ARRAY[ROW(o.parent, o.child)], 0
    from ownership o
    where 1 in (o.child, o.parent)
UNION ALL
SELECT 
    o.parent, o.child,
    path||ROW(o.parent, o.child), 
    depth+1
    from 
        ownership o, graph g
    where 
        g.child in (o.parent, o.child) 
        and ROW(o.parent, o.child) <> ALL(path)

)
select  g.parent, g.child from graph g

Edit 3: so, as Erwin Brandstetter pointed out, the last query was still wrong, while the right one can be found in his answer.

When I posted my first query, I hadn't understood that I was missing some joins, as it happens in the following case: if I start with the node 3, the db selects the rows (2,3) and (3,1). Then, the first inductive step of the query would select, joining from these rows, the rows (1,2), (2,3) and (3,1), missing the row (2,1) that should be included in the result as conceptually the algorithm would imply ( (2,1) is "near" (3,1) )

When I tried to adapt the example in Postgresql manual, I was right trying to join ownership's parent and child, but I was wrong not saving the value of graph that had to be joined in each step.

These type of queries seem to generate a different set of rows depending on the starting node (i.e. depending on the set of rows selected in the base step). So, I think it could be useful to select just one row containing the starting node in the base step, as you'll get any other "adjacent" node anyway.

Humanitarianism answered 6/1, 2012 at 21:22 Comment(8)
looks like a map reduce problemMaypole
have you try with and NOT (ROW(o.parent, o.child) = any(path)) ? Is not enought with: select distinct g.parent, g.child from graph g where cycle = f ?Willful
thank you danihp, "and not ROW" worked!, so the second problem is solved! (going to update the question). Rows went from 13 to 6! But I still have duplicates. You are right, using distinct the output would be what I need, but some unneeded joins would be still in the WITH query, and if you have a large graph, it can be a performance problem. I'd like to have the best efficiency possible, not having - if possible - duplicates even not using distinctHumanitarianism
You say it is a "directed graph", and yet you traverse the graph from child to parent, too? Is that intended?Falster
yes, I need to do this kind of traversal as I need to "explode" a node and find the connected component of the graph starting from that nodeHumanitarianism
But then it is not a directed graph. As I understand it, the nature of "directed" is the direction of the edges, or am I wrong there?Falster
You are right, but the kind of explosion I need is one that lets me see "around" a given node. I have companies and ownership linking them, so given a company I need the incoming and outgoing ownership links. Your query could be the solution, I'll check better tomorrowHumanitarianism
As an aside, simpler syntax: (ROW(o.parent, o.child) <> ALL(path)) instead of NOT (ROW(o.parent, o.child) = ANY(path)).Falster
F
8

Could work like this:

WITH RECURSIVE graph AS (
    SELECT parent
          ,child
          ,',' || parent::text || ',' || child::text || ',' AS path
          ,0 AS depth
    FROM   ownership
    WHERE  parent = 1

    UNION ALL
    SELECT o.parent
          ,o.child
          ,g.path || o.child || ','
          ,g.depth + 1
    FROM   graph g
    JOIN   ownership o ON o.parent = g.child
    WHERE  g.path !~~ ('%,' || o.parent::text || ',' || o.child::text || ',%')
    )
SELECT  *
FROM    graph

You mentioned performance, so I optimized in that direction.

Major points:

  • Traverse the graph only in the defined direction.

  • No need for a column cycle, make it an exclusion condition instead. One less step to go. That is also the direct answer to:

How can I do to stop cycles one step before the node that closes the cycle?

  • Use a string to record the path. Smaller and faster than an array of rows. Still contains all necessary information. Might change with very big bigint numbers, though.

  • Check for cycles with the LIKE operator (~~), should be much faster.

  • If you don't expect more that 2147483647 rows over the course of time, use plain integer columns instead of bigint. Smaller and faster.

  • Be sure to have an index on parent. Index on child is irrelevant for my query. (Other than in your original where you traverse edges in both directions.)

  • For huge graphs I would switch to a plpgsql procedure, where you can maintain the path as temp table with one row per step and a matching index. A bit of an overhead, that will pay off with huge graphs, though.


Problems in your original query:

WHERE (g.parent = o.child or g.child = o.parent) 

There is only one endpoint of your traversal at any point in the process. As you wlak the directed graph in both directions, the endpoint can be either parent or child - but not both of them. You have to save the endpoint of every step, and then:

WHERE g.child IN (o.parent, o.child) 

The violation of the direction also makes your starting condition questionable:

WHERE parent = 1

Would have to be

WHERE 1 IN (parent, child)

And the two rows (1,2) and (2,1) are effectively duplicates this way ...


Additional solution after comment

  • Ignore direction
  • Still walk any edge only once per path.
  • Use ARRAY for path
  • Save original direction in path, not actual direction.

Note, that this way (2,1) and (1,2) are effective duplicates, but both can be used in the same path.

I introduce the column leaf which saves the actual endpoint of every step.

WITH RECURSIVE graph AS (
    SELECT CASE WHEN parent = 1 THEN child ELSE parent END AS leaf
          ,ARRAY[ROW(parent, child)] AS path
          ,0 AS depth
    FROM   ownership
    WHERE  1 in (child, parent)

    UNION ALL
    SELECT CASE WHEN o.parent = g.leaf THEN o.child ELSE o.parent END -- AS leaf
          ,path || ROW(o.parent, o.child) -- AS path
          ,depth + 1 -- AS depth
    FROM   graph g
    JOIN   ownership o ON g.leaf in (o.parent, o.child) 
    AND    ROW(o.parent, o.child) <> ALL(path)
    )
SELECT *
FROM   graph
Falster answered 6/1, 2012 at 22:58 Comment(5)
at first glance, it seems to me you find the descendants of the nodes, and this is not what I asked for, but your use of string is and interesting one, and you're right, no need of a cycle columnHumanitarianism
I tried to modify the query using your condition only on the child, but, as you can read from my edit #2, I obtain more rows than before. It's not completely clear to me why, though.Humanitarianism
@cdarwin: You still got a logical bug. g.child in (o.parent, o.child) would suggest that you walk the graph in the denoted direction, from parent to child. But you allow both directions, so you have to save the current end point of every step explicitly (parent or child). I added a solution for that scenario.Falster
you were absolutely right. I wrote some final considerations as edit #3. I hoped a more efficient solution could be found just using recursive queries, but I now understand it's not possible, and as you say one has to use plpgsql to do a "classical" visitHumanitarianism
@erwin-brandstetter, where may I find an example of plpgsql procedure for such big graphs? Thanks!Dictation

© 2022 - 2024 — McMap. All rights reserved.