How to count all the connected nodes (rows) in a graph on Postgres?

Asked 12/8, 2017 at 0:13 Answered 12/8, 2017 at 13:11

Solved sql postgresql graph connected-components

My table has account_id and device_id. One account_id could have multiple device_ids and vice versa. I am trying to count the depth of each connected many-to-many relationship.

Ex:

account_id | device_id
1 | 10
1 | 11
1 | 12
2 | 10
3 | 11
3 | 13
3 | 14
4 | 15
5 | 15
6 | 16

How do I construct a query that knows to combine accounts 1-3 together, 4-5 together, and leave 6 by itself? All 7 entries of accounts 1-3 should be grouped together because they all touched the same account_id or device_id at some point. I am trying to group them together and output the count.

Account 1 was used on device's 10, 11, 12. Those devices used other accounts too so we want to include them in the group. They used additional accounts 2 and 3. But account 3 was further used by 2 more devices so we will include them as well. The expansion of the group brings in any other account or device that also "touched" an account or device already in the group.

A diagram is shown below:

Bahuvrihi answered 12/8, 2017 at 0:13 Comment(4)

I can't help but feel like we need some more information to help you out here. Please elaborate further on how these groups are derived. – Adao 12/8, 2017 at 0:19

@Adao Sorry for the confusion. account_id 1 has seen device_ids 10, 11, 12. device_id 10 and 11 also have other accounts that saw them. account_id 2 also saw device_id 10 and account_id 3 saw device_id 11. Since account_id 3 has been "touched" we also need to include all other device_ids that it has seen, so device_id 13 and 15. – Bahuvrihi 12/8, 2017 at 0:28

Interesting question. Will definitely vote for an elegant answer. My 5 cents: probably EXTENSION intarray with its "intersection for arrays" + "WITH RECURSIVE" approach together can make it neat. – Celsacelsius 12/8, 2017 at 4:12

Suggest changing the title to something like: "find (connected) clusters in a graph" (there could be a better mathematical wording) – Vernverna 12/8, 2017 at 13:14

You can use a recursive cte:

with recursive t(account_id, device_id) as (
       select 1, 10 union all
       select 1, 11 union all
       select 1, 12 union all
       select 2, 10 union all
       select 3, 11 union all
       select 3, 13 union all
       select 3, 14 union all
       select 4, 15 union all
       select 5, 15 union all
       select 6, 16
     ),
     a as (
      select distinct t.account_id as a, t2.account_id as a2
      from t join
           t t2
           on t2.device_id = t.device_id and t.account_id >= t2.account_id
     ),
     cte as (
      select a.a, a.a2 as mina
      from a
      union all
      select a.a, cte.a
      from cte join
           a
           on a.a2 = cte.a and a.a > cte.a
     )
select grp, array_agg(a)
from (select a, min(mina) as grp
      from cte
      group by a
     ) a
group by grp;

Here is a SQL Fiddle.

Relay answered 12/8, 2017 at 1:50 Comment(1)

Thanks @GordonLinoff. It seems this approach does great aggregating the groups as desired, but when I try to do a count, I am unable to get the 7 rows for group 1. Do you have any thoughts on how to get an accurate count from the grouping as well? I will keep studying this answer on my end. – Bahuvrihi 12/8, 2017 at 8:30

You can GROUP BY the device_id and then aggregate together the account_id into a Postgres array. Here is an example query, although I'm not sure what your actual table name is.

SELECT
  device_id,
  array_agg(account_id) as account_ids
FROM account_device --or whatever your table name is
GROUP BY device_id;

Results - hope it's what you're looking for:

16 | {6}
15 | {4,5}
13 | {3}
11 | {1,3}
14 | {3}
12 | {1}
10 | {1,2}

Gisela answered 12/8, 2017 at 0:42 Comment(2)

Thanks for the reply. I think that gets me half way there. So notice how device 10 and 11 both share account_id 1? Because it is shared, I want to get those grouped together. Basically I want to somehow get a group like device_ids {10,11,12,13,14} | {1,2,3} – Bahuvrihi 12/8, 2017 at 0:49

Ah now I get it! Yeah that's going to be a much more complex query, but still very doable. As one of the other answers suggests, a recursive CTE is good for things like this that involve "graph traversal". Here is a more comprehensive resource slideshare.net/quipo/rdbms-in-the-social-networks-age/… – Gisela 12/8, 2017 at 5:47

-- \i tmp.sql

CREATE TABLE graph(
        account_id integer NOT NULL --references accounts(id)
        , device_id integer not NULL --references(devices(id)
        ,PRIMARY KEY(account_id, device_id)
        );

INSERT INTO graph (account_id, device_id)VALUES
 (1,10) ,(1,11) ,(1,12)
,(2,10)
,(3,11) ,(3,13) ,(3,14)
,(4,15)
,(5,15)
,(6,16)
        ;

-- SELECT* FROM graph ;

        -- Find the (3) group leaders
WITH seed AS (
        SELECT row_number() OVER () AS cluster_id -- give them a number
        , g.account_id
        , g.device_id
        FROM graph g
        WHERE NOT EXISTS (SELECT*
                FROM graph nx
                WHERE (nx.account_id = g.account_id OR nx.device_id = g.device_id)
                AND nx.ctid < g.ctid
                )
        )
-- SELECT * FROM seed;
;

WITH recursive omg AS (
        --use the above CTE in a sub-CTE
        WITH seed AS (
                SELECT row_number()OVER () AS cluster_id
                , g.account_id
                , g.device_id
                , g.ctid AS wtf --we need an (ordered!) canonical id for the tuples
                                -- ,just to identify and exclude them
                FROM graph g
                WHERE NOT EXISTS (SELECT*
                        FROM graph nx
                        WHERE (nx.account_id = g.account_id OR nx.device_id = g.device_id) AND nx.ctid < g.ctid
                        )
                )
        SELECT s.cluster_id
                , s.account_id
                , s.device_id
                , s.wtf
        FROM seed s
        UNION ALL
        SELECT o.cluster_id
        , g.account_id
        , g.device_id
        , g.ctid AS wtf
        FROM omg o
        JOIN graph g ON (g.account_id = o.account_id OR g.device_id = o.device_id)
                         -- AND (g.account_id > o.account_id OR g.device_id > o.device_id)
                         AND g.ctid > o.wtf
                -- we still need to exclude duplicates here
                -- (which could occur if there are cycles in the graph)
                -- , this could be done using an array
        )
SELECT *
FROM omg
ORDER BY cluster_id, account_id,device_id
        ;

Results:

DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 10
 cluster_id | account_id | device_id 
------------+------------+-----------
          1 |          1 |        10
          2 |          4 |        15
          3 |          6 |        16
(3 rows)

 cluster_id | account_id | device_id |  wtf   
------------+------------+-----------+--------
          1 |          1 |        10 | (0,1)
          1 |          1 |        11 | (0,2)
          1 |          1 |        12 | (0,3)
          1 |          1 |        12 | (0,3)
          1 |          2 |        10 | (0,4)
          1 |          3 |        11 | (0,5)
          1 |          3 |        13 | (0,6)
          1 |          3 |        14 | (0,7)
          1 |          3 |        14 | (0,7)
          2 |          4 |        15 | (0,8)
          2 |          5 |        15 | (0,9)
          3 |          6 |        16 | (0,10)
(12 rows)

Newer version (I added an Id column to the table)

        -- for convenience :set of all adjacent nodes.
CREATE TEMP VIEW pair AS
        SELECT one.id AS one
                , two.id AS two
        FROM graph one
        JOIN graph two ON (one.account_id = two.account_id OR one.device_id = two.device_id)
                AND one.id <> two.id
        ;

WITH RECURSIVE flood AS (
        SELECT g.id, g.id AS parent_id
        , 0 AS lev
        , ARRAY[g.id]AS arr
        FROM graph g
        UNION ALL
        SELECT c.id, p.parent_id AS parent_id
        , 1+p.lev AS lev
        , p.arr || ARRAY[c.id] AS arr
        FROM graph c
        JOIN flood p ON EXISTS (
                        SELECT * FROM pair WHERE p.id = pair.one AND c.id = pair.two)
                AND p.parent_id < c.id
                AND NOT p.arr @> ARRAY[c.id]    -- avoid cycles/loops
        )
SELECT g.*, a.parent_id
        , dense_rank() over (ORDER by a.parent_id)AS group_id
FROM graph g
JOIN (SELECT id, MIN(parent_id) AS parent_id
        FROM flood
        GROUP BY id
        ) a
ON g.id = a.id
ORDER BY a.parent_id, g.id
        ;

New results:

CREATE VIEW
 id | account_id | device_id | parent_id | group_id 
----+------------+-----------+-----------+----------
  1 |          1 |        10 |         1 |        1
  2 |          1 |        11 |         1 |        1
  3 |          1 |        12 |         1 |        1
  4 |          2 |        10 |         1 |        1
  5 |          3 |        11 |         1 |        1
  6 |          3 |        13 |         1 |        1
  7 |          3 |        14 |         1 |        1
  8 |          4 |        15 |         8 |        2
  9 |          5 |        15 |         8 |        2
 10 |          6 |        16 |        10 |        3
(10 rows)

Vernverna answered 12/8, 2017 at 13:11 Comment(2)

Same problem, try it with this input (1, 100), (1, 101), (2, 101), (2, 102), (3, 102), (3, 103), (4, 103), (4, 104), (5, 104), (5, 105) - it's one group, not four. – Celsacelsius 12/8, 2017 at 17:1

Yep, That is because my g.ctid > o.wtf is a bit too restrictive (same for the seed) . Really need an array/set to grow in both (all four) directions and dedupe. [a better way is probably just a simple iterative union-find kind of thing] – Vernverna 12/8, 2017 at 17:41

Recommended topics

Hot tags