Merging users with multiple refs and counting their collective assets
There is a set of users. A person can have multiple users, but their ref1 and ref2 values may coincide across users, which links those users together. ref1 and ref2 do not overlap: a value that appears in ref1 never appears in ref2.

A user can own multiple assets. I want to "merge" users that share one or more refs and then count how many assets they own together. An asset's owner may have no entry in the users table; in that case I just want to propagate the owner into ref2 and set asset_count and asset_ids.

Here is an example schema to illustrate:

Example assets

SELECT * FROM assets;
 id | name | owner
----+------+-------
  1 | #1   | a
  2 | #2   | b
  3 | #3   | c
  4 | #4   | a
  5 | #5   | c
  6 | #6   | d
  7 | #7   | e
  8 | #8   | d
  9 | #9   | a
 10 | #10  | a
 11 | #11  | z

Example users

SELECT * FROM users;
 id | username | ref1 | ref2
----+----------+------+------
  1 | bobo     | a    | d
  2 | toto     | b    | e
  3 | momo     | c    | d
  4 | lolo     | a    | f
  5 | popo     | c    | f

What I want to get in the end

SELECT * FROM results;
 ids     | usernames           | refs1 | refs2 | asset_ids        | asset_count
---------+---------------------+-------+-------+------------------+-------------
 1,3,4,5 | bobo,momo,lolo,popo | a,c   | d,f   | 1,3,4,5,6,8,9,10 | 8
 2       | toto                | b     | e     | 2,7              | 2
         |                     |       | z     | 11               | 1

I've tried different approaches, but this is what I currently have:

Closest I have got

SELECT
  ARRAY_AGG(DISTINCT u.id) AS ids,
  ARRAY_AGG(DISTINCT u.username) AS usernames,
  ARRAY_AGG(DISTINCT u.ref1) AS refs1,
  ARRAY_AGG(DISTINCT u.ref2) AS refs2,
  COUNT(DISTINCT a.id) AS asset_count
FROM assets a
JOIN users u ON a.owner = u.ref1 OR a.owner = u.ref2
GROUP BY a.owner
ORDER BY MIN(a.id);
 ids | usernames | refs1 | refs2 | asset_count
-----+-----------+-------+-------+-------------
 1,4 | bobo,lolo | a     | d,f   | 4
 2   | toto      | b     | e     | 1
 3,5 | momo,popo | c     | d,f   | 2
 1,3 | bobo,momo | a,c   | d     | 2
 2   | toto      | b     | e     | 1

If I merge the above rows on overlapping ids, I almost get the result I want (minus the missing entries in the users table). The merging can easily be done in code, but then I cannot paginate etc. I want to do this in the DB layer if possible.

I want either a solution to the problem or a good explanation of why it is not possible (with examples).

Please check out my DB Fiddle.

Overlong answered 26/5, 2023 at 8:28 Comment(3)
Can ref1 and ref2 overlap? Or are these references to distinct sets, like the sample data seems to suggest? Either way, this could form a chain through all rows in the users table. Think of (a,x), (b,x), (b,y), (c,y), (c,z) etc. Such graph data cannot easily be resolved with set-based operations. You would start by defining exactly what can and cannot be in those columns ...Quirk
Hi erwin. ref1 and ref2 will in theory never overlap. A reference defined in ref1 will not be found in ref2 and if there is a "collision" it should not be counted as one.Raul
Updated my example and added bounty.Raul

There are two distinct parts to the question:

  • the first one is how to generate groups of users that have common references
  • the second part is how to distribute the assets among the groups, while taking into account orphan assets

Part 1 : a graph-walking problem

Identifying clusters of users based on common references reads like a graph-walking problem. That's a complex task in SQL, and requires a recursive query. The pattern is to unpivot users' references to generate nodes, then identify edges (nodes that have a ref in common), and finally walk through the graph (without looping) to generate groups.

In Postgres, arrays come in handy to aggregate nodes:

with recursive 
    nodes as (
        select u.id, r.ref
        from users u 
        cross join lateral ( values (u.ref1), (u.ref2) ) r(ref)
    ),
    edges as (
        select distinct n1.id as id1, n2.id as id2
        from nodes n1 
        inner join nodes n2 on n1.ref = n2.ref
    ),
    rcte as (
        select id1, id2, array[id1] as visited from edges where id1 = id2
        union all
        select r.id1, e.id2, r.visited || e.id2
        from rcte r
        inner join edges e on e.id1 = r.id2
        where e.id2 <> all(r.visited) 
    ),
    groups as (
        select id1 as id, array_agg(distinct id2 order by id2) as ids
        from rcte
        group by id1
    )
select * from groups order by id
 id | ids
----+-----------
  1 | {1,3,4,5}
  2 | {2}
  3 | {1,3,4,5}
  4 | {1,3,4,5}
  5 | {1,3,4,5}
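As an aside, the same cluster detection can be sketched outside the database with a union-find structure. This is only an illustration of the grouping logic, with the sample users from the question hard-coded; it is not a replacement for the recursive query:

```python
from collections import defaultdict

# Union-find over the sample users from the question: users sharing any ref
# end up in the same cluster. Illustration only, not the answer's SQL.
users = {1: ("a", "d"), 2: ("b", "e"), 3: ("c", "d"), 4: ("a", "f"), 5: ("c", "f")}

parent = {uid: uid for uid in users}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

seen = {}                               # ref -> first user that carried it
for uid, refs in users.items():
    for ref in refs:
        if ref in seen:
            union(uid, seen[ref])       # shared ref: merge the two users
        else:
            seen[ref] = uid

groups = defaultdict(list)
for uid in users:
    groups[find(uid)].append(uid)
print(sorted(sorted(g) for g in groups.values()))
# → [[1, 3, 4, 5], [2]]
```

This matches the groups CTE's output: {1,3,4,5} and {2}.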

Part 2 : left join and aggregation

Now that we have identified the groups, we can check for assets. Since you want all assets in the result, we start from the assets table, then bring in the users and the groups with left joins. We can still group by the user groups, which puts all orphan assets in the same group (where the group is null) - that's exactly what we want.

The last step is array aggregation; the "propagation" of the owners of orphan assets to ref2 can be handled with a case expression.

with recursive 
    nodes  as (...),
    edges  as (...),
    rcte   as (...),
    groups as (...)
select g.ids,
    array_agg(distinct u.username) as usernames,
    array_agg(distinct u.ref1) as refs1,
    case when g.ids is null then array_agg(distinct a.owner) else array_agg(distinct u.ref2) end as refs2,
    array_agg(distinct a.id) as asset_ids,
    count(distinct a.id) as asset_count
from assets a
left join users u on a.owner in (u.ref1, u.ref2)
left join groups g on g.id = u.id
group by g.ids
 ids       | usernames             | refs1  | refs2 | asset_ids          | asset_count
-----------+-----------------------+--------+-------+--------------------+-------------
 {1,3,4,5} | {bobo,lolo,momo,popo} | {a,c}  | {d,f} | {1,3,4,5,6,8,9,10} | 8
 {2}       | {toto}                | {b}    | {e}   | {2,7}              | 2
 null      | {NULL}                | {NULL} | {z}   | {11}               | 1

Demo on DB Fiddle


Add-on: Graph-walking performance

Performance will suffer on networks that have a lot of edges, especially if there are just a few clusters in the graph, each containing many users. We can try to optimize the query for such situations; the idea is to limit the number of paths that need to be walked, by aggregating all paths of each user at each iteration.

This query passes your test case with 200 users that all belong to the same cluster (whereas the first query exhausts the DB Fiddle resources):

with recursive 
    nodes as (
        select u.id, r.ref
        from users u 
        cross join lateral ( values (u.ref1), (u.ref2) ) r(ref)
    ),
    edges as (
        select distinct n1.id as id1, n2.id as id2
        from nodes n1 
        inner join nodes n2 on n1.ref = n2.ref
    ),
    rcte as (
        select id1 as id, array[id1] as visited from edges where id1 = id2
        union all
        select id, array_agg(distinct visited) as visited
        from (
            select r.id, v.visited
            from rcte r
            inner join edges e on e.id1 = any(r.visited)
            cross join lateral unnest(r.visited || e.id2) v(visited)
            where e.id2 <> all(r.visited)
        ) as x
        group by id
    ),
    groups as (
        select distinct on (id) id, visited as ids 
        from rcte 
        order by id, array_length(visited, 1) desc 
    )
select * from groups g order by id
Unbind answered 28/5, 2023 at 21:37 Comment(6)
A very elegant solution indeed! Thank you! Is there a way to improve its performance? What indices would you suggest? When running on my "real" scenario the first part of the query runs for minutes, eats up the disk space and finally crashes with "no space left on device". I wonder if there are circular refs in my dataset that are causing this? Is there a way to check that? There are around 30k users in my table.Raul
I'm getting 82k nodes and 114k edges. If I do where r.ref is not null I get 41k nodes and 114k edges.Raul
It seems like there is for example one person that has 200 users with the same ref in the nodes table. I've extracted this data to show the problem I currently have. Check out the db fiddle: dbfiddle.uk/GcgKiUJWRaul
Any suggestions why those nodes break the query? rcte adds 200 rows, then 40k, then 8M. dbfiddle.uk/GcgKiUJWRaul
@BjørnBråthen; I see... I updated my answer with an attempt to optimize the query for your use case.Unbind
Thanks! it seems to run just fine now without running low on diskRaul

This is hard (and typically very slow) to solve with only set-based operations in SQL. So I loop in a PL/pgSQL function:

CREATE OR REPLACE FUNCTION f_merged_users()
  RETURNS TABLE (
     ids int[]             -- adapt to your actual data types !!!
   , usernames text[]
   , refs1 text[]
   , refs2 text[]
   , asset_ids int[]
   , asset_count int
  )
  LANGUAGE plpgsql AS
$func$
BEGIN
   /*
   -- set enough temp buffers, so temp tables don't spill to disk (optional)
   -- NOTE: this can't be used after temporary tables have been accessed in the session
   -- using 2x size of users table. adapt to your needs!
   -- only fires when the new size is bigger, and it has not been set manually, yet
   PERFORM set_config('temp_buffers', (pg_table_size('users') * 2 / 8192)::text, true)
   FROM   pg_settings s
   WHERE  s.name = 'temp_buffers'
   AND    pg_table_size('users') * 2 / 8192 > s.setting::bigint
   AND    s.source = 'default';
   */

   -- create two temp tables: one for ref1, one for ref2
   -- based on your information that ref1 & ref2 don't overlap
   CREATE TEMP TABLE r1 ON COMMIT DROP AS
   SELECT ARRAY[ref1] AS r, array_agg(id) AS i
   FROM  (TABLE users ORDER BY id) t
   GROUP  BY ref1
   ORDER  BY ref1;

   CREATE TEMP TABLE r2 ON COMMIT DROP AS
   SELECT ARRAY[ref2] AS r, array_agg(id) AS i
   FROM  (TABLE users ORDER BY id) t
   GROUP  BY ref2
   ORDER  BY ref2;

   -- fold rows where a common link in the other table exists
   -- achieve that by deleting rows and inserting the merged results
   -- loop back and forth until no more rows can be folded
   LOOP
      WITH d AS (
         DELETE FROM r2
         USING  r1
         WHERE  EXISTS (
            SELECT FROM r2 r0
            WHERE  r1.i && r2.i
            AND    r1.i && r0.i
            AND    r2.ctid <> r0.ctid
            )
         RETURNING r1.ctid, r2.*
         )
      INSERT INTO r2
      SELECT r.r, i.i
      FROM  (
         SELECT ctid, array_agg(ref ORDER BY ref) AS r
         FROM   d, unnest(d.r) ref
         GROUP  BY 1
         ) r
      JOIN  (
         SELECT ctid, array_agg(id ORDER BY id) AS i
         FROM   d, unnest(d.i) id
         GROUP  BY 1
         )  i USING (ctid);

      EXIT WHEN NOT FOUND;  -- no folding happened, stop loop

      WITH d AS (
         DELETE FROM r1
         USING  r2
         WHERE  EXISTS (
            SELECT FROM r1 r0
            WHERE  r2.i && r1.i
            AND    r2.i && r0.i
            AND    r1.ctid <> r0.ctid
            )
         RETURNING r2.ctid, r1.*
         )
      INSERT INTO r1
      SELECT r.r, i.i
      FROM  (
         SELECT ctid, array_agg(ref ORDER BY ref) AS r
         FROM   d, unnest(d.r) ref
         GROUP  BY 1
         ) r
      JOIN  (
         SELECT ctid, array_agg(id ORDER BY id) AS i
         FROM   d, unnest(d.i) id
         GROUP  BY 1
         ) i USING (ctid);
         
      EXIT WHEN NOT FOUND;     -- no folding happened, stop loop
   END LOOP;

   -- output result
   RETURN QUERY
   (
   SELECT i                         -- AS ids
        , u.usernames
        , r1.r                      -- AS refs1
        , r2.r                      -- AS refs2
        , a.asset_ids
        , cardinality(a.asset_ids)  -- AS asset_count
   FROM   r1
   JOIN   r2 USING (i)
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT username
         FROM   users u
         WHERE  u.id = ANY (i)
         ORDER  BY u.id
         ) AS usernames
      ) u
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT a.id
         FROM   assets a
         WHERE  a.owner = ANY (r1.r || r2.r)
         ORDER  BY a.id
         ) AS asset_ids
      ) a
   ORDER  BY i
   )
   UNION ALL  --  add "missing entries in the user table"
   (
   SELECT null             -- AS ids
        , null             -- AS usernames
        , null             -- AS refs1
        , ARRAY[a.owner]   -- AS refs2
        , ARRAY[a.id]      -- AS ids
        , 1                -- AS asset_count
   FROM   assets a
   WHERE  NOT EXISTS (
      SELECT FROM users u
      WHERE  a.owner IN (u.ref1, u.ref2)
      )
   ORDER  BY a.id
   );
END
$func$;

fiddle

Call:

SELECT * FROM f_merged_users();

See inline comments with explanation.

The general idea is to avoid walking potentially lengthy graphs one step at time, and do as much set-based work as possible. The best solution heavily depends on actual data distribution.

Based on your information that ref1 and ref2 don't overlap, I create two temp tables: one for ref1, one for ref2. The initial queries aggregate all users on the same resource. That takes care of many users on the same resource (same person) right away. You commented that you have a lot of those.

In every iteration of the loop, I merge all rows in the other temp table that are linked together by the same row in this temp table, and vice versa. The two queries are completely symmetrical. (It may be more efficient to start with one or the other.) We are done as soon as one query does not find any more rows to merge.
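To make the folding concrete, here is a rough Python sketch of that loop, with plain sets standing in for the temp-table rows (sample data from the question; the function itself does this with set-based DELETE/INSERT rather than nested loops):

```python
# Each side is a list of (refs, user_ids) rows, like the temp tables r1 / r2.
# Sample users from the question: id -> (ref1, ref2). Illustration only.
users = {1: ("a", "d"), 2: ("b", "e"), 3: ("c", "d"), 4: ("a", "f"), 5: ("c", "f")}

def initial(side):
    # aggregate user ids per ref, like the initial CREATE TEMP TABLE queries
    rows = {}
    for uid, refs in users.items():
        rows.setdefault(refs[side], set()).add(uid)
    return [({ref}, ids) for ref, ids in rows.items()]

r1, r2 = initial(0), initial(1)

def fold(target, other):
    """Merge rows of `target` whose user ids are linked by a single row of `other`."""
    changed = False
    for _, o_ids in other:
        linked = [row for row in target if row[1] & o_ids]
        if len(linked) > 1:            # several rows linked: fold them into one
            refs = set().union(*(r for r, _ in linked))
            ids = set().union(*(i for _, i in linked))
            target = [row for row in target if row not in linked] + [(refs, ids)]
            changed = True
    return target, changed

# loop back and forth until neither side folds anything, as in the function
while True:
    r2, c2 = fold(r2, r1)
    r1, c1 = fold(r1, r2)
    if not (c1 or c2):
        break

print(sorted((sorted(r), sorted(i)) for r, i in r1))
# → [(['a', 'c'], [1, 3, 4, 5]), (['b'], [2])]
```

After the loop, r1 holds {a,c} and r2 holds {d,f} for the same user group {1,3,4,5}, matching the expected result.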

Then return the final set with all details, and append "missing entries in the user table" like you added in the question.

I have streamlined the workflow, but DELETE and INSERT are comparatively expensive operations. Make sure you allow enough temporary buffers so that the two temp tables can operate in RAM exclusively. I added a bit of clever automatic setting to the top. But you may not need that if your temp_buffer setting is high enough anyway. And the automatic setting can't be used after you have accessed any temp tables in the same session. See:

The two temp tables are dropped at the end of the transaction automatically since created with ON COMMIT DROP.

Performance might be optimized some more with indexes on the temp tables and some more fine-tuning. In this case, it may also help to run ANALYZE (one or more times) on the temp tables. See:

About the ctid I used:

Quirk answered 30/5, 2023 at 4:1 Comment(2)
Thanks for the highly customised query! It seems to be running at least 5x faster than the graph-walk solution. A couple of questions that aren't really part of the initial question: if I want to add a ref3 later, what needs to be done? And could I store the r1 and r2 tables as regular tables to boost the performance of the lookup part of the query? For the graph query it seems I'm able to store the groups and just query those, which lets me select assets based on an asset group id in around 100ms on my current data set.Raul
Performance: nice. It's a shot in the dark. Can probably be optimized further depending on the complete setting. You can store anything you like, including r1 and r2. The usefulness depends on the complete picture. And you can restrict the query to selected assets, of course. Please ask any new questions in new questions. This one is quite epic already.Quirk

What you're showing is already a valid graph. With pgRouting, you can directly query it as such:

select chr(component::int)       as "the_group",
       array_agg(chr(node::int)) as "ids"
from pgr_connectedComponents(
  'SELECT id, 
          ascii(ref1) as source, 
          ascii(ref2) as target, 
          1 as cost, 
          1 as reverse_cost 
   FROM users') 
group by component;
-- the_group |    ids
-- ----------+-----------
-- a         | {a,c,d,f}
-- b         | {b,e}

ascii() and chr() are there to translate your 1-character refs to the integers expected by the extension and back: ascii('a')=97, chr(97)='a'. Tested on a combination of a few million users forming links of varying number and length, it takes seconds and negligible amounts of memory on a single CPU. Under the hood, it's a Boost C++ DFS.
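If your refs are longer than one character, ascii()/chr() no longer work; you would first map each distinct ref to a dense integer (e.g. with dense_rank() in SQL) and translate back afterwards. A minimal Python sketch of that round-trip idea (sample values assumed, "branch-42" is a made-up multi-character ref):

```python
# Stand-in for ascii()/chr() when refs are arbitrary strings:
# give every distinct ref a dense integer id, and keep the reverse map.
refs = ["a", "b", "c", "d", "e", "f", "branch-42"]

to_int = {ref: n for n, ref in enumerate(sorted(set(refs)), start=1)}
to_ref = {n: ref for ref, n in to_int.items()}

# round-trips the same way chr(ascii('a')) = 'a' does
print(to_int["branch-42"], to_ref[to_int["branch-42"]])
```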

db<>fiddle, db-fiddle and sqlfiddle don't offer pgrouting, which is why I couldn't attach an online demo, but here's the test data generator I tortured this with.


Once SQL:2023's SQL/PGQ makes its way into a PostgreSQL release, you can expect a built-in mechanism instead.

Doiron answered 3/6, 2023 at 20:36 Comment(1)
Thanks for giving me an off-the-shelf alternative!Raul
