This is hard (and typically very slow) to solve with only set-based operations in SQL. So I loop in a PL/pgSQL function:
CREATE OR REPLACE FUNCTION f_merged_users()
  RETURNS TABLE (
    ids         int[]   -- adapt to your actual data types !!!
  , usernames   text[]
  , refs1       text[]
  , refs2       text[]
  , asset_ids   int[]
  , asset_count int
  )
  LANGUAGE plpgsql AS
$func$
BEGIN
/*
-- set enough temp buffers, so temp tables don't spill to disk (optional)
-- NOTE: this can't be used after temporary tables have been accessed in the session
-- using 2x size of users table; adapt to your needs!
-- only fires when the new size is bigger, and it has not been set manually, yet
PERFORM set_config('temp_buffers', (pg_table_size('users') * 2 / 8192)::text, true)
FROM   pg_settings s
WHERE  s.name = 'temp_buffers'
AND    pg_table_size('users') * 2 / 8192 > s.setting::bigint
AND    s.source = 'default';
*/

   -- create two temp tables: one for ref1, one for ref2
   -- based on your information that ref1 & ref2 don't overlap
   CREATE TEMP TABLE r1 ON COMMIT DROP AS
   SELECT ARRAY[ref1] AS r, array_agg(id) AS i
   FROM  (TABLE users ORDER BY id) t
   GROUP  BY ref1
   ORDER  BY ref1;

   CREATE TEMP TABLE r2 ON COMMIT DROP AS
   SELECT ARRAY[ref2] AS r, array_agg(id) AS i
   FROM  (TABLE users ORDER BY id) t
   GROUP  BY ref2
   ORDER  BY ref2;

   -- fold rows where a common link in the other table exists
   -- achieve that by deleting rows and inserting the merged results
   -- loop back and forth until no more rows can be folded
   LOOP
      WITH d AS (
         DELETE FROM r2
         USING  r1
         WHERE  EXISTS (
            SELECT FROM r2 r0
            WHERE  r1.i && r2.i
            AND    r1.i && r0.i
            AND    r2.ctid <> r0.ctid
            )
         RETURNING r1.ctid, r2.*
         )
      INSERT INTO r2
      SELECT r.r, i.i
      FROM  (
         SELECT ctid, array_agg(ref ORDER BY ref) AS r
         FROM   d, unnest(d.r) ref
         GROUP  BY 1
         ) r
      JOIN  (
         SELECT ctid, array_agg(id ORDER BY id) AS i
         FROM   d, unnest(d.i) id
         GROUP  BY 1
         ) i USING (ctid);

      EXIT WHEN NOT FOUND;  -- no folding happened, stop loop

      WITH d AS (
         DELETE FROM r1
         USING  r2
         WHERE  EXISTS (
            SELECT FROM r1 r0
            WHERE  r2.i && r1.i
            AND    r2.i && r0.i
            AND    r1.ctid <> r0.ctid
            )
         RETURNING r2.ctid, r1.*
         )
      INSERT INTO r1
      SELECT r.r, i.i
      FROM  (
         SELECT ctid, array_agg(ref ORDER BY ref) AS r
         FROM   d, unnest(d.r) ref
         GROUP  BY 1
         ) r
      JOIN  (
         SELECT ctid, array_agg(id ORDER BY id) AS i
         FROM   d, unnest(d.i) id
         GROUP  BY 1
         ) i USING (ctid);

      EXIT WHEN NOT FOUND;  -- no folding happened, stop loop
   END LOOP;

   -- output result
   RETURN QUERY
   (
   SELECT i  -- AS ids
        , u.usernames
        , r1.r  -- AS refs1
        , r2.r  -- AS refs2
        , a.asset_ids
        , cardinality(a.asset_ids)  -- AS asset_count
   FROM   r1
   JOIN   r2 USING (i)
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT username
         FROM   users u
         WHERE  u.id = ANY (i)
         ORDER  BY u.id
         ) AS usernames
      ) u
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT a.id
         FROM   assets a
         WHERE  a.owner = ANY (r1.r || r2.r)
         ORDER  BY a.id
         ) AS asset_ids
      ) a
   ORDER  BY i
   )
   UNION ALL  -- add "missing entries in the user table"
   (
   SELECT null  -- AS ids
        , null  -- AS usernames
        , null  -- AS refs1
        , ARRAY[a.owner]  -- AS refs2
        , ARRAY[a.id]     -- AS asset_ids
        , 1               -- AS asset_count
   FROM   assets a
   WHERE  NOT EXISTS (
      SELECT FROM users u
      WHERE  a.owner IN (u.ref1, u.ref2)
      )
   ORDER  BY a.id
   );
END
$func$;
Call:
SELECT * FROM f_merged_users();
See the inline comments for explanation.
The general idea is to avoid walking potentially lengthy graphs one step at a time, and to do as much set-based work as possible. The best solution depends heavily on the actual data distribution.
Based on your information that ref1 and ref2 don't overlap, I create two temp tables: one for ref1, one for ref2. The initial queries aggregate all users on the same resource. That takes care of many users on the same resource (same person) right away. You commented that you have a lot of those.
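To make that first step concrete, here is a small Python sketch (purely an in-memory illustration, not part of the SQL above; `group_users` is a made-up name): every user row with the same ref value collapses into a single `(refs, ids)` row, matching the shape of the temp tables.

```python
from collections import defaultdict

def group_users(users, key):
    """users: iterable of (id, ref1, ref2) tuples; key: 1 for ref1, 2 for ref2.
    Returns rows shaped like the temp tables: ([ref], [id, ...])."""
    groups = defaultdict(list)
    for row in sorted(users):              # ORDER BY id, like the temp-table queries
        groups[row[key]].append(row[0])
    return [([ref], ids) for ref, ids in sorted(groups.items())]
```

Two users sharing `ref1 = 'a'` land in one row immediately, before the loop even starts.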
In every iteration of the loop, I merge all rows in the other temp table that are linked together by the same row in this temp table, and vice versa. The two queries are completely symmetrical. (It may be more efficient to start with one or the other.) We are done as soon as one query does not find any more rows to merge.
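The back-and-forth folding can be sketched as a plain in-memory fixpoint loop. This Python sketch only illustrates the idea behind the SQL (names like `fold_once` are invented here); rows are `(refs, ids)` pairs, and rows of one table that share user ids with the same row of the other table are merged until nothing changes.

```python
def fold_once(a, b):
    """One pass: merge all rows of b whose id sets overlap the same row of a.
    Returns (new_b, changed)."""
    b = list(b)
    changed = False
    for _, ids_a in a:
        hit = [j for j, (_, ids_b) in enumerate(b) if ids_b & ids_a]
        if len(hit) > 1:
            merged = (set().union(*(b[j][0] for j in hit)),
                      set().union(*(b[j][1] for j in hit)))
            b = [row for j, row in enumerate(b) if j not in hit]
            b.append(merged)
            changed = True
    return b, changed

def fold(r1, r2):
    """Alternate folding r2 via r1 and r1 via r2 until a pass changes nothing,
    mirroring the LOOP with its two EXIT WHEN NOT FOUND checks."""
    r1 = [(set(r), set(i)) for r, i in r1]
    r2 = [(set(r), set(i)) for r, i in r2]
    while True:
        r2, changed = fold_once(r1, r2)
        if not changed:
            break
        r1, changed = fold_once(r2, r1)
        if not changed:
            break
    return r1, r2
```

For the chain (a,x), (b,x), (b,y), (c,y), (c,z) mentioned in the comments, both tables collapse into a single row covering all five users.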
Then I return the final set with all details, and append the "missing entries in the user table", as you added in the question.
I have streamlined the workflow, but DELETE and INSERT are comparatively expensive operations. Make sure you allow enough temporary buffers so that the two temp tables can operate in RAM exclusively. I added a bit of clever automatic setting at the top, but you may not need that if your temp_buffers setting is high enough anyway. Note that the automatic setting can't be used after you have accessed any temp tables in the same session.
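For a rough sense of the numbers behind the commented-out hint: temp_buffers is measured in 8 kB (8192-byte) blocks, so holding roughly twice the users table needs 2 * table_size / 8192 blocks. A trivial sketch (`temp_buffers_blocks` is a made-up helper name):

```python
def temp_buffers_blocks(table_size_bytes, factor=2, block_size=8192):
    """Number of 8 kB temp_buffers blocks needed to hold factor x the table size."""
    return factor * table_size_bytes // block_size
```

For a 100 MB users table that comes to 25600 blocks, i.e. about 200 MB of temporary buffers.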
The two temp tables are dropped at the end of the transaction automatically, since they are created with ON COMMIT DROP.
Performance might be optimized some more with indexes on the temp tables and some more fine-tuning. In this case, it may also help to run ANALYZE (one or more times) on the temp tables.
The queries identify rows by their ctid, the system column holding the physical row location, which serves as an ad-hoc row identifier for the temp tables here.
Do ref1 and ref2 overlap? Or are these references to distinct sets, like the sample data seems to suggest? Either way, this could form a chain through all rows in the users table. Think of (a,x), (b,x), (b,y), (c,y), (c,z) etc. Such graph data cannot easily be resolved with set-based operations. You would start by defining exactly what can and cannot be in those columns ... – Quirk
ref1 and ref2 will in theory never overlap. A reference defined in ref1 will not be found in ref2, and if there is a "collision" it should not be counted as one. – Raul