How to efficiently set subtract a join table in PostgreSQL?

I have the following tables:

  • work_units - self explanatory
  • workers - self explanatory
  • skills - every work unit requires a number of skills if you want to work on it. Every worker is proficient in a number of skills.
  • work_units_skills - join table
  • workers_skills - join table
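
For reference, a minimal sketch of the schema implied above (column names are inferred from the queries below; the real tables may differ):

CREATE TABLE workers    (id bigserial PRIMARY KEY);
CREATE TABLE work_units (id bigserial PRIMARY KEY);
CREATE TABLE skills     (id bigserial PRIMARY KEY);

CREATE TABLE work_units_skills (
    work_unit_id bigint NOT NULL REFERENCES work_units (id),
    skill_id     bigint NOT NULL REFERENCES skills (id),
    PRIMARY KEY (work_unit_id, skill_id)
);

CREATE TABLE workers_skills (
    worker_id bigint NOT NULL REFERENCES workers (id),
    skill_id  bigint NOT NULL REFERENCES skills (id),
    PRIMARY KEY (worker_id, skill_id)
);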

A worker can request that the next appropriate, free, highest-priority (whatever that means) unit of work be assigned to her.


Currently I have:

SELECT work_units.*
FROM work_units
-- some joins
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills
        WHERE work_unit_id = work_units.id

        EXCEPT

        SELECT skill_id
        FROM workers_skills
        WHERE worker_id = 1 -- the worker id that made the request
      )
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

This condition makes the query 8-10 times slower though.

Is there a better way to express that a work_unit's skills should be a subset of the worker's skills, or something else to improve the current query?


Some more context:

  • The skills table is fairly small.
  • Both work_units and workers tend to have very few associated skills.
  • work_units_skills has index on work_unit_id.
  • I tried moving the query on workers_skills into a CTE (sketched below). This gave a slight improvement (10-15%), but it's still too slow.
  • A work unit with no skills can be picked up by any worker; that is, an empty set is a subset of every set.
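
The CTE variant was, roughly (a sketch, not the exact query):

WITH worker_skills AS (
    SELECT skill_id
    FROM workers_skills
    WHERE worker_id = 1 -- the worker id that made the request
)
SELECT work_units.*
FROM work_units
-- some joins
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills
        WHERE work_unit_id = work_units.id

        EXCEPT

        SELECT skill_id
        FROM worker_skills
      )
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;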
Reactor asked 22/11, 2017 at 17:38 Comment(5)
I think the devil could be in the missing details behind the comments (like ORDER BY something complex, the bunch of other conditions, etc.). So, if you could post the EXPLAIN, it might be helpful. – Allergen
@KaushikNayak, I tried removing individual conditions and using something simpler for the ordering. The query was still that much slower with this condition, so it's not a combination of it and some other condition. Maybe this plus two or more other conditions, but it's unlikely. Unfortunately I can't post the EXPLAIN as the project is private, but I can answer your questions if you have any hunches. – Reactor
Please edit your question and add the execution plan generated using explain (analyze, verbose, buffers). Formatted text please, no screenshots. If you don't want to (or can't) share table names, upload it to explain.depesz.com and enable the option to obfuscate the plan (although an execution plan seldom reveals anything confidential). – Deferral
Two questions: 1. May I suggest some changes to your database design, for example adding one or two extra fields? 2. Which DBMS are you working with? – Faceoff
@g.Irani, for the former: you can. In fact I was thinking of two such solutions with an extra column, one involving bitmasks, the other hashes. Both were dramatically faster (by ~60% on dump benchmarks), but still didn't seem fast enough. For the latter: as per the tag, Postgres. – Reactor

One simple speed-up would be to use EXCEPT ALL instead of EXCEPT. The latter removes duplicates, which is unnecessary here and can be slow.
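
For illustration, a tiny standalone example of the difference:

SELECT x FROM (VALUES (1), (1), (2)) AS a(x)
EXCEPT ALL
SELECT x FROM (VALUES (2)) AS b(x);
-- returns 1 twice; plain EXCEPT would collapse it to a single 1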

An alternative that would probably be faster is to use a further NOT EXISTS instead of the EXCEPT:

...
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills wus
        WHERE work_unit_id = work_units.id
        AND NOT EXISTS (
            SELECT skill_id
            FROM workers_skills ws
            WHERE worker_id = 1 -- the worker id that made the request
              AND ws.skill_id = wus.skill_id
        )
      )

Demo

http://rextester.com/AGEIS52439 - with the LIMIT removed for testing

Charleycharlie answered 27/11, 2017 at 9:10 Comment(1)
Nice one. Both changes are simple and obvious, and together they lead to a ~50% boost. Still, I will experiment more with other solutions, as this is still not fast enough; probably I will use it in combination with something else. – Reactor

(see UPDATE below)

This query finds a good work_unit using a simple LEFT JOIN to look for missing skills in the (short) list of skills the requesting worker has. The trick: whenever a skill is missing, the join produces a NULL, which the CASE expression translates to a 1; the work_unit is then removed, leaving only the ones with all 0 values, i.e. having a max of 0.

Being classic SQL, this is the kind of query that engines optimise most heavily:

SELECT work_unit_id
FROM
  work_units_skills s
LEFT JOIN
  (SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY work_unit_id
HAVING max(CASE WHEN t.skill_id IS NULL THEN 1 ELSE 0 END)=0
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

UPDATE

In order to catch work_units with no skills, we throw the work_units table into the JOIN:

SELECT r.id AS work_unit_id
FROM
  work_units r
LEFT JOIN
  work_units_skills s ON (r.id=s.work_unit_id)
LEFT JOIN
  (SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY r.id
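-- keep units with no skills at all (the LEFT JOIN leaves one row with NULL s.skill_id),
-- or units whose every skill is matched by the worker: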
HAVING bool_or(s.skill_id IS NULL) OR bool_and(t.skill_id IS NOT NULL)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;
Probe answered 27/11, 2017 at 10:17 Comment(5)
COUNT(t.skill_id) = COUNT(s.skill_id) also results in the same logic. As noticed by @RadimBača, this requires a slight modification to allow a work_unit with no skills to be considered valid (predicated on an empty set being a subset of every other set). – Leroi
Nice one as well. This is about 6-8 times faster than my original query, which is in the acceptable range. However, it excludes work units without any skills. I couldn't think of a way to include them without an OR, which made it way slower (only 2 times faster than the original query). If you can think of a way to include documents without skills while keeping close to the current performance, this may be it. – Reactor
@ndn Other than the HAVING clause this is the same as the sub-query in the first part of my answer below (functionally they're identical, just different algebra for counting missing skills). By first joining on the work_units table this can be made to work for work units with no skills; see the third query in my answer. – Leroi
@ndn Try joining the sub-query to your documents table, rather than using IN(), and you may improve performance. (Also, where did the documents table come into this? It's the first time you've ever mentioned it ;) ) – Leroi
@ndn [I've updated the answer with a query which catches unskilled work units.] The reason the query misses work units without skills is probably that these do not appear in the work_units_skills table. To fix this, when creating the work_units_skills table, do a LEFT JOIN; this will leave a row for each work_unit, and if there are no skills it will have a NULL in skill_id. Alternatively, add the work_units table to the JOIN query rather than doing an OR or UNION. But expanding the work_units_skills table seems the better option. – Probe

You may use the following query:

SELECT wu.*
FROM work_units wu
LEFT JOIN work_units_skills wus ON wus.work_unit_id = wu.id and wus.skill_id IN (
    SELECT id
    FROM skills
    EXCEPT
    SELECT skill_id
    FROM workers_skills
    WHERE worker_id = 1 -- the worker id that made the request
)
WHERE wus.work_unit_id IS NULL;  

demo (thanks to Steve Chambers for most of the data)

You should definitely have indexes on work_units_skills(skill_id), workers_skills(worker_id) and work_units(id). If you want to speed it up even more, create the covering indexes work_units_skills(skill_id, work_unit_id) and workers_skills(worker_id, skill_id), which avoid accessing the tables themselves.

The subquery is independent, and the outer join should be relatively fast if its result is not large.

Clericals answered 27/11, 2017 at 8:23 Comment(0)

Bit-Mask Solution
Without any changes to your previous database design, just add two fields.
First: a long or bigint (depending on your DBMS) on workers.
Second: another long or bigint on work_units.

These fields encode the skills of each work_unit and of each worker as bit masks. For example, suppose you have 8 records in the skills table (recall that the skills table is small):
1- some skill 1
2- some skill 2
...
8- some skill 8

Then, if we want to assign skills 1, 3, 6 and 7 to one work_unit, we just use the number 01100101.
(I suggest numbering the bits from the least significant end, so that additional skills can be supported in the future.)

In practice you store the base-10 value in the database (101 instead of binary 01100101).
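(Check: skills 1, 3, 6 and 7 set the bits 2^0 + 2^2 + 2^5 + 2^6 = 1 + 4 + 32 + 64 = 101.)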

A similar number can be generated for each worker: a worker chooses some skills, so we turn the selected items into a number and save it in the additional field on the workers table.

Finally, to find the appropriate work_units for any worker, JUST select from work_units and use bitwise AND as below, where:
A: the new field of the specific worker (the skills of the worker we are currently finding work_units for)
B: the new field of work_units (the skills each work_unit requires)

select * from work_units
where A & B  = B
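
A concrete sketch of this in PostgreSQL, assuming both new columns are named skills_mask (a hypothetical name):

-- hypothetical columns: workers.skills_mask and work_units.skills_mask (bigint)
SELECT wu.*
FROM work_units wu
JOIN workers w ON w.id = 1 -- the worker that made the request
WHERE w.skills_mask & wu.skills_mask = wu.skills_mask;

A work_unit whose mask is 0 passes this test for every worker, which matches the empty-set rule from the question.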

Notice:
1: Absolutely, this is the fastest way, but it has some difficulties.
2: There is some extra work whenever a skill is added or deleted, but that is the trade-off; adding or deleting skills happens rarely.
3: We should still keep the skills, work_units_skills and workers_skills tables; only the search uses the new fields.


This approach can also be used for tag-management systems, like Stack Overflow's tags.

Faceoff answered 27/11, 2017 at 10:3 Comment(0)

With Postgres, relational division can often be expressed more efficiently using arrays.

In your case I think the following will do what you want:

select *
from work_units
where id in (select work_unit_id
             from work_units_skills
             group by work_unit_id
             having array_agg(skill_id) <@ array(select skill_id 
                                                 from workers_skills 
                                                 where worker_id = 6))
and ... other conditions here ...
order by ...

array_agg(skill_id) collects all skill_ids for each work_unit and compares them with the skills of the specific worker using the <@ operator ("is contained by"). The condition returns all work_unit_ids whose list of skill_ids is contained in the skills of the single worker.
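
If the operator is unfamiliar, a quick standalone illustration:

SELECT ARRAY[1,3] <@ ARRAY[1,2,3]; -- true: every element on the left appears on the right
SELECT ARRAY[1,4] <@ ARRAY[1,2,3]; -- false: 4 is missing on the right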

In my experience this approach is usually faster than equivalent EXISTS or INTERSECT solutions.

Online example: http://rextester.com/WUPA82849

Deferral answered 4/12, 2017 at 22:13 Comment(0)

With the current info I can only answer on a hunch. Try removing the EXCEPT statement and see if the query gets significantly faster. If it does, you can add that logic back using WHERE conditions instead. In my experience, set operators (MINUS/EXCEPT, UNION, INTERSECT) are quite the performance killers.

Orison answered 27/11, 2017 at 8:23 Comment(0)

The correlated sub-query is punishing you, especially with the additional use of EXCEPT.

To paraphrase your query: you're only interested in a work_unit_id when the specified worker has ALL of that work_unit's skills? (If a work_unit has a skill associated with it, but the specified worker doesn't have that skill, exclude that work_unit?)

This can be achieved with JOINs and GROUP BY, with no need for correlation at all.

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
INNER JOIN
(
    SELECT
        wus.work_unit_id
    FROM
        work_unit_skills   wus
    LEFT JOIN
        workers_skills     ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    GROUP BY
        wus.work_unit_id
    HAVING
        COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
     applicable_work_units
         ON  applicable_work_units.work_unit_id = work_units.id
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1

The sub-query compares a worker's skill set to each work unit's skill set. If the work unit has any skill that the worker doesn't, then ws.skill_id will be NULL for that row; since NULLs are ignored by COUNT(), COUNT(ws.skill_id) will then be lower than COUNT(wus.skill_id), and that work_unit is excluded from the sub-query's results.
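
As a standalone reminder of the NULL behaviour relied on here:

SELECT COUNT(x) FROM (VALUES (1), (NULL), (3)) AS t(x); -- 2: the NULL is not counted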

This assumes that the workers_skills table is unique over (worker_id, skill_id) and that the work_unit_skills table is unique over (work_unit_id, skill_id). If that's not the case then you may want to tinker with the HAVING clause (such as COUNT(DISTINCT wus.skill_id), etc.).


EDIT:

The above query assumes that only a relatively low number of work units would match the criteria for a specific worker.

If you assume that a relatively large number of work units would match, the opposite logic would be faster.

(Essentially, try to make the number of rows returned by the sub-query as low as possible.)

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
LEFT JOIN
(
    SELECT
        wus.work_unit_id
    FROM
        work_unit_skills   wus
    LEFT JOIN
        workers_skills     ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    WHERE
        ws.skill_id IS NULL
    GROUP BY
        wus.work_unit_id
)
     excluded_work_units
         ON  excluded_work_units.work_unit_id = work_units.id
WHERE
    excluded_work_units.work_unit_id IS NULL
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1

This one compares all work unit skills with those of the worker, and only keeps rows where the work unit has skills that the worker does not have.

Then, GROUP BY the work unit to get a list of work units that need to be ignored.

By LEFT joining these on to your existing results, you can stipulate you only want to include a work unit if it doesn't appear in the sub-query by specifying excluded_work_units.work_unit_id IS NULL.

Useful online guides will refer to anti-join and anti-semi-join.


EDIT:

In general I would recommend against the use of a bit-mask.

Not because it's slow, but because it defies normalisation. A single field representing multiple items of data is a general sql-code-smell / sql-anti-pattern, as the data is no longer atomic. (This leads to pain down the road, especially if you reach a world where you have so many skills that they no longer all fit into the data type chosen for the bit-mask, or when it comes to managing frequent or complex changes to the skill sets.)

That said, if performance continues to be an issue, de-normalisation is often a very useful option. I'd recommend keeping the bit masks in separate tables to make it clear that they're de-normalised / cached calculation results. In general though, such options should be a last resort rather than a first reaction.
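
For instance, a sketch of such a separate table (names are hypothetical; you'd want one such table per masked entity):

-- hypothetical cache table for the de-normalised masks
CREATE TABLE work_units_skills_mask (
    work_unit_id bigint PRIMARY KEY REFERENCES work_units (id),
    skills_mask  bigint NOT NULL -- recompute whenever work_units_skills changes
);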


EDIT: Example revisions to always include work_units that have no skills...

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
INNER JOIN
(
    SELECT
        w.id   AS work_unit_id
    FROM
        work_units          w
    LEFT JOIN
        work_units_skills   wus
            ON wus.work_unit_id = w.id
    LEFT JOIN
        workers_skills      ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    GROUP BY
        w.id
    HAVING
        COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
     applicable_work_units
         ON  applicable_work_units.work_unit_id = work_units.id

The excluded_work_units version of the code (the second example query above) should work without need for modification for this corner case (and is the one I'd initially trial for live performance metrics).

Leroi answered 27/11, 2017 at 19:8 Comment(11)
I think you miss the work_units that do not have skills assigned: dbfiddle.uk/… – Triceps
Indeed. Is that something the OP has stated as a requirement? Is an empty set considered a subset of a non-empty set? It's been a long time since uni ;) Regardless, it's a simple change if that's what the OP requires... – Leroi
Amendments made, but where does it show that his query returns these? – Leroi
Bit-masks defy normalization and are an SQL-code-smell / SQL-anti-pattern; that is true. But when our data is BIG, we can use some non-normalized strategies to reach the desired performance. All Big Data technologies defy normalization. We should control this de-normalization through programming, not only through database concepts and features. – Faceoff
@Radim Bača: I absolutely did not downvote this answer; I would never do something like that. (To prove it, I can downvote this answer whenever you want and upvote it again an hour later.) This answer is elegant and I am trying to learn something from this question and its answers. – Faceoff
I would be interested to know who downvoted and why. But then, in my years on SO, I fear I've seen troll votes becoming more and more common. – Leroi
@g.irani thanks for quoting only the part you take exception to. I would also suggest you could have quoted my repeated caveats, such as in general and That said, if performance continues to be an issue, de-normalisation is often a very useful option. The small part of the schema shown by the OP suggests something more akin to Inmon than Kimball, certainly a strongly relational / structured schema. In that context I strongly stand by my position: In general though, such options should be a last resort rather than a first reaction. (Note that even that has an in general caveat.) – Leroi
@g.irani And be careful of categorical statements like All Big Data Technologies defy normalization. They're simply not true. For example, QlikView utilises a highly normalised data-lake model in order to minimise memory footprint. – Leroi
@MatBailie: You are right, I am so sorry. (Some Big Data technologies defy normalization.) – Faceoff
@MatBailie, an empty set is considered a subset of every set; I updated the question. Indeed, the join tables are unique on the combination of skill_id and (work_unit|worker)_id. I will test your solution; I see no reason why it was downvoted. – Reactor
@ndn - As per the details in the answer: if fewer than 50% of work_units tend to be excluded by this check, the second example query would likely be the fastest of my suggestions. Conversely, if more than 50% of work_units tend to be excluded, the third example query would likely be the fastest of my suggestions. – Leroi

You can get the work units covered by a worker's skills in an aggregation, as has been shown already. You'd then typically use IN on this set of work units.

SELECT wu.*
FROM work_units wu
-- some joins
WHERE wu.id IN
(
  SELECT wus.work_unit_id
  FROM work_units_skills wus
  LEFT JOIN workers_skills ws ON ws.skill_id = wus.skill_id AND ws.worker_id = 1
  GROUP BY wus.work_unit_id
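  -- unmatched skills leave ws.skill_id NULL, which COUNT() ignores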
  HAVING COUNT(*) = COUNT(ws.skill_id)
)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

When it comes to speeding up queries, though, the main lever is often providing the appropriate indexes. (With a perfect optimizer, rewriting a query to get the same result would have no effect at all, because the optimizer would arrive at the same execution plan.)

You want the following indexes (order of the columns matters):

create index idx_ws on workers_skills (worker_id, skill_id);
create index idx_wus on work_units_skills (skill_id, work_unit_id);

(Read it like this: we come with a worker_id, get the skill_ids for that worker, join work units on these skill_ids, and thus get the work_unit_ids.)

Aeolian answered 28/11, 2017 at 12:27 Comment(2)
As per other similar answers, this needs a minor tweak to accommodate work units that have no skills associated with them. Basically, it's the same premise as I've used in my first suggestion, and @DanGetz used in his. – Leroi
I'd also suggest that the second index should be reversed. As it stands the skills can be joined optimally, but a sort is needed for the aggregation. When reversed (to work_unit_id, skill_id) the data is already sorted ahead of the aggregation, and the right-hand table, having been reduced to one worker, is already exceptionally small so would easily be kept in memory. – Leroi

This might not apply to you, but I had a similar issue that I solved by simply merging the main and sub values into the same column, using numbers for main and letters for sub.

By the way, are all columns involved in the joins indexed? My server goes from a 2-3 second query on 500k+ row tables to crashing on 10k row tables if I forget.

Ordinal answered 1/12, 2017 at 9:41 Comment(0)
