The correlated sub-query is punishing you, especially with the additional use of EXCEPT.
To paraphrase your query, you're only interested in a work_unit_id
when a specified worker has ALL of that work_unit's skills? (If a work_unit has a skill associated with it, but the specified user doesn't have that skill, exclude that work_unit?)
This can be achieve with JOINs and GROUP BY, and no need for correlation at all.
SELECT
work_units.*
FROM
work_units
--
-- some joins
--
INNER JOIN
(
SELECT
wus.work_unit_id
FROM
work_unit_skills wus
LEFT JOIN
workers_skills ws
ON ws.skill_id = wus.skill_id
AND ws.worker_id = 1
GROUP BY
wus.work_unit_id
HAVING
COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
applicable_work_units
ON applicable_work_units.work_unit_id = work_units.id
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
The sub-query compares a worker's skill set to each work unit's skill set. If there are any skills the work unit has that the worker doesn't then ws.skill_id
will be NULL
for that row, and as NULL
is ignored by COUNT()
this means that COUNT(ws.skill_id)
will then be lower than COUNT(wus.skill_id)
, and so that work_unit
would become excluded from the sub-query's results.
This assumes that the workers_skills
table is unique over (work_id, skill_id)
and that the work_unit_skills
table is unique over (work_unit_id, skill_id)
. If that's not the case then you may want to tinker with the HAVING
clause (such as COUNT(DISTINT wus.skill_id)
, etc).
EDIT:
The above query assumes that only relatively low number of work units would match the criteria of matching a specific worker.
If you assume that a relatively large number of work units would match, the opposite logic would be faster.
(Essentially, try to make the number of rows returns by the sub-query as low as possible.)
SELECT
work_units.*
FROM
work_units
--
-- some joins
--
LEFT JOIN
(
SELECT
wus.work_unit_id
FROM
work_unit_skills wus
LEFT JOIN
workers_skills ws
ON ws.skill_id = wus.skill_id
AND ws.worker_id = 1
WHERE
ws.skill_id IS NULL
GROUP BY
wus.work_unit_id
)
excluded_work_units
ON excluded_work_units.work_unit_id = work_units.id
WHERE
excluded_work_units.work_unit_id IS NULL
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
This one compares all work unit skills with those of the worker, and only keeps rows where the work unit has skills that the worker does not have.
Then, GROUP BY
the work unit to get a list of work units that need to be ignored.
By LEFT
joining these on to your existing results, you can stipulate you only want to include a work unit if it doesn't appear in the sub-query by specifying excluded_work_units.work_unit_id IS NULL
.
Useful online guides will refer to anti-join
and anti-semi-join
.
EDIT:
In general I would recommend against the use of a bit-mask.
Not because it's slow, but because it defies normalisation. The existence of a single field representing multiple items of data is a general sql-code-smell / sql-anti-pattern, as the data is no longer atomic. (This leads to pain down the road, especially if you reach a world where you have so many skills that they no longer all fit in to the data type chosen for the bit-mask, or when it comes to managing frequent or complex changes to the skill sets.)
That said, if performance continues to be an issue, de-normalisation is often a very useful option. I'd recommend keeping the bit masks in separate tables to make it clear that they're de-normalised / cached calcualtion results. In general though, such options should be a last resort rather than a first reaction.
EDIT: Example revisions to always include work_units that have no skills...
SELECT
work_units.*
FROM
work_units
--
-- some joins
--
INNER JOIN
(
SELECT
w.id AS work_unit_id
FROM
work_units w
LEFT JOIN
work_units_skills wus
ON wus.work_unit_id = w.id
LEFT JOIN
workers_skills ws
ON ws.skill_id = wus.skill_id
AND ws.worker_id = 1
GROUP BY
w.id
HAVING
COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
applicable_work_units
ON applicable_work_units.work_unit_id = work_units.id
The excluded_work_units
version of the code (the second example query above) should work without need for modification for this corner case (and is the one I'd initially trial for live performance metrics).
ORDER BY something complex bunch of conditions
etc). So, If you could post theEXPLAIN
, it might be helpful – AllergenEXPLAIN
as the project is private, but I can answer your questions if you have any hunches. – Reactorexplain (analyze, verbose, buffers)
. Formatted text please, no screen shots. If you don't want to (or can't) share table names, upload it to explain.depesz.com and enable the option to obfuscate the plan (although an execution plan seldomly reveals anything that is confidential) – Deferral