Efficient way to select all values from one column not in another column
Asked Answered
T

3

16

I need to return all values from colA that are not in colB from mytable. I am using:

SELECT DISTINCT(colA) FROM mytable WHERE colA NOT IN (SELECT colB FROM mytable)

It is working however the query is taking an excessively long time to complete.

Is there a more efficient way to do this?

Threw answered 5/1, 2012 at 3:3 Comment(2)
NOT IN slows down as the set size grows, and there's often a limit to how many rows can be in the NOT IN clause. Outside of small result sets I've found it better to use other means to get the difference between two result sets.Ramonaramonda
When talking about performance, you must name your RDBMS or get suboptimal answers.Millhon
M
28

In standard SQL there are no parentheses in DISTINCT colA. DISTINCT is not a function.

SELECT DISTINCT colA
FROM   mytable
WHERE  colA NOT IN (SELECT DISTINCT colB FROM mytable);

Added DISTINCT to the sub-select as well. For many duplicates it can speed up the query. But, typically, there are more efficient query techniques than NOT IN for this:

The deciding factor for speed will be indexes. You need indexes on colA and colB for this query to be fast.

A CTE might be faster, depending on your DBMS. I additionally demonstrate LEFT JOIN as alternative to exclude the values in valB, and an alternative way to get distinct values with GROUP BY:

WITH x AS (SELECT colB FROM mytable GROUP BY colB)
SELECT m.colA
FROM   mytable m
LEFT   JOIN x ON x.colB = m.colA
WHERE  x.colB IS NULL
GROUP  BY m.colA;

Or, simplified further, and with a plain subquery (probably fastest):

SELECT DISTINCT m.colA
FROM   mytable m
LEFT   JOIN mytable x ON x.colB = m.colA
WHERE  x.colB IS NULL;
Millhon answered 5/1, 2012 at 3:7 Comment(6)
Thanks, I have tried: SELECT DISTINCT m1.colA FROM mytable m1 LEFT JOIN mytable m2 ON (m1.colA = m2.colB) WHERE m2.colA IS NULL ORDER BY m1.colA ASC and it is several orders of magnitude faster and seems to be working - is this equivalent to the code I posted in the question? It's so much quicker that I'm suspicious I might have missed something.Threw
@Andrew: Sorry, you got a typo in your query. Must be WHERE m2.colB IS NULL. The (corrected) query might be faster with LEFT JOIN (SELECT DISTINCT colB FROM mytable) m2 ON m2.colB = m1.colA if there are many duplicate values for colB.Millhon
@Andrew: m2.colA is always NULL if m2.colB is NULL here, but m2.colA can be NULL even if m2.colB is not. So the correct (and faster!) form here is: WHERE m2.colB IS NULL. If colA is defined NOT NULL, then your above query is correct.Millhon
Thanks, the queries are giving me the same number of results I suppose because there are no NULL values. Do you know why it is so much faster?Threw
@Andrew: Different query plans perform differently. I can't say much without knowing which RDBMS you are using. You might want to edit that information into your question. Information about indexes would be instrumental, too.Millhon
Ok, many thanks. I made the DISTINCT change you suggested and it now completes in a few seconds.Threw
G
6

You can use exists:

select distinct
    colA
from
    mytable m1
where
    not exists (select 1 from mytable m2 where m2.colB = m1.colA)

exists does a semi-join to quickly match the values. not in completes the entire result set and then does an or on it. exists is typically faster for values in tables.

Gev answered 5/1, 2012 at 3:5 Comment(1)
@Andrew - Sure! It says, grab the distinct colAs where there's no row from mytable that colB is equal to that colA.Gev
R
0

You can use the EXCEPT operator which effectively diffs two SELECT queries. EXCEPT DISTINCT will return only unique values. Oracle's MINUS operator is equivalent to EXCEPT DISTINCT.

Ramonaramonda answered 5/1, 2012 at 3:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.