NOT IN query... odd results
Asked Answered
G

4

3

I need a list of users in one database that are not listed as the new_user_id in another. There are 112,815 matching users in both databases; user_id is the key in all queries tables.


Query #1 works, and gives me 111,327 users who are NOT referenced as a new_user_Id. But it requires querying the same data twice.

-- 111,327 GSU users are NOT listed as a CSS new user 
--   1,488 GSU users ARE listed as a new user in CSS
--
select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id, cud.new_user_id, cud.user_type_code
        from   css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id
where  gup.user_id not in (select cud.new_user_id 
                           from   css.user_desc cud
                           where  cud.new_user_id is not null);


Query #2 would be perfect... and I'm actually surprised that it's syntactically accepted. But it gives me a result that makes no sense.

-- This gives me 1,505 users... I've checked, and they are not
-- referenced as new_user_ids in CSS, but I don't know why the ones 
-- that were excluded were excluded.
--
-- Where are the missing 109,822, and whatexcluded them?
-- 
select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id, cud.new_user_id, cud.user_type_code
        from   css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id
where  gup.user_id not in (cudsubq.new_user_id);


What exactly is the where clause in the second query doing, and why is it excluding 109,822 records from the results?


Note The above query is a simplification of what I'm really after. There are other/better ways to do the above queries... they're just representative of the part of the query that's giving me problems.

Gabble answered 19/10, 2012 at 21:15 Comment(4)
Do both tables have nulls in the columns that is used for joining? I guess yes, from the results you have.Pleadings
No, they are joined on the primary key.Gabble
@ypercube I think the problem is the use of the NOT IN, cudsubq.new_user_id can be NULL. Read my answer to the question.Chin
so you can't change the code to use MINUS?Mantissa
C
4

Read this: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::NO::P11_QUESTION_ID:442029737684

For what I understand, your cudsubq.new_user_id can be NULL even though both tables are joined by user_id, so, you won't get results using the NOT IN operator when the subset contains NULL values . Consider the example in the article:

select * from dual where dummy not in ( NULL )

This returns no records. Try using the NOT EXISTS operator or just another kind of join. Here is a good source: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html

And what you need is the fourth example:

SELECT COUNT(descr.user_id)
FROM 
    user_profile prof
    LEFT OUTER JOIN user_desc descr
        ON prof.user_id = descr.user_id 
WHERE descr.new_user_id IS NULL
    OR descr.new_user_id != prof.user_id
Chin answered 29/10, 2012 at 18:6 Comment(0)
A
1

Second query is semantically different. In this case

where  gup.user_id not in (cudsubq.new_user_id)

cudsubq.new_user_id is treated as expression (doc: IN condition), not as a subquery, thus the whole clause is basically equivalent to

where  gup.user_id != cudsubq.new_user_id

So, in your first query, you're literally asking "show me all users in GUP, who also have entries in CSS and their GUP.ID is not matching ANY NOT NULL NEW_ID in CSS ".

However, the second query is "show me all users in GUP, who also have entries in CSS and their GUP.ID is not equal to their RESPECTIVE NULLABLE (no is not null clause, remember?) CSS.NEW_ID value".

And any (not) in (or equality/inequality) checks with nulls don't actually work.

12:07:54 SYSTEM@oars_sandbox> select * from dual where 1 not in (null, 2, 3, 4);

no rows selected                                                   

Elapsed: 00:00:00.00          

This is where you lose your rows. I would probably rewrite your second query's where clause as where cudsubq.new_user_id is null, assuming that non-matching users have null new_user_id.

Anderlecht answered 30/10, 2012 at 4:10 Comment(0)
T
0

Your second select compares gup.user_id with cud.new_user_id on current joining record. You can rewrite the query to get the same result

select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id, cud.new_user_id, cud.user_type_code
        from   css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id
where  gup.user_id != cud.new_user_id or cud.new_user_id is null;

You mentioned you compare list of user in one database with a list of users in another. So you need to query data twice and you don't query the same data. Maybe you can use "minus" operator to avoid using "in"

select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id from css.user_desc cud
        minus
        select cud.new_user_id from css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id;
Toft answered 24/10, 2012 at 4:32 Comment(2)
That makes sense; I got thrown because when I inverted the query, the numbers didn't add up... looking at them again, it's definitely the nulls that are throwing off the comparisons. As for the second query you have, are there any execution advantages over the first query I wrote? I'm trying to avoid querying the same data twice. The real query I'm working on takes 15-20 minutes to run... adding a not-in select to filter against the same data doubles the execution time : (Gabble
"NOT IN" is often expensive operation, because optimizer cannot use indexes. Server must for every record evaluate not in clause and its subselect. Minus clause in this case allows you to prepare user ids in one step and combine them with records from user_profile. I expect it will give better result. Of course it depends of your database schema. To analyze select execution try EXPLAIN PLAN command.Toft
R
0

You want new_user_id's from table gup that don't match any new_user_id on table cud, right? It sounds like a job for a left join:

SELECT count(gup.user_id)
    FROM gsu.user_profile gup LEFT JOIN css.user_desc cud
        ON gup.user_id = cud.new_user_id
    WHERE cud.new_user_id is NULL

The join keeps all rows of gup, matching them with a new_user_id if possible. The WHERE condition keeps only the rows that have no matching row in cud.

(Apologies if you know this already and you're only interested in the behavior of the not in query)

Roose answered 29/10, 2012 at 21:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.