Postgresql huge performance difference when using IN vs NOT IN
Asked Answered
S

1

6

I have 2 tables, "transaksi" and "buku". "transaksi" has around ~250k rows, and buku has around ~170k rows. Both tables have column called "k999a", and both tables use no indexes. Now I check these 2 statements.

Statement 1:

explain select k999a from transaksi where k999a not in (select k999a from buku);

Statement 1 outputs:

 Seq Scan on transaksi  (cost=0.00..721109017.46 rows=125426 width=9)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..5321.60 rows=171040 width=8)
           ->  Seq Scan on buku  (cost=0.00..3797.40 rows=171040 width=8)

Statement 2:

explain select k999a from transaksi where k999a in (select k999a from buku);

Statement 2 outputs:

Hash Semi Join  (cost=6604.40..22664.82 rows=250853 width=9)
   Hash Cond: (transaksi.k999a = buku.k999a)
   ->  Seq Scan on transaksi  (cost=0.00..6356.53 rows=250853 width=9)
   ->  Hash  (cost=3797.40..3797.40 rows=171040 width=8)
         ->  Seq Scan on buku  (cost=0.00..3797.40 rows=171040 width=8)

Why in the NOT IN query, postgresql does loop join, making the query takes a long time?

PS: postgresql version 9.6.1 on windows 10

Sanguinary answered 4/10, 2018 at 3:30 Comment(7)
why no indexes?Breastplate
I don't know why Postgres chose to hash buku in one case, and materialize buku in memory in the other case. The bottom line is that if you really wanted these queries to run fast, you'd index k999a in the buku table.Frasquito
@TimBiegeleisen What difference make an index on buku? He is doing a full scan of that table anyway.Nebulose
@JuanCarlosOropeza Who says we do a full scan on that table if there is a B-tree built around k999a? Are you saying Postgres would not take advantage of this index, if it existed?Frasquito
Im saying on first select (select k999a from buku) you get a materialize table and you need the whole table to check NOT IN so in that case an index doesnt help. On the second one even when also a select looks like planner do a SEMI JOIN in that case the index would helpNebulose
@Used_By_Already a dummy table for learning, wanna check the entire plan 1stSanguinary
@JuanCarlosOropeza just tried using index on buku.k999a, no difference on planSanguinary
C
14

This is to be expected. You may get better performance using WHERE NOT EXISTS instead:

SELECT k999a
FROM transaksi
WHERE NOT EXISTS (
    SELECT 1 FROM buku WHERE buku.k999a = transaksi.k999a LIMIT 1
);

Here is a good explanation as to why for each of the methods: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/

Culet answered 4/10, 2018 at 3:34 Comment(4)
@TimBiegeleisen - I can tell you that is certainly not the case in PostgreSQL. EXISTS and IN are often the same, but not true for anti-joinsCulet
@TimBiegeleisen - it is possible only when subselect result should not be NULL. PostgreSQL cannot to ensure this behave, and then NOT IN cannot be translated to antijoin. Other databases maybe can ensure this behave or their implementation is not semantically correct.Deltadeltaic
@Nicarus so basically, the problem is caused by the implementation of NOT IN by postgres? at least that's what I get from the link you gaveSanguinary
@Culet Given that the article is very old, is this still true today?Trussing

© 2022 - 2024 — McMap. All rights reserved.