I'm having an issue where an indexed query refuses to use the index because the condition is not selective enough (say, 60 million out of 130 million rows meet it), so the planner decides on a seq scan.
The problem is that the seq scan is really not the best choice here. For some reason it gets a very good cost estimate, but in reality the seq scan is only fast if it was queried before and everything can be loaded from buffers/cache.
The index scan might be slightly slower than the seq scan if BOTH are in the buffers, but that rarely happens, and when both queries are cold the index scan is still far faster (milliseconds vs. seconds).
Note that the index scan is the better option here because I'm using a LIMIT clause, so it should be able to pick up those few rows very quickly.
I have set the statistics target to 1000 (the default is 100) and vacuumed just in case, but it's the same story.
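For reference, raising the per-column target looks roughly like this (a sketch; I'm assuming the target was raised on email_domain, the filtered column):

ALTER TABLE public.identities_identity
    ALTER COLUMN email_domain SET STATISTICS 1000;  -- default statistics target is 100
VACUUM ANALYZE public.identities_identity;          -- refresh the planner's stats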
TL;DR: seq scan vs. index scan on a low-selectivity index; the planner prefers the seq scan, but the planner is wrong: the seq scan is only better when cached, otherwise it's far worse.
Query and plans below; note that the index-scan run was served from buffers, while the seq-scan run was not completely cached.
explain (analyze, buffers)
select *
from identities_identity
where email_domain = 'live.com'
limit 100
'Limit (cost=0.00..63.50 rows=100 width=573) (actual time=75215.573..75215.640 rows=100 loops=1)'
' Buffers: shared hit=75113 read=588870'
' -> Seq Scan on identities_identity (cost=0.00..2980008.00 rows=4692733 width=573) (actual time=75215.571..75215.604 rows=100 loops=1)'
' Filter: ((email_domain)::text = 'live.com'::text)'
' Rows Removed by Filter: 54464136'
' Buffers: shared hit=75113 read=588870'
'Planning time: 0.097 ms'
'Execution time: 75215.675 ms'
'Limit (cost=0.57..187.26 rows=100 width=573) (actual time=0.027..0.090 rows=100 loops=1)'
' Buffers: shared hit=6'
' -> Index Scan using identities_identity_email_domain_9056bd28 on identities_identity (cost=0.57..8760978.66 rows=4692733 width=573) (actual time=0.026..0.057 rows=100 loops=1)'
' Index Cond: ((email_domain)::text = 'live.com'::text)'
' Buffers: shared hit=6'
'Planning time: 0.078 ms'
'Execution time: 0.124 ms'
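(Since the planner picks the seq scan on its own, the index-scan plan above can be reproduced by temporarily disabling seq scans for the session, e.g.:)

set enable_seqscan = off;  -- planner hint for this session only

explain (analyze, buffers)
select *
from identities_identity
where email_domain = 'live.com'
limit 100;

set enable_seqscan = on;   -- restore the default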
UPDATE:
Table definition (indexes on email and on email_domain, in both cases a standard one and a varchar_pattern_ops one; see the sketch after the definition):
CREATE TABLE public.identities_identity
(
id bigint NOT NULL DEFAULT nextval('identities_identity_id_seq'::regclass),
email character varying(1000) COLLATE pg_catalog."default",
email_domain character varying(1000) COLLATE pg_catalog."default",
leak_id bigint NOT NULL,
CONSTRAINT identities_identity_pkey PRIMARY KEY (id),
CONSTRAINT identities_identity_leak_id_87e1ae4e_fk_identities_leak_id FOREIGN KEY (leak_id)
REFERENCES public.identities_leak (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED
)
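The index definitions aren't included in the dump above; based on the description they presumably look something like the following. Only identities_identity_email_domain_9056bd28 is confirmed by the plan output; the other names are guesses.

-- Name confirmed by the EXPLAIN output above:
CREATE INDEX identities_identity_email_domain_9056bd28
    ON public.identities_identity (email_domain);
-- Names below are assumed for illustration:
CREATE INDEX identities_identity_email_domain_like
    ON public.identities_identity (email_domain varchar_pattern_ops);
CREATE INDEX identities_identity_email
    ON public.identities_identity (email);
CREATE INDEX identities_identity_email_like
    ON public.identities_identity (email varchar_pattern_ops);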
Table stats (after VACUUM ANALYZE):
attname      | avg_width | n_distinct | correlation
-------------+-----------+------------+------------
id           | 8         | -1         | 0.999988
email        | 23        | -0.636853  | -0.020479
email_domain | 10        | 3876       | 0.696452
leak_id      | 8         | 1          | 1
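(These columns come straight from pg_stats; a query along these lines reproduces them:)

SELECT attname, avg_width, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'identities_identity';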
Comments:

Have you tried set enable_seqscan = false;? postgresql.org/docs/current/static/runtime-config-query.html – Soukup

enable_seqscan=off for this one query... – Yonne

[...] 3) You have rather fat rows, even for the index. – Encampment

Moving email_domain, or even email -> email_domain (there must be a transitive dependency here), into a separate table and referring to it via an (integer) FK (needs a supporting index) will force usage of the index(es). It will also reduce your row sizes -> buffer footprint. – Encampment
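A minimal sketch of the normalization the last comment suggests (all names here are illustrative, not from the original schema):

-- Move the repeated domain strings into their own table.
CREATE TABLE identities_domain (
    id   serial PRIMARY KEY,
    name character varying(1000) NOT NULL UNIQUE
);

-- Reference it from the big table via a narrow integer FK.
ALTER TABLE public.identities_identity
    ADD COLUMN domain_id integer REFERENCES identities_domain (id);

-- Supporting index for the FK, as the comment notes.
CREATE INDEX identities_identity_domain_id_idx
    ON public.identities_identity (domain_id);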