PostgreSQL: index scan vs. seq scan on 100 million rows

I'm having an issue where a query on an indexed column refuses to use the index because the condition is not selective enough (say 60 out of 130 million rows match), so the planner decides to use a seq scan instead.

The problem is that the seq scan is really not the best choice here. For some reason it gets a very good cost estimate, but in practice it only runs fast if it has been run before and everything can be loaded from the buffers/cache.

The index scan might be slightly slower than the seq scan if BOTH of them are served from the buffers, but that rarely happens, and when both queries are cold the index scan is still way faster (milliseconds vs. seconds).

Note that the index scan is superior here because I'm using a LIMIT clause, so it should be able to pick up those few rows very quickly.

I have set the statistics target to 1000 (the default is 100) and vacuumed just in case, but it's the same story.
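
Roughly the commands used, as a sketch (this assumes the per-column statistics target on email_domain rather than default_statistics_target):

-- Raise the per-column statistics target from the default of 100
ALTER TABLE identities_identity
    ALTER COLUMN email_domain SET STATISTICS 1000;
-- Rebuild the statistics (and clean up) with the new target
VACUUM ANALYZE identities_identity;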

TL;DR: seq scan vs. index scan on a low-selectivity index. The seq scan is preferred, but the planner is wrong: the seq scan is only better when the data is cached, otherwise it's way worse.

Query and plans below; note that the index scan was served from the buffers while the seq scan wasn't (at least not completely).

explain (analyze, buffers)
select *
from identities_identity
where email_domain = 'live.com'
limit 100


'Limit  (cost=0.00..63.50 rows=100 width=573) (actual time=75215.573..75215.640 rows=100 loops=1)'
'  Buffers: shared hit=75113 read=588870'
'  ->  Seq Scan on identities_identity  (cost=0.00..2980008.00 rows=4692733 width=573) (actual time=75215.571..75215.604 rows=100 loops=1)'
'        Filter: ((email_domain)::text = 'live.com'::text)'
'        Rows Removed by Filter: 54464136'
'        Buffers: shared hit=75113 read=588870'
'Planning time: 0.097 ms'
'Execution time: 75215.675 ms'


'Limit  (cost=0.57..187.26 rows=100 width=573) (actual time=0.027..0.090 rows=100 loops=1)'
'  Buffers: shared hit=6'
'  ->  Index Scan using identities_identity_email_domain_9056bd28 on identities_identity  (cost=0.57..8760978.66 rows=4692733 width=573) (actual time=0.026..0.057 rows=100 loops=1)'
'        Index Cond: ((email_domain)::text = 'live.com'::text)'
'        Buffers: shared hit=6'
'Planning time: 0.078 ms'
'Execution time: 0.124 ms'

UPDATE:

Table definition (there are indexes on email and email_domain, each with both a standard b-tree and a varchar_pattern_ops index; a sketch of them follows the definition):

CREATE TABLE public.identities_identity
(
    id bigint NOT NULL DEFAULT nextval('identities_identity_id_seq'::regclass),
    email character varying(1000) COLLATE pg_catalog."default",
    email_domain character varying(1000) COLLATE pg_catalog."default",
    leak_id bigint NOT NULL,
    CONSTRAINT identities_identity_pkey PRIMARY KEY (id),
    CONSTRAINT identities_identity_leak_id_87e1ae4e_fk_identities_leak_id FOREIGN KEY (leak_id)
        REFERENCES public.identities_leak (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED
)
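
The indexes are not shown in the definition above; a rough sketch of them (only identities_identity_email_domain_9056bd28 is a name taken from the plan above, the other names are illustrative):

-- Plain b-tree index on email_domain (name from the index scan plan)
CREATE INDEX identities_identity_email_domain_9056bd28
    ON public.identities_identity (email_domain);

-- varchar_pattern_ops index on email_domain for LIKE 'prefix%' queries (name illustrative)
CREATE INDEX identities_identity_email_domain_like
    ON public.identities_identity (email_domain varchar_pattern_ops);

-- The same pair on email (names illustrative)
CREATE INDEX identities_identity_email
    ON public.identities_identity (email);
CREATE INDEX identities_identity_email_like
    ON public.identities_identity (email varchar_pattern_ops);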

Table stats (after VACUUM ANALYZE):

attname, avg_width, n_distinct, correlation
'id',8,'-1','0.999988'
'email',23,'-0.636853','-0.020479'
'email_domain',10,'3876','0.696452'
'leak_id',8,'1','1'
Pardew asked 17/10, 2017 at 22:46 Comment(11)
It's funny that even with random_page_cost=0.05 (a nonsense value) it will still select a seq scan for some of the largest domains like 'gmail', although it will switch to an index scan for some smaller ones like 'live.com'. But still, the index scan with cold buffers takes a few ms while the seq scan can take minutes; something is really wrong here.Pardew
I'm wondering whether the issue is related to the fact that the data is probably not evenly distributed, since it was inserted per domain (i.e. all the gmail.com rows first, then live.com, etc.)Pardew
Have you tried set enable_seqscan = false;? postgresql.org/docs/current/static/runtime-config-query.htmlSoukup
The problem is clearly related to the data distribution - PostgreSQL thinks that it will get the first 100 matching rows quickly in a sequential scan, but that is not the case. How many rows does the table contain?Yonne
@DaveGray yes, that's the way I get the index scan plan, otherwise it will always use the seq scan. @LaurenzAlbe that's what I thought, shouldn't Postgres be smarter? I have tried VACUUM ANALYZE with the statistics target set to 10000 but the results are the same. What are the options to fix this? The data insertion order, and hence the auto-id order, pretty much has those ~50 million rows grouped together; should I try to CLUSTER on some other index? How do I randomly distribute the data otherwise?Pardew
Can't think of a better way than setting enable_seqscan=off for this one query...Yonne
There has to be a way to redistribute the data? Or at least some statistic that would hint to Postgres that it is making a really bad choice with the seq scan. Using enable_seqscan=off is quite problematic, especially when going through an ORM.Pardew
0) Table definitions + statistics, please! 1) Did you ANALYZE after adding the index? 2) What is the cardinality of email_domain? 3) You have rather fat rows, even for the index.Encampment
@Encampment see the updates. Yes, I did analyze. So far the issue seems related to the data distribution: it's not uniform at all, but rather large groups stored together, so a seq scan from the beginning will always be bad if the group is at the end of the table, but Postgres should know that.Pardew
Normalize: squeezing email_domain, or even email -> email_domain (there must be a transitive dependency here), out into a separate table and referring to it via an (integer) FK (which needs a supporting index) will force usage of the index(es). It will also reduce your row sizes -> buffer footprint.Encampment
Although normalization might work, there are a few issues with it: there are some other columns I didn't include that might be queried in the same way, so I would probably need to normalize them all. And the queries are really fast and good 99% of the time; it's just these border cases where the query planner goes crazy. I would expect there's a way to fix the data distribution issue instead.Pardew

Well, the solution was to physically reorder the data so that sequential scans for those special cases wouldn't fail.

Basically, run CLUSTER identities_identity USING index_name; with an index on a column that makes the data evenly distributed (such as the value before the email domain).
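
A minimal sketch of that, assuming the data is reordered via an index on email (the local part sorts before the domain, so clustering on it spreads the domains across the table); the index name here is illustrative:

-- CLUSTER rewrites the whole table in index order and holds an
-- ACCESS EXCLUSIVE lock while it runs, so plan for downtime.
CLUSTER identities_identity USING identities_identity_email_idx;
-- Refresh planner statistics, since the physical order (correlation) has changed
ANALYZE identities_identity;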

Sequential scans now run just fine, even with cold buffers.

However, @Laurenz Albe's answer is quite good for the specific case I posted, and a nice trick to have if clustering weren't possible.

Pardew answered 19/10, 2017 at 13:35 Comment(0)

You could use a mean trick to force an index scan:

SELECT *
FROM identities_identity
WHERE email_domain IN ('live.com', NULL)
ORDER BY email_domain
LIMIT 100;

If PostgreSQL has to sort, using the index will always be cheaper.

If you have just WHERE email_domain = 'live.com', PostgreSQL is smart enough to know it doesn't have to sort; that's why I added a second, useless item to fool it.
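
To verify that the rewrite flips the plan, it can be run through the same EXPLAIN as in the question:

explain (analyze, buffers)
select *
from identities_identity
where email_domain in ('live.com', NULL)
order by email_domain
limit 100;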

Yonne answered 19/10, 2017 at 4:11 Comment(3)
That's a nice trick to force the index and it seems to work! However, what happens if there are also LIKE 'xxx%' queries? I also have an index with varchar_pattern_ops for those and it suffers from the exact same issue, and the above solution can't really work since the ORDER BY prevents the pattern-ops index from being used. I guess I should have mentioned that in the question as well.Pardew
That's different. Did you try WHERE email_domain LIKE 'live%' AND email_domain ~>=~ 'live'?Yonne
It will still choose a seq scan with that one. It's OK, it really seems Postgres picks the right plan as long as the data is evenly distributed.Pardew