In PostgreSQL, how can one efficiently do a full text search on one column, sorting on another column?
Say I have a table tbl
with columns a
, b
, c
, ... and many (> a million) rows. I want to do a full text search on column a
and sort the results by some other column.
So I create a tsvector va
from column a
,
ALTER TABLE tbl
ADD COLUMN va tsvector GENERATED ALWAYS AS (to_tsvector('english', a)) STORED;
create an index iva
for that,
CREATE INDEX iva ON tbl USING GIN (va);
and an index ib
for column b
,
CREATE INDEX ib ON tbl (b);
Then I query like
SELECT * FROM tbl WHERE va @@ to_tsquery('english', 'test') ORDER BY b LIMIT 100
Now the obvious execution strategy for Postgres would be:
for frequent words to do an Index Scan using
ib
, filtering forva @@ 'test'::tsquery
, and stopping after 100 matches,while for rare words to do a (Bitmap) Index Scan using
iva
with conditionva @@ 'test'::tsquery
, and then to sort onb
manually
However, Postgres' query planner seems not to take word frequency into account:
With a low
LIMIT
(e.g.100
) it always uses strategy 1 (as I checked withEXPLAIN
), and in my case takes over a minute for rare (or not occurring) words. However, if I trick it into using strategy 2 by setting a large (or no)LIMIT
, it returns in a millisecond!The other way round, with a larger
LIMIT
(e.g.200
) it always uses strategy 2 which works well for rare words but is very slow for frequent words
So how do I get Postgres to use a good query plan in every case?
Since there seems currently no way to let Postgres to choose the right plan automatically,
how do I get the number of rows containing a specific lexeme so I can decide on the best strategy?
(
SELECT COUNT(*) FROM tbl WHERE va @@ to_tsquery('english', 'test')
is horribly slow (~ 1 second for lexemes occurring in 10000 rows), andts_stat
seems also not to help, apart from building my own word frequency list)how do I then tell Postgres to use this strategy?
Here a concrete example
I have a table items
with 1.5 million rows, with a tsvector column v3
on which I do the text search, and a column rating
on which I sort. In this case I determined the query planner always chooses strategy 1 if the LIMIT is 135 or less, else strategy 2
Here the EXPLAIN ANALYZE for the rare word 'aberdeen' (occurring in 132 rows) with LIMIT 135:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'aberdeen')
ORDER BY rating DESC NULLS LAST LIMIT 135
Limit (cost=0.43..26412.78 rows=135 width=28) (actual time=5915.455..499917.390 rows=132 loops=1)
Buffers: shared hit=4444267 read=2219412
I/O Timings: read=485517.381
-> Index Scan using ir on items (cost=0.43..1429202.13 rows=7305 width=28) (actual time=5915.453..499917.242 rows=132 loops=1)
Filter: (v3 @@ '''aberdeen'''::tsquery)"
Rows Removed by Filter: 1460845
Buffers: shared hit=4444267 read=2219412
I/O Timings: read=485517.381
Planning:
Buffers: shared hit=253
Planning Time: 1.270 ms
Execution Time: 499919.196 ms
and with LIMIT 136:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'aberdeen')
ORDER BY rating DESC NULLS LAST LIMIT 136
Limit (cost=26245.53..26245.87 rows=136 width=28) (actual time=29.870..29.889 rows=132 loops=1)
Buffers: shared hit=57 read=83
I/O Timings: read=29.085
-> Sort (cost=26245.53..26263.79 rows=7305 width=28) (actual time=29.868..29.876 rows=132 loops=1)
Sort Key: rating DESC NULLS LAST
Sort Method: quicksort Memory: 34kB
Buffers: shared hit=57 read=83
I/O Timings: read=29.085
-> Bitmap Heap Scan on items (cost=88.61..25950.14 rows=7305 width=28) (actual time=1.361..29.792 rows=132 loops=1)
Recheck Cond: (v3 @@ '''aberdeen'''::tsquery)"
Heap Blocks: exact=132
Buffers: shared hit=54 read=83
I/O Timings: read=29.085
-> Bitmap Index Scan on iv3 (cost=0.00..86.79 rows=7305 width=0) (actual time=1.345..1.345 rows=132 loops=1)
Index Cond: (v3 @@ '''aberdeen'''::tsquery)"
Buffers: shared hit=3 read=2
I/O Timings: read=1.299
Planning:
Buffers: shared hit=253
Planning Time: 1.296 ms
Execution Time: 29.932 ms
and here for the frequent word 'game' (occurring in 240464 rows) with LIMIT 135:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'game')
ORDER BY rating DESC NULLS LAST LIMIT 135
Limit (cost=0.43..26412.78 rows=135 width=28) (actual time=3.240..542.252 rows=135 loops=1)
Buffers: shared hit=2876 read=1930
I/O Timings: read=529.523
-> Index Scan using ir on items (cost=0.43..1429202.13 rows=7305 width=28) (actual time=3.239..542.216 rows=135 loops=1)
Filter: (v3 @@ '''game'''::tsquery)
Rows Removed by Filter: 867
Buffers: shared hit=2876 read=1930
I/O Timings: read=529.523
Planning:
Buffers: shared hit=208 read=45
I/O Timings: read=15.626
Planning Time: 25.174 ms
Execution Time: 542.306 ms
and with LIMIT 136:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'game')
ORDER BY rating DESC NULLS LAST LIMIT 136
Limit (cost=26245.53..26245.87 rows=136 width=28) (actual time=69419.656..69419.675 rows=136 loops=1)
Buffers: shared hit=1757820 read=457619
I/O Timings: read=65246.893
-> Sort (cost=26245.53..26263.79 rows=7305 width=28) (actual time=69419.654..69419.662 rows=136 loops=1)
Sort Key: rating DESC NULLS LAST
Sort Method: top-N heapsort Memory: 41kB
Buffers: shared hit=1757820 read=457619
I/O Timings: read=65246.893
-> Bitmap Heap Scan on items (cost=88.61..25950.14 rows=7305 width=28) (actual time=110.959..69326.343 rows=240464 loops=1)
Recheck Cond: (v3 @@ '''game'''::tsquery)
Rows Removed by Index Recheck: 394527
Heap Blocks: exact=49894 lossy=132284
Buffers: shared hit=1757817 read=457619
I/O Timings: read=65246.893
-> Bitmap Index Scan on iv3 (cost=0.00..86.79 rows=7305 width=0) (actual time=100.537..100.538 rows=240464 loops=1)
Index Cond: (v3 @@ '''game'''::tsquery)
Buffers: shared hit=1 read=60
I/O Timings: read=26.870
Planning:
Buffers: shared hit=253
Planning Time: 1.195 ms
Execution Time: 69420.399 ms