Efficient full text search in PostgreSQL, sorting on another column

In PostgreSQL, how can one efficiently do a full text search on one column, sorting on another column?

Say I have a table tbl with columns a, b, c, ... and many (> a million) rows. I want to do a full text search on column a and sort the results by some other column.

So I create a tsvector va from column a,

ALTER TABLE tbl
ADD COLUMN va tsvector GENERATED ALWAYS AS (to_tsvector('english', a)) STORED;

create an index iva for that,

CREATE INDEX iva ON tbl USING GIN (va);

and an index ib for column b,

CREATE INDEX ib ON tbl (b);

Then I query like this:

SELECT * FROM tbl WHERE va @@ to_tsquery('english', 'test') ORDER BY b LIMIT 100

Now the obvious execution strategies for Postgres would be:

  1. for frequent words, to do an Index Scan using ib, filtering for va @@ 'test'::tsquery, and stopping after 100 matches;

  2. for rare words, to do a (Bitmap) Index Scan using iva with the condition va @@ 'test'::tsquery, and then to sort the few results on b manually.

However, Postgres' query planner seems not to take word frequency into account:

  • With a low LIMIT (e.g. 100) it always uses strategy 1 (as I checked with EXPLAIN), and in my case takes over a minute for rare (or non-occurring) words. However, if I trick it into using strategy 2 by setting a large (or no) LIMIT, it returns within milliseconds!

  • The other way round, with a larger LIMIT (e.g. 200) it always uses strategy 2, which works well for rare words but is very slow for frequent words.

So how do I get Postgres to use a good query plan in every case?

Since there currently seems to be no way to make Postgres choose the right plan automatically,

  • how do I get the number of rows containing a specific lexeme, so that I can decide on the best strategy?

    (SELECT COUNT(*) FROM tbl WHERE va @@ to_tsquery('english', 'test') is horribly slow (~1 second for lexemes occurring in 10,000 rows), and ts_stat also seems not to help, apart from building my own word frequency list; see the sketch after this list.)

  • how do I then tell Postgres to use this strategy?
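
The only workable approach I see for the first point is building that frequency list myself with ts_stat, roughly as sketched below (word_freq is just a name I made up here; ts_stat scans the whole table, so the list has to be precomputed and refreshed offline):

-- Precompute how many rows contain each lexeme:
CREATE TABLE word_freq AS
SELECT word, ndoc      -- ndoc = number of rows whose va contains the lexeme
FROM ts_stat('SELECT va FROM tbl');

CREATE INDEX ON word_freq (word);

-- Cheap lookup at query time to decide between strategy 1 and 2:
SELECT ndoc FROM word_freq WHERE word = 'test';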


Here is a concrete example:

I have a table items with 1.5 million rows, with a tsvector column v3 on which I do the text search, and a column rating on which I sort. In this case I determined that the query planner always chooses strategy 1 if the LIMIT is 135 or less, and strategy 2 otherwise.

Here is the EXPLAIN (ANALYZE, BUFFERS) output for the rare word 'aberdeen' (occurring in 132 rows) with LIMIT 135:

EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'aberdeen')
  ORDER BY rating DESC NULLS LAST LIMIT 135

Limit  (cost=0.43..26412.78 rows=135 width=28) (actual time=5915.455..499917.390 rows=132 loops=1)
  Buffers: shared hit=4444267 read=2219412
  I/O Timings: read=485517.381
  ->  Index Scan using ir on items  (cost=0.43..1429202.13 rows=7305 width=28) (actual time=5915.453..499917.242 rows=132 loops=1)
        Filter: (v3 @@ '''aberdeen'''::tsquery)
        Rows Removed by Filter: 1460845
        Buffers: shared hit=4444267 read=2219412
        I/O Timings: read=485517.381
Planning:
  Buffers: shared hit=253
Planning Time: 1.270 ms
Execution Time: 499919.196 ms

and with LIMIT 136:

EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'aberdeen')
  ORDER BY rating DESC NULLS LAST LIMIT 136

Limit  (cost=26245.53..26245.87 rows=136 width=28) (actual time=29.870..29.889 rows=132 loops=1)
  Buffers: shared hit=57 read=83
  I/O Timings: read=29.085
  ->  Sort  (cost=26245.53..26263.79 rows=7305 width=28) (actual time=29.868..29.876 rows=132 loops=1)
        Sort Key: rating DESC NULLS LAST
        Sort Method: quicksort  Memory: 34kB
        Buffers: shared hit=57 read=83
        I/O Timings: read=29.085
        ->  Bitmap Heap Scan on items  (cost=88.61..25950.14 rows=7305 width=28) (actual time=1.361..29.792 rows=132 loops=1)
              Recheck Cond: (v3 @@ '''aberdeen'''::tsquery)
              Heap Blocks: exact=132
              Buffers: shared hit=54 read=83
              I/O Timings: read=29.085
              ->  Bitmap Index Scan on iv3  (cost=0.00..86.79 rows=7305 width=0) (actual time=1.345..1.345 rows=132 loops=1)
                    Index Cond: (v3 @@ '''aberdeen'''::tsquery)
                    Buffers: shared hit=3 read=2
                    I/O Timings: read=1.299
Planning:
  Buffers: shared hit=253
Planning Time: 1.296 ms
Execution Time: 29.932 ms

and here is the output for the frequent word 'game' (occurring in 240,464 rows) with LIMIT 135:

EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'game')
  ORDER BY rating DESC NULLS LAST LIMIT 135

Limit  (cost=0.43..26412.78 rows=135 width=28) (actual time=3.240..542.252 rows=135 loops=1)
  Buffers: shared hit=2876 read=1930
  I/O Timings: read=529.523
  ->  Index Scan using ir on items  (cost=0.43..1429202.13 rows=7305 width=28) (actual time=3.239..542.216 rows=135 loops=1)
        Filter: (v3 @@ '''game'''::tsquery)
        Rows Removed by Filter: 867
        Buffers: shared hit=2876 read=1930
        I/O Timings: read=529.523
Planning:
  Buffers: shared hit=208 read=45
  I/O Timings: read=15.626
Planning Time: 25.174 ms
Execution Time: 542.306 ms

and with LIMIT 136:

EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'game')
  ORDER BY rating DESC NULLS LAST LIMIT 136
  
Limit  (cost=26245.53..26245.87 rows=136 width=28) (actual time=69419.656..69419.675 rows=136 loops=1)
  Buffers: shared hit=1757820 read=457619
  I/O Timings: read=65246.893
  ->  Sort  (cost=26245.53..26263.79 rows=7305 width=28) (actual time=69419.654..69419.662 rows=136 loops=1)
        Sort Key: rating DESC NULLS LAST
        Sort Method: top-N heapsort  Memory: 41kB
        Buffers: shared hit=1757820 read=457619
        I/O Timings: read=65246.893
        ->  Bitmap Heap Scan on items  (cost=88.61..25950.14 rows=7305 width=28) (actual time=110.959..69326.343 rows=240464 loops=1)
              Recheck Cond: (v3 @@ '''game'''::tsquery)
              Rows Removed by Index Recheck: 394527
              Heap Blocks: exact=49894 lossy=132284
              Buffers: shared hit=1757817 read=457619
              I/O Timings: read=65246.893
              ->  Bitmap Index Scan on iv3  (cost=0.00..86.79 rows=7305 width=0) (actual time=100.537..100.538 rows=240464 loops=1)
                    Index Cond: (v3 @@ '''game'''::tsquery)
                    Buffers: shared hit=1 read=60
                    I/O Timings: read=26.870
Planning:
  Buffers: shared hit=253
Planning Time: 1.195 ms
Execution Time: 69420.399 ms
Idolism asked 6/11, 2021 at 1:56

Comments:
  • What version? The planner is not static. – Mccomas
  • Postgres 13. What do you mean by "not static"? – Idolism
  • Not static = it gets improved from version to version. – Mccomas
  • It seems to do a good job to me. Can you show us concrete EXPLAIN (ANALYZE, BUFFERS) output from concrete cases, and replace "> a million", "rare", and "frequent" with quantitative values? Turn on track_io_timing first if possible. – Mccomas
  • @Mccomas I have now given a concrete example, as you asked. – Idolism

This is not easy to solve: full text search requires a GIN index, but a GIN index cannot support ORDER BY. Also, if you have a B-tree index for ORDER BY and a GIN index for the full text search, these can be combined using a bitmap index scan, but a bitmap index scan cannot support ORDER BY either.

I see a certain possibility if you create your own “stop word” list that contains all the frequent words in your data (in addition to the normal English stop words). Then you can define a text search dictionary that uses that stop word file and a text search configuration english_rare using that dictionary.
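
For illustration, the setup could look roughly like this. This is only a sketch: it assumes you have written your frequent words, one per line, to a file english_rare.stop in the tsearch_data directory of your installation, and the names english_rare_stem and english_rare are arbitrary:

-- Dictionary that stems like the built-in English one, but additionally
-- treats the words in english_rare.stop as stop words:
CREATE TEXT SEARCH DICTIONARY english_rare_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english_rare
);

-- Configuration that uses this dictionary for the common word token types:
CREATE TEXT SEARCH CONFIGURATION english_rare (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION english_rare
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_rare_stem;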

Then you could create your full text index using that configuration and query in two steps like this:

  1. look for rare words:

    SELECT *
    FROM (SELECT *
          FROM tbl
          WHERE va @@ to_tsquery('english_rare', 'test')
          OFFSET 0) AS q
    ORDER BY b LIMIT 100;
    

    The subquery with OFFSET 0 will keep the optimizer from scanning the index on b.

    For rare words, this will return the correct result quickly. For frequent words, it will return nothing, since to_tsquery will return an empty result. To distinguish a miss because the word does not occur from a miss because the word is frequent, watch for the following notice (or see the numnode() sketch after this list):

    NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
    
  2. look for frequent words (if the first query gave you the notice):

    SELECT *
    FROM (SELECT *
          FROM tbl
          ORDER BY b) AS q
    WHERE va @@ to_tsquery('english', 'test')
    LIMIT 100;
    

    Note that we use the normal English configuration here. This will always scan the index on b and will be reasonably fast for frequent search terms.
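
Instead of watching for the notice, the same distinction can be made in SQL: numnode() returns the number of lexemes plus operators in a tsquery, so a result of 0 means the search term consisted only of stop words, i.e. frequent words. A small sketch:

-- 0 nodes left after stripping stop words => the word is "frequent",
-- so fall through to the second query:
SELECT numnode(to_tsquery('english_rare', 'test')) = 0 AS word_is_frequent;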

Earp answered 27/1, 2022 at 13:43

Comments:
  • Thanks for the answer! It seems far better than my first idea (launch two concurrent queries and kill the slow one as soon as the other answers...). My concern now is that the "frequent words" dictionary may be quite big, depending on the threshold I choose for "frequent". I'll do my tests and come back to you. – Torpor
  • Yes, composing that file is the Achilles' heel of my idea. – Earp
  • Maybe another point of attention: for the first query, if searching for multiple words, the OR operator is mandatory. It raises some interesting questions :). – Torpor
  • A search string with several words is interpreted as "and", and that would be no problem. But if the search string contains a |, that would be a problem for my strategy. Either forbid that, or split the search string and treat each element separately. – Earp
  • @Vincent, I have now reported my experience. – Idolism

Solution for my scenario, which I think will work well for many real-world cases:

Let Postgres use the "rare-word strategy" (strategy 2 in the question) always or almost always. The reason is that a user should always have the option to sort by relevance (e.g. using ts_rank), in which case the other strategy cannot be used, so the "rare-word strategy" has to be fast enough for all searches anyway.
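
For reference, such a relevance-sorted query looks like this (standard ts_rank usage; no B-tree index can help with this ORDER BY):

SELECT *
FROM tbl, to_tsquery('english', 'test') AS q
WHERE va @@ q
ORDER BY ts_rank(va, q) DESC
LIMIT 100;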

To force Postgres to use this strategy one can use a subquery, as Laurenz Albe has pointed out:

SELECT * FROM 
    (SELECT * FROM tbl WHERE va @@ to_tsquery('english', 'test') OFFSET 0) AS q
ORDER BY b LIMIT 100;

Alternatively, one can simply set LIMIT sufficiently high while only fetching as many results as needed, e.g. with a cursor as sketched below.
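
A sketch with a cursor (the cursor name and the numbers are arbitrary; note that Postgres plans cursor queries for partial fetches, see cursor_tuple_fraction, so verify the chosen plan with EXPLAIN):

BEGIN;
DECLARE results CURSOR FOR
    SELECT * FROM tbl
    WHERE va @@ to_tsquery('english', 'test')
    ORDER BY b
    LIMIT 10000;        -- high enough that the rare-word strategy is chosen
FETCH 100 FROM results; -- only the first page is actually sent to the client
CLOSE results;
COMMIT;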

I could achieve sufficient performance (nearly all queries take < 1 second) by

  • doing each search first against a smaller tsvector containing the most relevant parts of each document (e.g. title and summary), and checking the full document only if this first query does not yield enough results (see the sketch after this list),
  • treating very frequent words specially, e.g. only allowing them in AND-combination with other words (adding them to the stop words is problematic, since stop words are not treated sensibly when they occur in phrases, for example),
  • increasing RAM and shared_buffers so that the whole table can be cached (8.5 GB in my case currently).
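
The first point could look roughly like this (a sketch; title and summary stand in for whatever the most relevant parts of a document are):

-- Smaller tsvector over only the most relevant parts of each document:
ALTER TABLE tbl
    ADD COLUMN vshort tsvector GENERATED ALWAYS AS
        (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(summary, ''))) STORED;

CREATE INDEX ivshort ON tbl USING GIN (vshort);

-- Query the small vector first; fall back to va only if this returns
-- fewer rows than needed:
SELECT * FROM
    (SELECT * FROM tbl WHERE vshort @@ to_tsquery('english', 'test') OFFSET 0) AS q
ORDER BY b LIMIT 100;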

For cases where these optimizations are not enough and better performance is needed for all queries (i.e. also those sorting by relevance, which are the hardest), I think one would have to use a more sophisticated text search index instead of GIN. There is the RUM index extension, which looks promising, but I haven't tried it yet; a sketch based on its README follows.
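
Untested here, but mirroring the examples in the RUM README, the index and a relevance-ordered query would look roughly like this. RUM can return matches in distance order directly from the index, so the separate sort step disappears:

CREATE EXTENSION rum;

CREATE INDEX iva_rum ON tbl USING rum (va rum_tsvector_ops);

-- <=> is RUM's distance operator between tsvector and tsquery;
-- ordering by it lets the index deliver rows by relevance:
SELECT *
FROM tbl
WHERE va @@ to_tsquery('english', 'test')
ORDER BY va <=> to_tsquery('english', 'test')
LIMIT 100;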


PS: Contrary to my observation in the question, I have now found that under certain circumstances the planner does take word frequency into account and decides in the right direction:

For rare words the borderline LIMIT above which it chooses the "rare-word strategy" is lower than for frequent words, and within a certain range this choice seems very good. However, it is in no way reliable, and sometimes the choice is very wrong; e.g. for low LIMITs it chooses the "frequent-word strategy" even for very rare or non-occurring words, which leads to awful slowness.

It appears to depend on many factors and does not seem predictable.

Idolism answered 2/2, 2022 at 2:31
