Best way to use PostgreSQL full text search ranking
Asked Answered
C

3

23

Following on from this answer I want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, and limit to only matching queries.

Let's assume a very simple table.

CREATE TABLE pictures (
  id SERIAL PRIMARY KEY,
  title varchar(300),
  ...
)

or whatever. Now I want to search the title field. First I create an index:

CREATE INDEX pictures_title ON pictures 
  USING gin(to_tsvector('english', title));

Now I want to search for 'small dog'. This works:

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), 'small dog'
       ) AS score
FROM pictures
ORDER BY score DESC

But what I really want is this:

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), to_tsquery('small dog')
       ) AS score
FROM pictures 
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC

Or alternatively this (which doesn't work - can't use score in the WHERE clause):

SELECT pictures.id, 
       ts_rank_cd(
         to_tsvector('english', pictures.title), to_tsquery('small dog')
       ) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC

What's the best way to do this? My questions are many-fold:

  1. If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
  2. Is there a way to do it without repeating the to_ts... function calls?
  3. Is there a way to use score in the WHERE clause at all?
  4. If there is, would it be better to filter by score > 0 or use the @@ thing?
Caftan answered 17/10, 2012 at 11:58 Comment(0)
R
29

The use of the @@ operator will utilize the full text GIN index, while the test for score > 0 would not.

I created a table as in the Question, but added a column named title_tsv:

CREATE TABLE test_pictures (
  id        BIGSERIAL,
  title     text,
  title_tsv tsvector
);

CREATE INDEX ix_pictures_title_tsv ON test_pictures 
    USING gin(title_tsv);

I populated the table with some test data:

INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data) 
FROM   some_table T;

Then I ran the previously accepted answer with explain analyze:

EXPLAIN ANALYZE 
SELECT  score, id, title
FROM (
    SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
        ,P.id        
        ,P.title
    FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;

And got the following. Note the execution time of 5,015 ms

QUERY PLAN                                                                                                                                    |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge  (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1)                             |
  Workers Planned: 2                                                                                                                          |
  Workers Launched: 2                                                                                                                         |
  ->  Sort  (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3)                                |
        Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC                                                      |
        Sort Method: quicksort  Memory: 131kB                                                                                                 |
        ->  Parallel Seq Scan on test_pictures p  (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
              Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision)                                 |
              Rows Removed by Filter: 497296                                                                                                  |
Planning time: 0.159 ms                                                                                                                       |
Execution time: 5015.664 ms                                                                                                                   |

Now compare that with the @@ operator:

EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
    ,P.id
    ,P.title
FROM    test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;

And the results coming in with an execution time of about 29 ms:

QUERY PLAN                                                                                                                                       |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge  (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1)                                        |
  Workers Planned: 2                                                                                                                             |
  Workers Launched: 2                                                                                                                            |
  ->  Sort  (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3)                                           |
        Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC                                                  |
        Sort Method: quicksort  Memory: 171kB                                                                                                    |
        ->  Parallel Bitmap Heap Scan on test_pictures p  (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
              Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text))                                                                |
              Heap Blocks: exact=625                                                                                                             |
              ->  Bitmap Index Scan on ix_pictures_title_tsv  (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1)  |
                    Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text))                                                            |
Planning time: 0.214 ms                                                                                                                          |
Execution time: 28.995 ms                                                                                                                        |

As you can see in the execution plan, the index ix_pictures_title_tsv was used in the second query, but not in the first one, making the query with the @@ operator a whopping 172 times faster!

Ripplet answered 5/7, 2018 at 20:21 Comment(0)
D
7
select *
from (
    SELECT
        pictures.id,
        ts_rank_cd(to_tsvector('english', pictures.title), 
        to_tsquery('small dog')) AS score
    FROM pictures
) s
WHERE score > 0
ORDER BY score DESC
Douglass answered 17/10, 2012 at 12:18 Comment(1)
This query does not use the index. Also, to_tsquery('small dog') throws an error. You probably meant to_tsquery('small & dog').Ripplet
S
4

If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?

The best way to notice these things is to do a simple explain, although those can be hard to read.

Long story short, yes, PostgreSQL is smart enough to reuse computed results.

Is there a way to do it without repeating the to_ts... function calls?

What I usually do is add a tsv column which is the text search vector. If you make this auto update using triggers it immediately gives you the vector easily accessible but it also allows you to selectively update the search index by making the trigger selective.

Is there a way to use score in the WHERE clause at all?

Yes, but not with that name. Alternatively you could create a sub-query, but I would personally just repeat it.

If there is, would it be better to filter by score > 0 or use the @@ thing?

The simplest version I can think of is this:

SELECT *
FROM pictures
WHERE 'small dog' @@ text_search_vector

The text_search_vector could obviously be replaced with something like to_tsvector('english', pictures.title)

Swingletree answered 19/10, 2012 at 0:57 Comment(1)
This answer ignores what seemse to be the most important part of the question : "ranking"Moseley

© 2022 - 2024 — McMap. All rights reserved.