PostgreSQL Full Text Search: why search is sooo slow?
Asked Answered
A

3

6

I have a small PostgreSQL database (~~3,000 rows).

I'm trying to set up a full text search on one of it's text fields ('body').

The problem is that any query is extremely slow (35+ seconds!!!).

I suppose the problem comes from the fact that the DB chooses a sequential scan mode...

This is my query:

    SELECT
        ts_rank_cd(to_tsvector('italian', body), query),
        ts_headline('italian', body, to_tsquery('torino')),
        title,
        location,
        id_author
    FROM
        fulltextsearch.documents, to_tsquery('torino') as query
    WHERE
        (body_tsvector @@ query)
    OFFSET
        0

This is the EXPLAIN ANALYZE:

                                      QUERY PLAN                                    
----------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..1129.81 rows=19 width=468) (actual time=74.059..13630.114 rows=863 loops=1)
->  Nested Loop  (cost=0.00..1129.81 rows=19 width=468) (actual time=74.056..13629.342 rows=863 loops=1)
     Join Filter: (documents.body_tsvector @@ query.query)
     ->  Function Scan on to_tsquery query  (cost=0.00..0.01 rows=1 width=32) (actual time=4.606..4.608 rows=1 loops=1)
     ->  Seq Scan on documents  (cost=0.00..1082.09 rows=3809 width=591) (actual time=0.045..48.072 rows=3809 loops=1)
Total runtime: 13630.720 ms

This is my table:

mydb=# \d+ fulltextsearch.documents;
                                              Table "fulltextsearch.documents"
    Column     |       Type        |                               Modifiers                               | Storage  | Description
---------------+-------------------+-----------------------------------------------------------------------+----------+-------------
 id            | integer           | not null default nextval('fulltextsearch.documents_id_seq'::regclass) | plain    |
 id_author     | integer           |                                                                       | plain    |
 body          | character varying |                                                                       | extended |
 title         | character varying |                                                                       | extended |
 location      | character varying |                                                                       | extended |
 date_creation | date              |                                                                       | plain    |
 body_tsvector | tsvector          |                                                                       | extended |
Indexes:
    "fulltextsearch_documents_tsvector_idx" gin (to_tsvector('italian'::regconfig,     COALESCE(body, ''::character varying)::text))
    "id_idx" btree (id)
Triggers:
    body_tsvectorupdate BEFORE INSERT OR UPDATE ON fulltextsearch.documents FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('body_tsvector', 'pg_catalog.italian', 'body')
Has OIDs: no

I'm sure I'm missing something obvious....

Any clues?

.

.

.

=== UPDATE =======================================================================

Thanks to your suggestions, I came up with this (better) query:

SELECT
    ts_rank(body_tsvector, query),
    ts_headline('italian', body, query),
    title,
    location
FROM
    fulltextsearch.documents, to_tsquery('italian', 'torino') as query
WHERE
    to_tsvector('italian', coalesce(body,'')) @@ query

which is quite better, but always very slow (13+ seconds...).

I notice that commenting out the "ts_headline()" row the query is lightning-fast.

This is the EXPLAIN ANALYZE, which finally uses the index, but doesn't help me much...:

EXPLAIN ANALYZE SELECT
clock_timestamp() - statement_timestamp() as elapsed_time,
    ts_rank(body_tsvector, query),
    ts_headline('italian', body, query),
    title,
    location
FROM
    fulltextsearch.documents, to_tsquery('italian', 'torino') as query
WHERE
    to_tsvector('italian', coalesce(body,'')) @@ query

 Nested Loop  (cost=16.15..85.04 rows=19 width=605) (actual time=102.290..13392.161 rows=863 loops=1)
   ->  Function Scan on query  (cost=0.00..0.01 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
   ->  Bitmap Heap Scan on documents  (cost=16.15..84.65 rows=19 width=573) (actual time=0.381..4.236 rows=863 loops=1)
         Recheck Cond: (to_tsvector('italian'::regconfig, (COALESCE(body, ''::character varying))::text) @@ query.query)
         ->  Bitmap Index Scan on fulltextsearch_documents_tsvector_idx  (cost=0.00..16.15 rows=19 width=0) (actual time=0.312..0.312 rows=863 loops=1)
               Index Cond: (to_tsvector('italian'::regconfig, (COALESCE(body, ''::character varying))::text) @@ query.query)
 Total runtime: 13392.717 ms
Adopt answered 18/4, 2013 at 14:58 Comment(0)
B
5

You're missing two (reasonably obvious) things:

1 You've set 'italian' in your to_tsvector() but you aren't specifying it in to_tsquery()

Keep both consistent.

2 You've indexed COALESCE(body, ...) but that isn't what you're searching against.

The planner isn't magic - you can only use an index if that's what you're searching against.

Burgeon answered 18/4, 2013 at 16:12 Comment(3)
The planner is simple and conservative in index choice. It won't look up the default tsearch2 language to see whether that's italian and then treat the language-unspecified version the same as the language-specified-as-italian version, it needs both function calls to be the same. In fact, in this regard the planner can be a little dim - last time I checked there were quite a few simple variations on how you could write an expression - like redundant parentheses - that would cause the planner not to recognise that it matched an index.Fromm
@Richard Huxton Thanks for your answer. By "searching against" you mean the FROM clause? If so, could you please post a sample of a correct FROM clause for my situation? Sorry, I'm yet quite confused...Adopt
postgresql.org/docs/current/static/… "ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care..."Burgeon
A
2

At last, with the help of your answers and comments, and with some googling, I did solve by running ts_headline() (a very heavy function, I suppose) on a subset of the full result set (the results page I'm interested in):

    SELECT
        id,
        ts_headline('italian', body, to_tsquery('italian', 'torino')) as headline,
        rank,
        title,
        location
    FROM (
        SELECT
            id,
            body,
            title,
            location,
            ts_rank(body_tsvector, query) as rank
        FROM
            fulltextsearch.documents, to_tsquery('italian', 'torino') as query
        WHERE
            to_tsvector('italian', coalesce(body,'')) @@ query
        LIMIT 10
        OFFSET 0
    ) as s
Adopt answered 19/4, 2013 at 14:31 Comment(0)
B
0

I solved the problem by precomputing the ts_rank_cd and storing it in a table for popular terms(high occurrences) in the corpus. The search looks at this table to get the sorted doc rank for a query term. if not there (for less popular terms), it will default to creating the ts_rank_cd on the fly.

Please take a look at this post.

https://dba.stackexchange.com/a/149701

Brew answered 15/9, 2016 at 13:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.