Postgres full text search on array column, with index
Asked Answered
O

1

10

Using Postgres, I want to perform a full text search that includes an array column, using an index. Let's start with a hypothetical schema:

CREATE TABLE book (title TEXT, tags TEXT[]);
-- tags are lowercase a-z, dashes, and $

We want a query that searches title and tags for some text. A straightforward query with reasonable semantics, recommended by many SO answers, is:

SELECT * 
FROM book 
WHERE to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)
      @@ to_tsquery('simple', 'mysearchterm'); 

That works. The dashes and dollar signs in tags effectively no longer exist but that's ok for this application. However, we have millions of records and need an index:

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)));

Uh oh! This fails because array_to_string is not IMMUTABLE. There are SO answers that suggest wrapping array_to_string in an immutable function:

CREATE FUNCTION my_array_to_string(arr ANYARRAY, sep TEXT) 
RETURNS text LANGUAGE SQL IMMUTABLE
AS $$
    SELECT array_to_string(arr, sep);
$$;

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((to_tsvector('simple', my_array_to_string(tags, ' ')) || to_tsvector('simple', title)));

The index create works! But it never gets used. EXPLAIN ANALYZE on the above SELECT always results in a sequential scan. Postgres is apparently too smart for this trickery.

Aggregate  (cost=4348818.79..4348818.80 rows=1 width=8) (actual time=107489.124..107489.125 rows=1 loops=1)
  ->  Seq Scan on book  (cost=0.00..4348543.45 rows=110135 width=0) (actual time=50.689..107477.408 rows=24641 loops=1)
        Filter: ((to_tsvector('simple'::regconfig, my_array_to_string(tags, ' '::text)) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)"
        Rows Removed by Filter: 5354819
Planning Time: 0.144 ms
Execution Time: 107489.157 ms

I'm stumped. Is there some way of improving this?


New strategy: Use array_to_tsvector.

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((array_to_tsvector(tags) || to_tsvector('simple', title)));

SELECT * 
FROM book 
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
      @@ to_tsquery('simple', 'mysearchterm'); 

This works! The index is used! It's fast!

Bitmap Heap Scan on book  (cost=2005.04..76150.11 rows=26973 width=147) (actual time=5.281..425.128 rows=946 loops=1)
  Recheck Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''apple'''::tsquery)"
  Heap Blocks: exact=790
  ->  Bitmap Index Scan on book_fulltext_idx  (cost=0.00..1998.30 rows=26973 width=0) (actual time=4.468..4.468 rows=957 loops=1)
        Index Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)"
Planning Time: 0.113 ms
Execution Time: 425.371 ms

BUT the semantics of the search are problematic. array_to_tsvector interprets the array as raw lexemes. to_tsquery strips out $ and dashes. That means tags with dollar signs or dashes are unsearchable.

-- This can NEVER match the tag `$mysearchterm`
SELECT * 
FROM book 
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
      @@ to_tsquery('simple', '$mysearchterm'); 

Is there some way to make this do what I want? Seems like I want something like array_to_tsvector('simple', tags) but that function does not exist.


New strategy: Two indexes and OR

CREATE INDEX book_tags_fulltext_idx
    ON book using GIN (array_to_tsvector(tags));

CREATE INDEX book_title_fulltext_idx
    ON book using GIN (to_tsvector('simple', title));

SELECT * 
FROM book 
WHERE array_to_tsvector(tags) @@ '$mysearchterm' OR to_tsvector('simple', title)
      @@ to_tsquery('simple', '$mysearchterm'); 

This produces the right answer in a reasonable amount of time, but breaks the semantics of search. You can't search for titleword tagword. The WHERE clause requires both words in the tags, or both words in the title. No bueno.


Epilogue

It looks like I either need to figure out how to index the tags array concatenated with the title, or somehow munge the values that are passed to array_to_tsvector. I'm not quite sure how to do either of those things. Any ideas?

We're using PG11, but I can upgrade if it will make a difference.

Occasionally answered 4/3, 2022 at 17:53 Comment(4)
Your plan shows an aggregate node, but your query does not use aggregation. Please show the true plan and true query.Gd
Your definition of my_array_to_string should restrict itself to text[] input, not anyarray. That way it is less likely to get misused. Functions that lie about their immutability should be handled with care.Gd
You first formulation with my_array_to_string works for me. It is willing to use the index (for example, with enable_seqscan=off), but might choose not to if it thinks too many rows will match compared to the size of the table. So with there you have just a planning problem.Gd
I tried with enable_seqscan=off and couldn't get it to use the index. However, look at my answer - rearranging the query & index seems to have worked. I have no explanation for why.Occasionally
O
3

I found a solution that got the index working. I can't explain it.

This does NOT work:

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((to_tsvector('simple', immutable_array_to_string(tags, ' ')) || to_tsvector('simple', title)));

SELECT * 
FROM book 
WHERE to_tsvector('simple', immutable_array_to_string(tags, ' ')) || to_tsvector('simple', title)
      @@ to_tsquery('simple', 'mysearchterm'); 

This however DOES work:

CREATE INDEX book_fulltext_idx
    ON book using GIN (to_tsvector('simple', title || ' ' || immutable_array_to_string(tags, ' ')));

SELECT * 
FROM book 
WHERE to_tsvector('simple', title || ' ' || immutable_array_to_string(tags, ' '))
      @@ to_tsquery('simple', 'mysearchterm'); 
Aggregate  (cost=81092.50..81092.51 rows=1 width=8) (actual time=129.780..129.781 rows=1 loops=1)
  ->  Bitmap Heap Scan on book  (cost=296.49..81025.24 rows=26902 width=0) (actual time=1.990..129.519 rows=1576 loops=1)
        Recheck Cond: (to_tsvector('simple'::regconfig, ((title || ' '::text) || immutable_array_to_string(tags, ' '::text))) @@ '''mysearchterm'''::tsquery)
        Heap Blocks: exact=1302
        ->  Bitmap Index Scan on book_fulltext_idx  (cost=0.00..289.76 rows=26902 width=0) (actual time=1.605..1.606 rows=1576 loops=1)
              Index Cond: (to_tsvector('simple'::regconfig, ((title || ' '::text) || immutable_array_to_string(tags, ' '::text))) @@ '''mysearchterm'''::tsquery)
Planning Time: 0.509 ms
Execution Time: 129.906 ms

I cannot explain why the query planner thinks "take vector of concatenated strings" is different from "concatenate vectors of strings" but there you have it.

Occasionally answered 4/3, 2022 at 22:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.