Using Postgres, I want to perform a full text search that includes an array column, using an index. Let's start with a hypothetical schema:
CREATE TABLE book (title TEXT, tags TEXT[]);
-- tags are lowercase a-z, dashes, and $
We want a query that searches title and tags for some text. A straightforward query with reasonable semantics, recommended by many SO answers, is:
SELECT *
FROM book
WHERE to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)
@@ to_tsquery('simple', 'mysearchterm');
That works. The dashes and dollar signs in tags effectively no longer exist but that's ok for this application. However, we have millions of records and need an index:
CREATE INDEX book_fulltext_idx
ON book using GIN
((to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)));
Uh oh! This fails because array_to_string
is not IMMUTABLE. There are SO answers that suggest wrapping array_to_string in an immutable function:
CREATE FUNCTION my_array_to_string(arr ANYARRAY, sep TEXT)
RETURNS text LANGUAGE SQL IMMUTABLE
AS $$
SELECT array_to_string(arr, sep);
$$;
CREATE INDEX book_fulltext_idx
ON book using GIN
((to_tsvector('simple', my_array_to_string(tags, ' ')) || to_tsvector('simple', title)));
The index create works! But it never gets used. EXPLAIN ANALYZE on the above SELECT always results in a sequential scan. Postgres is apparently too smart for this trickery.
Aggregate (cost=4348818.79..4348818.80 rows=1 width=8) (actual time=107489.124..107489.125 rows=1 loops=1)
-> Seq Scan on book (cost=0.00..4348543.45 rows=110135 width=0) (actual time=50.689..107477.408 rows=24641 loops=1)
Filter: ((to_tsvector('simple'::regconfig, my_array_to_string(tags, ' '::text)) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)"
Rows Removed by Filter: 5354819
Planning Time: 0.144 ms
Execution Time: 107489.157 ms
I'm stumped. Is there some way of improving this?
New strategy: Use array_to_tsvector
.
CREATE INDEX book_fulltext_idx
ON book using GIN
((array_to_tsvector(tags) || to_tsvector('simple', title)));
SELECT *
FROM book
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
@@ to_tsquery('simple', 'mysearchterm');
This works! The index is used! It's fast!
Bitmap Heap Scan on book (cost=2005.04..76150.11 rows=26973 width=147) (actual time=5.281..425.128 rows=946 loops=1)
Recheck Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''apple'''::tsquery)"
Heap Blocks: exact=790
-> Bitmap Index Scan on book_fulltext_idx (cost=0.00..1998.30 rows=26973 width=0) (actual time=4.468..4.468 rows=957 loops=1)
Index Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)"
Planning Time: 0.113 ms
Execution Time: 425.371 ms
BUT the semantics of the search are problematic. array_to_tsvector
interprets the array as raw lexemes. to_tsquery
strips out $ and dashes. That means tags with dollar signs or dashes are unsearchable.
-- This can NEVER match the tag `$mysearchterm`
SELECT *
FROM book
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
@@ to_tsquery('simple', '$mysearchterm');
Is there some way to make this do what I want? Seems like I want something like array_to_tsvector('simple', tags)
but that function does not exist.
New strategy: Two indexes and OR
CREATE INDEX book_tags_fulltext_idx
ON book using GIN (array_to_tsvector(tags));
CREATE INDEX book_title_fulltext_idx
ON book using GIN (to_tsvector('simple', title));
SELECT *
FROM book
WHERE array_to_tsvector(tags) @@ '$mysearchterm' OR to_tsvector('simple', title)
@@ to_tsquery('simple', '$mysearchterm');
This produces the right answer in a reasonable amount of time, but breaks the semantics of search. You can't search for titleword tagword
. The WHERE clause requires both words in the tags, or both words in the title. No bueno.
Epilogue
It looks like I either need to figure out how to index the tags array concatenated with the title, or somehow munge the values that are passed to array_to_tsvector. I'm not quite sure how to do either of those things. Any ideas?
We're using PG11, but I can upgrade if it will make a difference.