Possible reason to include a tsvector field?
I'm (re)confused about tsvector fields. It's very common to see examples where text is concatenated into a tsvector field, which is then indexed. If you look up PG 12 generated columns, you'll quickly find examples that populate a tsvector field this way.
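For concreteness, here's a minimal sketch of that generated-column pattern, assuming a hypothetical `posts` table with `title` and `body` columns (names are mine, not from any particular example):

```sql
-- PG 12+: a stored generated column holding the tsvector,
-- plus a GIN index on it.
CREATE TABLE posts (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title  text,
    body   text,
    search tsvector GENERATED ALWAYS AS (
        to_tsvector('english',
                    coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED
);

CREATE INDEX posts_search_idx ON posts USING gin (search);

-- Queries then reference the column directly:
-- SELECT * FROM posts WHERE search @@ to_tsquery('english', 'foo');
```

Note the two-argument form `to_tsvector('english', ...)` is used because it is immutable, which both generated columns and expression indexes require.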

Since the only purpose of this field is to support the index, why not build the index with an expression and skip the vector column? I asked about this in passing previously (can't find it), and remember hearing back "the column is pointless." Which makes sense to me. Recently, I've been following some comments on this article:

Fine Tuning Full Text Search with PostgreSQL 12

The piece is quite good, BTW. There's a comment like so:

One reason to materialize tsvector is not to evaluate expression at recheck, which could be quite expensive itself.

Check ‘Recheck Cond’ in explain analyze, the condition could be really checked, hence the condition (to_tsvector) will be calculated.

\d t1
          Table "public.t1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 id     | integer |           |          |
 t      | text    |           |          |
Indexes:
    "t1_to_tsvector_idx" gin (to_tsvector('english'::regconfig, t))

explain analyze select * from t1 where to_tsvector('english', t) @@ to_tsquery('english', 'foo');
                                  QUERY PLAN

Bitmap Heap Scan on t1  (cost=8.05..19.02 rows=6 width=36) (actual time=0.019..0.019 rows=0 loops=1)
  Recheck Cond: (to_tsvector('english'::regconfig, t) @@ '''foo'''::tsquery)
  ->  Bitmap Index Scan on t1_to_tsvector_idx  (cost=0.00..8.05 rows=6 width=0) (actual time=0.017..0.017 rows=0 loops=1)
        Index Cond: (to_tsvector('english'::regconfig, t) @@ '''foo'''::tsquery)
Planning Time: 0.076 ms
Execution Time: 0.051 ms
(6 rows)
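If I follow the commenter, the contrast is between re-evaluating the expression at recheck time versus comparing against an already-materialized vector. A sketch of the two variants, reusing the quoted `t1` table (my own naming, not from the article):

```sql
-- Variant A: expression index only. During a bitmap heap scan,
-- the Recheck Cond may re-run to_tsvector() on each candidate row.
CREATE INDEX t1_expr_idx ON t1
    USING gin (to_tsvector('english', t));

-- Variant B (PG 12+): stored generated column. The recheck then
-- compares against the stored vector instead of re-parsing the text.
ALTER TABLE t1 ADD COLUMN tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', t)) STORED;
CREATE INDEX t1_tsv_idx ON t1 USING gin (tsv);

-- Query for variant B references the column:
-- SELECT * FROM t1 WHERE tsv @@ to_tsquery('english', 'foo');
```

The trade-off, as I understand it, is storage (the tsvector column takes heap space) versus CPU at query time (no expression evaluation during recheck).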

And I'm confused. Is there a good case for storing the vector in the row? I'm not fluent enough in this part of Postgres to build a satisfactory test. I'm still researching FTS and fuzzy comparison options in Postgres, but will be needing to build something substantial in the next quarter, which is why I'm keen to get the details right in advance. 35M rows now, increasing rapidly.

Nepheline answered 1/11, 2019 at 20:55 Comment(7)
"I asked about this in passing previous (can't find it)" You asked over at dba, dba.stackexchange.com/questions/251307/…Bokbokhara
Thanks for spotting it, the answer now makes more sense to me. I ticked it as answered there, much appreciated. I still need to check out phraseto_tsquery and study the results. I've only tested with to_tsquery, which may not show the behavior you describe?Nepheline
I read your comment on Rob Conery's blog page. The main point of using a ts_vector expression index is that you don't pay for both the ts_vector column and the gin index on it -- just the gin index directly -- so you will most likely use less space. I have used the FTS capabilities in Postgres quite a bit, trigram, soundex, etc, too, and while it does work, we eventually decided to move to Elastic or Solr on top of Lucene, which is designed for one purpose only -- indexing and searching on documents, structured or not. Don't get me wrong, I love Postgres, and Postgis (I'm a GIS dev).Abey
But, ultimately, we found that once the documents got to a certain size, and 35M rows is defo big enough to notice this, things can get quite slow, and it is quite painful having to explicitly combine different columns and decide what to index. In Lucene you get all this for free, and it is much easier to weight results from different parts of the doc, etc. You can get the best of all worlds, and use Elastic on top of data stored in Postgresql. It may well be that your data are clean enough, and your queries will always be on the same columns, in which case, I will shut up. Just my $0.02.Abey
Thanks for the perspective, that's all pretty interesting. Of course, You Are a Monster for not using PG for everything ;-) In our case, we're not indexing documents, it's more for searching on rows with text fields. But I'm open to another tool, if we can integrate it. We're on RDS, so I'll have to check how that works there. I'm only at the early stages of sorting out our needs and a strategy to match, so it's the perfect time to explore options.Nepheline
Yes, as I said, if your queries are well defined, and data well structured, PG will do more than fine. And, of course, it scales to billions of rows, if you use it properly. Apologies for being a traitor to the one true RDBMS :-)Abey
;-) I'm revisiting our DevOps/logging instrumentation, and am feeling more open-minded than usual about Elasticsearch. So, I've got multiple reasons to check it out. It's always annoyed me that every logging platform seems to need to invent its own query language....but Lucene looks like it will stick around. So, it's likely worth the effort.Nepheline

As you asked for possible reasons to include a tsvector field, I think one clear case is when the tsvector has to include text from columns of other tables.

Say you have a table of blog articles and you want to execute a full-text search over title, content, and related tag names, but tags live in a separate table. Unfortunately, you can't use a generated column or an expression index for that, since both may only reference the current row. Instead, you have to maintain the tsvector yourself (with a trigger or similar), and then an index on that column stays current automatically.
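A minimal sketch of that trigger approach, assuming hypothetical `articles`, `tags`, and `article_tags` tables with a `tsv tsvector` column on `articles` (all names are illustrative):

```sql
-- Rebuild the article's tsvector from its own text plus
-- the names of its associated tags.
CREATE FUNCTION articles_tsv_refresh() RETURNS trigger AS $$
BEGIN
    NEW.tsv :=
        to_tsvector('english',
                    coalesce(NEW.title, '') || ' ' || coalesce(NEW.content, ''))
        || coalesce(
               (SELECT to_tsvector('english', string_agg(t.name, ' '))
                FROM article_tags at
                JOIN tags t ON t.id = at.tag_id
                WHERE at.article_id = NEW.id),
               ''::tsvector);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_tsv_trg
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION articles_tsv_refresh();

CREATE INDEX articles_tsv_idx ON articles USING gin (tsv);
```

One caveat: this only fires when the article row itself changes; adding or renaming a tag would need a companion trigger on `article_tags`/`tags` that touches the affected `articles` rows.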

I hope that gives you a reasonable reason to use a tsvector field.

Postulant answered 11/11, 2019 at 9:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.