Does PostgreSQL use tf-idf?

4

8

I would like to know whether full text search in PostgreSQL 9.3 with a GIN/GiST index uses tf-idf (term frequency-inverse document frequency).

In particular, my columns of phrases contain some words that are very common and others that are nearly unique (e.g., names). I want to index these columns so that a match on a rare word is weighted higher than a match on a common word.

Hokeypokey answered 18/8, 2013 at 6:44 Comment(0)
4

No. The ts_rank function has no native way to rank results by the global (corpus) frequency of a term. The ranking algorithm does, however, take the term's frequency within the document into account:

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html

So if I search for "dog|chihuahua", the following two documents would have the same rank, even though the word "chihuahua" is rarer in the corpus:

"I want a dog"
"I want a chihuahua"

However, the following document would rank higher than the previous two, because it contains the stemmed token "dog" twice:

"dog lovers have an average of 1.5 dogs"

In short: higher term frequency within the document results in a higher rank, but a lower term frequency in the corpus has no impact.
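
To make the difference concrete, here is a minimal Python sketch that scores the three example documents twice: once by raw term frequency only, and once with an added IDF factor. This is not PostgreSQL code and not how ts_rank is implemented; the tokenizer, the crude plural stripping that stands in for stemming, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

DOCS = [
    "I want a dog",
    "I want a chihuahua",
    "dog lovers have an average of 1.5 dogs",
]
QUERY = ["dog", "chihuahua"]


def tokenize(text):
    """Lowercase, keep alphabetic runs, and strip a trailing 's'.

    The plural stripping is a crude stand-in for a real stemmer.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def tf_score(tokens, query):
    """Sum of in-document counts of the query terms (no corpus statistics)."""
    counts = Counter(tokens)
    return sum(counts[term] for term in query)


def idf(term, corpus):
    """log(N / df): rarer terms across the corpus get a larger weight."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def tfidf_score(tokens, query, corpus):
    """In-document count of each query term, weighted by its corpus rarity."""
    counts = Counter(tokens)
    return sum(counts[term] * idf(term, corpus) for term in query)


corpus = [tokenize(doc) for doc in DOCS]
tf_scores = [tf_score(tokens, QUERY) for tokens in corpus]
tfidf_scores = [tfidf_score(tokens, QUERY, corpus) for tokens in corpus]

for doc, tf, tfidf in zip(DOCS, tf_scores, tfidf_scores):
    print(f"{tf:>2}  {tfidf:.3f}  {doc}")
```

In this sketch the frequency-only score ties the first two documents and puts the third on top, which mirrors the behaviour described above. Once the IDF factor is added, the "chihuahua" document moves ahead of both "dog" documents, because "chihuahua" appears in fewer documents of this tiny corpus.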

One caveat: text search does ignore stop words, so you will not match on extremely high-frequency words like "the", "a", "of", "for", etc. (assuming you have set your language correctly).

Bossism answered 18/7, 2014 at 17:40 Comment(0)
4

No, Postgres does not use TF-IDF as a similarity measure among documents.

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.

There is an extension from the creators of the text search, called smlar, that lets you calculate the similarity between arrays using TF-IDF. It also lets you turn tsvectors into arrays, and it supports fast indexing.
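
As a rough illustration of what "TF-IDF similarity between arrays" means, the following Python sketch weights each token by its rarity across a small set of arrays and compares two arrays by cosine similarity. It does not use or reproduce smlar's API or implementation; the sample arrays and every function name here are hypothetical.

```python
import math
from collections import Counter


def idf_table(arrays):
    """Map each token to log(N / document frequency) over the given arrays."""
    n = len(arrays)
    df = Counter(tok for arr in arrays for tok in set(arr))
    return {tok: math.log(n / count) for tok, count in df.items()}


def tfidf_vector(arr, idf):
    """Sparse vector: token -> (count in this array) * (corpus-wide IDF)."""
    counts = Counter(arr)
    return {tok: c * idf.get(tok, 0.0) for tok, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token arrays, e.g. what you might get from tsvectors.
ARRAYS = [
    ["want", "dog"],
    ["want", "cat"],
    ["want", "chihuahua"],
    ["love", "chihuahua"],
]

idf = idf_table(ARRAYS)
vectors = [tfidf_vector(arr, idf) for arr in ARRAYS]

# Both pairs share exactly one of their two tokens.
common_overlap = cosine(vectors[0], vectors[1])  # share the common "want"
rare_overlap = cosine(vectors[2], vectors[3])    # share the rarer "chihuahua"
print(common_overlap, rare_overlap)
```

Both pairs overlap in one token out of two, so a plain overlap count would treat them the same. With the IDF weighting, the pair that shares the rarer token comes out as more similar, which is the behaviour the question asks for.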

Median answered 1/8, 2014 at 1:19 Comment(0)
2

It does if you use tsvector to store the TF, GIN to store the IDF, and tsquery to query the data.

I found this article on Efficiently searching text using postgres helpful to set it up.

Isobelisocheim answered 22/12, 2021 at 22:46 Comment(1)
The shared article doesn't seem to use IDF in the ranking, as far as I could see. - Hemoglobin
-1

Mostly. The details are described at http://www.postgresql.org/docs/9.1/static/textsearch-controls.html

The basic problem is that the term frequency is not really something based on the corpus you are indexing but rather set in the dictionary. So it looks to me like, as long as you properly select a language, you should be ok.

Crotch answered 10/11, 2013 at 15:34 Comment(1)
tf-idf refers to corpus frequency, not document/dictionary frequency. - Bossism
