Postgres full text search rank by position
Asked Answered
T

5

7

I have a table of movies, and I want to search on the titles and return the closest match.

I thought full text search might be useful, but it doesn't seem to be able to order by position of the words, although postgres knows the position. Is this possible in postgres?

Here is my query:

SELECT collectibles.id, collectibles.title, ts_rank_cd(to_tsvector('english', collectibles.title), plainto_tsquery('old school')) AS score
FROM collectibles WHERE to_tsvector('english', collectibles.title) @@ plainto_tsquery('old school')
ORDER BY score DESC;

Here are some of the results: (this is the best formatting I can seem to figure out, sorry!)

id | title | score

 - 277568 | Wilson Meadows: Live At The 15th Old School & Blues Festival | 0.1
 - 3545 | 5 Film Collection: Will Ferrell: Campaign / Old School (Unrtated Version) / Blades Of Glory / Roxbury / Semi-Pro | 0.1
 - 10366 | Alice Cooper: Old School: 1964-1974 (DVD/CD Combo) | 0.1
 - 13004 | American Classics: Old School (3-Disc Set) | 0.1
 - 13005 | American Classics: Old School: Classic Chevrolets | 0.1
 - 13006 | American Classics: Old School: Classic Travel Trailers | 0.1
 - 13007 | American Classics: Old School: Kings Of Kustomizing | 0.1
 - 14592 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Back-To-Back) | 0.1
 - 14593 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Side-By-Side) | 0.1
 - 20242 | Audiovisualize: Mixed By Addictive TV: Snake Worship Island / Corp. Inc. / Old School Futures / These Melodies / Robot War / ... | 0.1
 - 192057 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) | 0.1
 - 192058 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Back-To-Back) | 0.1
 - 192059 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Side-By-Side) | 0.1
 - 192060 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Back-To-Back) | 0.1
 - 192061 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Side-By-Side) | 0.1
 - 192062 | Old School (Warner Brothers/ R-Rated Version) | 0.1
 - 192063 | Old School (Warner Brothers/ R-Rated Version/ Blu-ray) | 0.1
 - 192064 | Old School (Warner Brothers/ Unrated Version) | 0.1
 - 192065 | Old School (Warner Brothers/ Unrated Version/ Blu-ray) | 0.1
 - 192066 | Old School Comedy (4-Pack): Atoll K / Jack And The Beanstalk / The Flying Deuces / Africa Screams | 0.1
 - 192067 | Old School Hip Hop Dance #1: Beginner | 0.1
 - 192068 | Old School Hip Hop Greatest | 0.1
 - 192069 | Old School Hip Hop: Run DMC & Flava Flav (2-Disc) | 0.1
 - 192070 | Old School Hits Movie Marathon Collection (3-Disc) | 0.1
 - 192071 | Old School Returns | 0.1

The score for all of these is 0.1, but the position of the words in many of the titles are closer to the front of the string. Is there any way to rank those as higher? The length of the string or id aren't really good qualifiers of rank, unfortunately.

Trudi answered 26/2, 2016 at 4:51 Comment(0)
F
5

Here you need to use normalization for ts_rank(tsvector,tsquery,normalization factor) function. In the code snippet below I used normalization=1 (which divides the rank by 1 + the logarithm of the document length), but you can adjust it to what you really need. Here is the example:

WITH s(id,tsv) AS ( VALUES
  (1,to_tsvector('english','Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
  (2,to_tsvector('english','American Classics: Old School: Kings Of Kustomizing')),
  (3,to_tsvector('english','Old School Hip Hop Greatest')),
  (4,to_tsvector('english','Old School Returns'))
)
SELECT id,ts_rank(tsv,tsq,1) AS rank
FROM s,to_tsquery('english','old & school') tsq
ORDER BY rank DESC;

Result:

 id |   rank    
----+-----------
  4 | 0.0495516
  3 | 0.0383384
  2 | 0.0353013
  1 | 0.0312636
(4 rows)
Folliculin answered 26/2, 2016 at 5:53 Comment(2)
Normalization is better, but I was hoping to just rank by the placement of words. Consider a search for 'silicon valley': Pirates of silicon valley - Silicon Valley: The American Experience - Silicon Valley: The Complete 1st Season. Ideally, the title with Silicon Valley at the front would be first, but it is not the shortest string. If postgres can't do this, then I will likely use normalization and later switch to a different searching system.Trudi
@Trudi add "(5,to_tsvector('english','Some Old School'))" and it will get the highest rankShores
C
2

I managed to achieve this by splitting the parts, getting the first word and setting it a higher priority (A):

setweight(to_tsvector('english', split_part(coalesce("title", ''), ' ', 1) ), 'A') ||
setweight(to_tsvector('english', coalesce("title", '')), 'B')
Corene answered 8/8, 2021 at 13:37 Comment(0)
B
0

Pretty old question but: You can use ts_rank_cd() to take the distance between lexemes (keywords) into account. (i didn't find out how this is done exactly)

You can also pass bit 4 to the normalization integer (it's a bitmask) to divide the rank by the the mean harmonic distance between extents (with ts_rank_cd)

I didn't look to much into this but it's hopefully a starting point

Breuer answered 24/2, 2021 at 14:56 Comment(0)
P
0

I was facing the same problem too. Unfortunately, ts_rank does not have normalization option for the position.

But I was quite successful to "boost" ts_rank with a POSITION() function.

As ts_rank is just a number you can simply combine it with other expressions. The POSITION() function will return the position of the text in matter. I am dividing the original rank by the position value. + 0.001 is to avoid dividing by zero.

Here is my tip for applying the pattern that I successfully used in my Node app:

`
SELECT 
  collectibles.id, 
  collectibles.title, 
  ts_rank_cd(to_tsvector('english', collectibles.title), plainto_tsquery('old school')) 
   /
   (POSITION('old school' IN collectibles.title + 0.001) 
   AS score
FROM collectibles WHERE to_tsvector('english', collectibles.title) @@ plainto_tsquery('old school')
ORDER BY score DESC;
`
Pica answered 24/7 at 14:17 Comment(0)
S
-1

The documentation says

Also, * can be attached to a lexeme to specify prefix matching

with

to_tsquery can also accept single-quoted phrases

you can do this:

SELECT to_tsquery('''old school'':*');
      to_tsquery      
----------------------
 'old':* & 'school':*
(1 row)

so your case will look like:

SELECT 
  collectibles.id,
  collectibles.title,
  ts_rank_cd(
    to_tsvector('english', collectibles.title),
    to_tsquery('''old school'':*')
  ) AS score
FROM collectibles
WHERE to_tsvector('english', collectibles.title) @@ to_tsquery('''old school'':*')
ORDER BY score DESC;
Shores answered 20/12, 2016 at 9:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.