SQL Server Full-Text Search for exact match with fallback
Asked Answered
P

4

12

First off there seems to be no way to get an exact match using a full-text search. This seems to be a highly discussed issue when using the full-text search method and there are lots of different solutions to achieve the desired result, however most seem very inefficient. Being I'm forced to use full-text search due to the volume of my database I recently had to implement one of these solutions to get more accurate results.

I could not use the ranking results from the full-text search because of how it works. For instance if you searched for a movie called Toy Story and there was also a movie called The Story Behind Toy Story that would come up instead of the exact match because it found the word Story twice and Toy.

I do track my own rankings which I call "Popularity" each time a user access a record the number goes up. I use this datapoint to weight my results to help determine what the user might be looking for.

I also have the issue where sometimes need to fall back to a LIKE search and not return an exact match. I.e. searching Goonies should return The Goonies (most popular result)

So here is an example of my current stored procedure for achieving this:

DECLARE @Title varchar(255)
SET @Title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @Title2 varchar(255)
SET @Title2 = REPLACE(@title, '"', '')

--get top 100 results using full-text search and sort them by popularity
SELECT TOP(100) id, title, popularity As Weight into #TempTable FROM movies WHERE CONTAINS(title, @Title) ORDER BY [Weight] DESC

--check if exact match can be found
IF EXISTS(select * from #TempTable where Title = @title2)
--return exact match
SELECT TOP(1) * from #TempTable where Title = @title2
ELSE
--no exact match found, try using like with wildcards
SELECT TOP(1) * from #TempTable where Title like '%' + @title2 + '%'
DROP TABLE #TEMPTABLE

This stored procedure is executed about 5,000 times a minute, and crazy enough it's not bringing my server to it's knees. But I really want to know if there was a more efficient approach to this? Thanks.

Pains answered 21/4, 2013 at 11:13 Comment(0)
P
6

You should use full text search CONTAINSTABLE to find the top 100 (possibly 200) candidate results and then order the results you found using your own criteria.

It sounds like you'd like to ORDER BY

  1. exact match of the phrase (=)
  2. the fully matched phrase (LIKE)
  3. higher value for the Popularity column
  4. the Rank from the CONTAINSTABLE

But you can toy around with the exact order you prefer.

In SQL that looks something like:

DECLARE @title varchar(255)
SET @title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @title2 varchar(255)
SET @title2 = REPLACE(@title, '"', '')

SELECT
    m.ID,
    m.title,
    m.Popularity,
    k.Rank
FROM Movies m
INNER JOIN CONTAINSTABLE(Movies, title, @title, 100) as [k]
    ON m.ID = k.[Key]
ORDER BY 
  CASE WHEN m.title = @title2 THEN 0 ELSE 1 END,
  CASE WHEN m.title LIKE @title2 THEN 0 ELSE 1 END,
  m.popularity desc,
  k.rank

See SQLFiddle

Praise answered 17/2, 2014 at 19:12 Comment(5)
Could you explain the ORDER BY CASE section, because somehow it is working but it should be throwing a "The ORDER BY position number 0 is out of range of the number of items in the select list" in certain cases which has me very confused. ThanksPains
@Pains The ORDER BY expressions do not refer to the index of the result. But the results are listed in the order of the value of the expression. So for the row where the title is an exact match, the first expression (CASE WHEN m.title = @title2 THEN 0 ELSE 1 END) evaluates to 0. This row is listed before all other rows for which the title is not an exact match. For for those rows the first expression evaluates to 1.Praise
Thanks for the response, I'm going to give this a shot as I'm trying to optimize all my heavy stored procedures (25+ million requests a day) I have to see if your inner join approach will use less resources than my select 100 into a #temp table with 2 selects run against that. Your method sure does look nicer to read however.Pains
@Pains You can do the same ORDER BY trick on the temp table too, if the join's performance is disappointing. Let us know what works best!Praise
I just added that ORDER BY trick to the temp table approach: sqlfiddle.com/#!6/939c8/40 I almost forgot why I was forced to go this route with the temp table and it's due to the exact match/LIKE wildcard search, which cannot be performed on the main table as it has over 2 million records, but runs fine against a top 100 results from the full-text search. Thanks again for your help.Pains
C
2

This will give you the movies that contain the exact phrase "Toy Story", ordered by their popularity.

SELECT
    m.[ID],
    m.[Popularity],
    k.[Rank]
FROM [dbo].[Movies] m
INNER JOIN CONTAINSTABLE([dbo].[Movies], [Title], N'"Toy Story"') as [k]
    ON m.[ID] = k.[Key]
ORDER BY m.[Popularity]

Note the above would also give you "The Goonies Return" if you searched "The Goonies".

Carmelcarmela answered 15/5, 2013 at 1:21 Comment(2)
Hey J Lo, that method will not work. I need something that will default to an exact match or use the closest result if not found. using the CONTAINSTABLE approach has the same issue mentioned above, for instance if you search "Toy Story 2" it would return "Toy Story" as the top result. This could also be a limitation of full-text not indexing numbers or single characters properly.Pains
I did recently disable all Stoplist/Noise Words so maybe the full text rankings will be more accurate? I will retest this and let you know.Pains
F
1

If got the feeling you don't really like the fuzzy part of the full text search but you do like the performance part.

Maybe is this a path: if you insist on getting the EXACT match before a weighted match you could try to hash the value. For example 'Toy Story' -> bring to lowercase -> toy story -> Hash into 4de2gs5sa (with whatever hash you like) and perform a search on the hash.

Footboy answered 12/2, 2014 at 7:47 Comment(2)
I have to use full text searching for grabbing the top 100 results (there is over 950k records and 3-5k requests a minute) from there I can use a less optimized method for searching those 100. Converting the titles to HashBytes and performing an exact search doesn't seem to take away any steps in my current process?Pains
Hard to say from this end, you need a top 100 result - beside the performance issues - do you get away without the full text search? If so... hashing may be a good road. 950K records and 100 request per second should be possible with proper indexes (and far more requests per second). Keep the index as small as possible (to make it fit into memory very easy) and stress test that solution. Build a random set of queries and measure the whole thing before choosing a final path.Footboy
D
0

In Oracle I've used UTL_MATCH for similar purposes. (http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm)

Even though using the Jaro Winkler algorithm, for instance, might take awhile if you compare the title column from table 1 and table 2, you can improve performance if you partially join the 2 tables. I have in some cases compared person names on table 1 with table 2 using Jaro Winkler, but limited results not just above a certain Jaro Winkler threshold, but also to names between the 2 tables where the first letter is the same. For instance I would compare Albert with Aden, Alfonzo, and Alberto, using Jaro Winkler, but not Albert and Frank (limiting the number of situations where the algorithm needs to be used).

Jaro Winkler may actually be suitable for movie titles as well. Although you are using SQL server (can't use the utl_match package) it looks like there is a free library called "SimMetrics" which has the Jaro Winkler algorithm among other string comparison metrics. You can find detail on that and instructions here: http://anastasiosyal.com/POST/2009/01/11/18.ASPX?#simmetrics

Dao answered 17/2, 2014 at 18:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.