How can I manipulate MySQL fulltext search relevance to make one field more 'valuable' than another?

Suppose I have two columns, keywords and content. I have a fulltext index across both. I want a row with foo in the keywords to have more relevance than a row with foo in the content. What do I need to do to cause MySQL to weight the matches in keywords higher than those in content?

I'm using the "match against" syntax.

SOLUTION:

Was able to make this work in the following manner:

SELECT *, 
CASE when Keywords like '%watermelon%' then 1 else 0 END as keywordmatch, 
CASE when Content like '%watermelon%' then 1 else 0 END as contentmatch,
MATCH (Title, Keywords, Content) AGAINST ('watermelon') AS relevance 
FROM about_data  
WHERE MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE) 
HAVING relevance > 0  
ORDER by keywordmatch desc, contentmatch desc, relevance desc 
Enosis answered 13/2, 2009 at 20:26 Comment(0)
H
21

Actually, using a case statement to make a pair of flags might be a better solution:

select 
...
, case when keyword like concat('%', @input, '%') then 1 else 0 end as keywordmatch
, case when content like concat('%', @input, '%') then 1 else 0 end as contentmatch
-- or whatever check you use for the matching
from 
   ... 
   and here the rest of your usual matching query
   ... 
order by keywordmatch desc, contentmatch desc

Again, this is only if all keyword matches rank higher than all the content-only matches. I also made the assumption that a match in both keyword and content is the highest rank.

Haldane answered 13/2, 2009 at 21:51 Comment(2)
Using LIKE is not a great way to run searches. First, unless you split the search string, you'll only match terms in the exact order given, e.g. LIKE '%t-shirt red%' will not match 'Red t-shirt' in your database. Second, the query takes longer to execute, since LIKE with a leading wildcard does a full table scan. - Strickler
@Strickler LIKE causes a full table scan when it is used in the WHERE clause, not in the SELECT list. - Sext
C
101

Create three full text indexes

  • a) one on the keyword column
  • b) one on the content column
  • c) one on both the keyword and content columns
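
As a rough sketch, the three indexes could be created as below. The table name `articles` and the index names are placeholders, not taken from the question; the column names follow the query in this answer. The statements are issued one at a time on the assumption of an InnoDB table, which has historically refused to build more than one FULLTEXT index in a single statement.

```sql
-- Placeholder table and index names; adjust to your schema.
CREATE FULLTEXT INDEX ft_keyword         ON articles (keyword);           -- a) keyword only
CREATE FULLTEXT INDEX ft_content         ON articles (content);           -- b) content only
CREATE FULLTEXT INDEX ft_keyword_content ON articles (keyword, content);  -- c) both columns
```

Each MATCH (...) column list in a query then has to name exactly the columns of one of these indexes, which is why the combined index c) is defined separately.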

Then, your query:

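-- `articles` is a placeholder table name (TABLE itself is a reserved word in MySQL)
SELECT id, keyword, content,
  MATCH (keyword) AGAINST ('watermelon') AS rel1,  -- relevance in keyword only
  MATCH (content) AGAINST ('watermelon') AS rel2   -- relevance in content only
FROM articles
WHERE MATCH (keyword, content) AGAINST ('watermelon')  -- combined index decides which rows match
ORDER BY (rel1 * 1.5) + rel2 DESC;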

The point is that rel1 gives you the relevance of your query just in the keyword column (because you created the index only on that column). rel2 does the same, but for the content column. You can now add these two relevance scores together applying any weighting you like.

However, you aren't using either of these two indexes for the actual search. For that, you use your third index, which is on both columns.

The index on (keyword, content) controls your recall, that is, which rows are returned.

The two separate indexes (one on keyword only, one on content only) control your relevance. And you can apply your own weighting criteria here.

Note that you can use any number of different indexes, or vary the indexes and weightings at query time based on other factors. For example, you could search only on keyword if the query contains a stop word, or decrease the weighting bias for keywords if the query contains more than 3 words.

Each index uses disk space, so more indexes mean more disk usage and, in turn, a higher memory footprint for MySQL. Inserts will also take longer, as there are more indexes to update.

You should benchmark performance for your situation (being careful to turn off the MySQL query cache while benchmarking, or your results will be skewed). This isn't Google-grade efficient, but it is pretty easy, works "out of the box", and is almost certainly a lot better than using LIKE in your queries.
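
One way to keep the query cache out of a timing run, on MySQL versions that still have it (it was removed in MySQL 8.0), is the SQL_NO_CACHE modifier. The following is only a sketch; `articles` is a placeholder table name, not something from the question.

```sql
-- Bypass the query cache for this one statement (relevant on pre-8.0 servers).
SELECT SQL_NO_CACHE id,
  MATCH (keyword) AGAINST ('watermelon') AS rel1,
  MATCH (content) AGAINST ('watermelon') AS rel2
FROM articles
WHERE MATCH (keyword, content) AGAINST ('watermelon')
ORDER BY (rel1 * 1.5) + rel2 DESC;

-- Check which index the WHERE clause actually uses.
EXPLAIN SELECT id
FROM articles
WHERE MATCH (keyword, content) AGAINST ('watermelon');
```

Timing the first statement several times and reading the EXPLAIN output gives a first impression of whether the combined index is being used for the search.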

I find it works really well.

Cram answered 2/3, 2009 at 0:34 Comment(5)
I could not get this to work (perhaps because I had not added the third index), but changing the WHERE condition to rel1 > 0 OR rel2 > 0 solved my problem, so thanks. - Carol
@Cram Shouldn't the ORDER BY be ORDER BY (rel1*1.5)+(rel2) DESC, to get the highest score, and thus the most relevant rows, first? - Antebi
@Antebi Yes, it should be DESC, since higher relevance is a better match. - Deadly
@Cram I just wanted to say thanks. This exact query (adapted to our schema) has been chugging along for at least five years now on a community website with tens of thousands of news articles and hundreds of thousands of registered users (and many more unregistered visitors). It has always worked perfectly well for our needs, and we never had performance issues. - Homerus
Couldn't you avoid the duplicate fulltext index by using WHERE MATCH (keyword) AGAINST ('watermelon') OR MATCH (content) AGAINST ('watermelon')? - Godfather
E
7

Simpler version using only 2 fulltext indexes (credit to @mintywalker):

SELECT id, 
   MATCH (`content_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance1,  
   MATCH (`title_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance2
FROM search_table
HAVING (relevance1 + relevance2) > 0
ORDER BY (relevance1 * 1.5) + (relevance2) DESC
LIMIT 0, 1000;

This searches both fulltext-indexed columns for the keyword and returns each column's relevance as a separate column. Rows with no match (relevance1 and relevance2 both zero) are excluded, and the results are ordered with extra weight given to the content_ft column. No composite fulltext index is needed.
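
A variant that filters in WHERE instead of HAVING, along the lines discussed in the comments on this answer, might look like the sketch below. It reuses the table and column names from the query above. Whether the optimizer can use both fulltext indexes for an OR of two MATCH conditions is an open assumption here, so it is worth checking with EXPLAIN on your own data.

```sql
SELECT id,
   MATCH (`content_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance1,
   MATCH (`title_ft`)   AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance2
FROM search_table
-- Repeat the MATCH expressions here: WHERE cannot refer to the SELECT aliases.
WHERE MATCH (`content_ft`) AGAINST ('keyword*' IN BOOLEAN MODE)
   OR MATCH (`title_ft`)   AGAINST ('keyword*' IN BOOLEAN MODE)
ORDER BY (relevance1 * 1.5) + relevance2 DESC
LIMIT 0, 1000;
```

The weighting in ORDER BY is unchanged; only the point at which non-matching rows are discarded moves.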

Ekaterinburg answered 8/7, 2017 at 12:2 Comment(8)
By using HAVING instead of WHERE (with the composite index or something else), you end up doing a full table scan to get your result, so I don't believe this solution scales very well. To be more specific, in an extreme scenario: if you have a table with 10M rows and only 999 match (or n-1 of whatever limit you set), every row is returned by the query, most of them with 0s, so you not only have to load the entire table but also iterate through all 10M rows. - Ewing
@Ewing The HAVING clause operates only over the matched result set. - Ekaterinburg
Correct, but literally every record in the table is going to be matched by that query, because there is nothing to filter it. You're selecting values from the table, but without a WHERE you retrieve all the records, and then HAVING filters them. To see this, remove the HAVING clause from your query locally: all records are returned. Imagine that on a table with 10M records. Run an EXPLAIN, and it will probably say Using temporary; Using filesort. The WHERE in mintywalker's response allows the records to be filtered first on the server. - Ewing
@Ewing Yes, you are right: without a WHERE clause it scans the whole result set. The idea was to avoid complex fulltext indexing, which may cause a large overhead for write-intensive workloads. This can be fixed simply by adding a WHERE clause between FROM and HAVING, but then the query no longer looks so simple, and it duplicates the fulltext MATCH. The query above may work fine for small datasets, say up to 10k-100k records, depending on the case. - Ekaterinburg
Wouldn't using WHERE relevance1 > 0 OR relevance2 > 0 take care of the issue? - Godfather
@Godfather No, because WHERE works on existing columns, while HAVING applies to GROUP BY (aggregates). See geeksforgeeks.org/having-vs-where-clause-in-sql - Ekaterinburg
OK, so it's more about WHERE not applying to aliases than about GROUP BY, since there is no GROUP BY in the query. Then repeating the SELECT expressions in the WHERE works: WHERE MATCH (`content_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) > 0 OR MATCH (`title_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) > 0 - Godfather
HAVING operates on the filtered dataset after the MATCH() clauses are evaluated, that is, after the relevance score has been calculated for all rows. You may also rewrite the query with a subquery using WHERE, like here: #35830786. HAVING is not used only with GROUP BY; sorry, I was not precise. - Ekaterinburg
D
1

In Boolean mode, MySQL supports the > and < operators to change a word's contribution to the relevance value that is assigned to a row.

I wonder if something like this would work?

SELECT *, 
MATCH (Keywords) AGAINST ('>watermelon' IN BOOLEAN MODE) AS relStrong, 
MATCH (Title,Keywords,Content) AGAINST ('<watermelon' IN BOOLEAN MODE) AS relWeak 
FROM about_data  
WHERE MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE) 
ORDER by (relStrong+relWeak) desc
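
For reference, > and < adjust the weight of individual words within a single search expression, not the weight of columns. A minimal sketch, modelled on the pattern used in the MySQL manual; 'seedless' and 'pickled' are made-up search words, and the table and columns are the ones from the question:

```sql
-- Rows must contain 'watermelon' plus 'seedless' or 'pickled';
-- 'seedless' raises the relevance of a row, 'pickled' lowers it.
SELECT *,
  MATCH (Title, Keywords, Content)
    AGAINST ('+watermelon +(>seedless <pickled)' IN BOOLEAN MODE) AS relevance
FROM about_data
WHERE MATCH (Title, Keywords, Content)
    AGAINST ('+watermelon +(>seedless <pickled)' IN BOOLEAN MODE)
ORDER BY relevance DESC;
```

So these operators on their own do not say which column a word was found in; per-column weighting still needs a separate MATCH per column.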
Decaliter answered 10/8, 2009 at 9:13 Comment(0)
A
-1

Well, that depends on what exactly you mean by:

I want a row with foo in the keywords to have more relevance than a row with foo in the content.

If you mean that a row with foo in the keywords should come before any row with foo in the content, then I would run two separate queries, one on the keywords and then (possibly lazily, only if it's requested) another on the content.

Ariana answered 16/2, 2009 at 3:26 Comment(0)
M
-1

I did this a few years ago, but without the full text index. I don't have the code handy (former employer), but I remember the technique well.

In a nutshell, I selected a "weight" from each column. For example:

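select t.id,
       -- coalesce: a row missing from either subquery would otherwise make the sum NULL
       coalesce(a.keyword_relevance, 0) + coalesce(b.content_relevance, 0) as relevance
from table_name t
   left join
      (select id, 1 as keyword_relevance from table_name where keyword like '%foo%') a  -- or your own keyword match
   on t.id = a.id
   left join
      (select id, 0.75 as content_relevance from table_name where content like '%foo%') b  -- or your own content match
   on t.id = b.id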

Please forgive any shoddy SQL here; it's been a few years since I needed to write any, and I'm doing this off the top of my head...

Hope this helps!

J.Js

Monticule answered 17/2, 2009 at 15:28 Comment(0)
B
-1

I needed something similar and used the OP's solution, but I noticed that fulltext doesn't match partial words. So if 'watermelon' is in Keywords or Content as part of a word (like watermelonsalesmanager) it doesn't MATCH and is not included in the results because of the WHERE MATCH. So I fooled around a bit and tweaked the OP's query to this:

SELECT *, 
CASE WHEN Keywords LIKE '%watermelon%' THEN 1 ELSE 0 END AS keywordmatch, 
CASE WHEN Content LIKE '%watermelon%' THEN 1 ELSE 0 END AS contentmatch,
MATCH (Title, Keywords, Content) AGAINST ('watermelon') AS relevance 
FROM about_data  
WHERE (Keywords LIKE '%watermelon%' OR 
  Title LIKE '%watermelon%' OR 
  MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE)) 
HAVING (keywordmatch > 0 OR contentmatch > 0 OR relevance > 0)  
ORDER BY keywordmatch DESC, contentmatch DESC, relevance DESC

Hope this helps.

Bristle answered 1/2, 2011 at 12:6 Comment(0)
Z
-2

As far as I know, this isn't supported by MySQL fulltext search, but you can achieve the effect by repeating the words several times in the keyword field. Instead of storing the keywords as "foo bar", store "foo bar foo bar foo bar". That way foo and bar remain equally important within the keywords column, and since they appear several times they become more relevant to MySQL.

We use this on our site and it works.

Zia answered 13/2, 2009 at 20:34 Comment(0)
H
-4

If the metric is just that all the keyword matches are more "valuable" than all the content matches then you can just use a union with row counts. Something along these lines.

select *
from (
   select row_number() over(order by blahblah) as row, t.*
   from thetable t
   where keyword match

   union

   select row_number() over(order by blahblah) + @@rowcount + 1 as row, t.*
   from thetable t
   where content match
)
order by row

For anything more complicated than that, where you want to apply an actual weight to every row, I don't know how to help.

Haldane answered 13/2, 2009 at 20:46 Comment(4)
I tried this and ended up with syntax errors. I don't think I knew what to put in the order by blahblah spot. Suggestions? - Enosis
Sorry, it wasn't meant to be a copy & paste example. The order by in the over clause is the order in which the row numbers are applied, so it should be whatever you would normally order the results by. - Haldane
Now that I think about it, this one will duplicate the records which match both keyword and content. - Haldane
I am not able to find any way to make this work. In fact, I don't think MySQL supports row_number. - Enosis
