What are the rules to order the keywords in a MySQL boolean search?
Asked Answered
M

2

6

When I change the order of the keywords in a boolean search, I get the same result but very different performance results.

The profiling on MySQL 5.6.33 with a MyISAM table, ft_min_word_len=2 and description_index as a FULLTEXT index on title and description returns this:

# Query 1
SELECT id
FROM archive, topic
WHERE topic.type=0 AND archive.status=2
AND MATCH(title, description) AGAINST ('+house* +tz*' IN BOOLEAN MODE)
AND archive.topicId = topic.id
ORDER BY archive.featured DESC, archive.submissionDate DESC LIMIT 0,20

Result:

Total count: 12
Key_read_requests: 2384607
Creating sort index: 7.950430 sec (!)
Duration: 8.851252 sec

# Query 2
SELECT id
FROM archive, topic
WHERE topic.type=0 AND archive.status=2
AND MATCH(title, description) AGAINST ('+tz* +house*' IN BOOLEAN MODE)
AND archive.topicId = topic.id
ORDER BY archive.featured DESC, archive.submissionDate DESC LIMIT 0,20

Result:

Total count: 12
Key_read_requests: 415
Creating sort index: 0.003449
Duration: 0.004054 sec

Total records per keyword:

tz*: 135092
tz: 25596
house*: 12

Explain is the same for both queries:

id | select_type | Table   | Type     | Key               | Key len | Ref             | Rows | Extra
1  | SIMPLE      | archive | fulltext | description_index | 0       |                 | 1    | Using where; Using filesort
1  | SIMPLE      | topic   | eq_ref   | PRIMARY           | 3       | archive.topicId | 1    | Using where

Only Key_read_requests and Creating sort index are different between the 2 queries.

It seems that:

  1. the order of the keyword order is a critical performance factor
  2. the keywords are used in reverse order
  3. having the most discriminating keyword at the end improves the performance.

Questions:

  • What is the reason of this big performance difference?
  • What are the rules/best practices? (I could not find anything in the documentation of mysql).
Merilyn answered 21/1, 2017 at 20:59 Comment(9)
MyISAM is going away; see what you get with InnoDB's variant of FULLTEXT.Catfall
@RickJames MyISAM is still better in some situations (our case). We are waiting for mysql 8 which seems to have great innodb perfs. Until then, we need to fix and understand this issue. :) percona.com/blog/2016/10/11/mysql-8-0-end-myisamMerilyn
Can you please post output of EXPLAIN for both queries?Oestradiol
Do you have FULLTEXT index? If you do, what's its definition?Oestradiol
@Oestradiol Seems like the issue is more related to how the fulltext analyzes the string in boolean mode. I have updated the question. Please note that everything is the same (query plan, values, explain, etc) except the values of Key_read_requests and Creating sort index.Merilyn
Can you try same queries, but without ORDER BY and/or LIMIT? Create sort index happens when sorting required before outputting results when no existing index supports required sorting.Oestradiol
We profiled simplified version of the query till the most simplified one. The issue is always there: performance is directly linked to the order of keywords. Sometimes by a factor of 100x.Merilyn
Is performance always linked to "Creating sort index" step? Does this step happen even without ORDER BY and LIMIT clauses? Have you tries other pairs of search words and using other wildcards?Oestradiol
@stoleg Exact. "Creating sort index" is only linked to the order of the keywords even with a simplified query (without order by, limit and the other conditions).Merilyn
S
1

Edit after OP comment:
I'm not sure about the exact query plan when resolving this query.
Sometimes one operation is more expensive than another therefore doing the less expensive operation first can sort out many rows that then don't have to go through the more expensive operation which leads to a reduced running time.
(In your example one of the matching-operations could be more expensive than the other which increases and reduces running time by changing the order of the strings to match against).

Stamp answered 30/1, 2017 at 14:53 Comment(5)
I understand the cardinality issue. :) The real question is: Are the keywords always searched in reverse order (in boolean mode)? Because, if it is the case, I have to rebuild the query to sort the keywords from the less secregating to the most.Merilyn
Sorry, I made a terrible mistake in my question (fixed now). The 3 values were not the cardinality but the number of records matching the keyword. :(Merilyn
Ok, no problem. I rewrote my answer I also missunderstood a bit of what you wrote.Stamp
Actually, the query plan is always the same when using a full text search in boolean mode with + (AND): each keyword is analyzed in reverse order (I though it would take the most discriminating one first). In my question I put the full query, but even with a simplified one (only joints and a full text search in boolean mode), it is the same result.Merilyn
In your example both keywords must be present for a match to occur. If the faster search operation runs first and eliminates a record, the slower search operation will be skipped for that record. This wil save time and should be the reason for the difference in performance.Stamp
K
-1

Did you run both queries against a fresh started instance? Its possible you are just gaining perfomance plus due to filled caches on the second run.

Knotweed answered 28/1, 2017 at 11:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.