Let's say we have the following query:
SELECT *
FROM companies
WHERE name LIKE '%nited'
It returns
name |
---|
united |
How do I write a query using MySQL's full-text search that will provide similar results?
Unfortunately, you cannot do this with a MySQL full-text index. You cannot look up '*nited states' directly from the index, because the leading characters are the most significant part of an index key. However, you can search for 'United Sta*'.
-- the only possible wildcard full-text search in MySQL: a trailing wildcard
WHERE MATCH(column) AGAINST ('United Sta*' IN BOOLEAN MODE)
MySQL's full-text search performs best when searching for whole words in sentences, and even that can suck at times. Otherwise, I'd suggest using an external full-text engine like Solr or Sphinx. I think Sphinx allows prefix and suffix wildcards; I'm not sure about the others.
You could go back to MySQL's LIKE clause, but again, queries like LIKE '%nited states' or LIKE '%nited Stat%' will also perform poorly, because a leading wildcard means the index cannot be used on the first few characters. 'United Sta%' and 'Unit%States' are okay, as the index can be used against the first bunch of known characters.
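To make the point about leading characters concrete, here is a rough sketch in Python that treats a sorted list as a stand-in for an ordinary index. It is only a toy model, not how MySQL is implemented: the function names and sample data are made up for illustration, and the comparison is case-sensitive, unlike MySQL's default collations.

```python
from bisect import bisect_left


def prefix_lookup(sorted_names, prefix):
    """Range scan: jump to the first key >= prefix, then read until the prefix stops matching."""
    matches = []
    start = bisect_left(sorted_names, prefix)
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(name)
    return matches


def suffix_lookup(sorted_names, suffix):
    """Full scan: sort order says nothing about how a key ends, so every key is checked."""
    return [name for name in sorted_names if name.endswith(suffix)]


# Hypothetical sample data, kept sorted the way index keys would be.
names = sorted(["Unilever", "United Airlines", "United States", "Unity", "Zunited States"])

print(prefix_lookup(names, "United Sta"))    # ['United States']
print(suffix_lookup(names, "nited States"))  # ['United States', 'Zunited States']
```

The prefix lookup touches only the neighbouring keys, while the suffix lookup has to visit every entry, which is roughly why LIKE 'United Sta%' can be fast and LIKE '%nited states' cannot.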
Another quite major caveat with MySQL's full-text indexing is the stop-word list and the minimum word length setting. For example, on a shared hosting environment, you will be limited to words of at least 4 characters, so searching 'Goo' to get 'Google' would fail. The stop-word list also excludes common words like 'and', 'maybe' and 'outside'; in fact, there are 548 stop-words altogether! If you are not on shared hosting, these settings are relatively easy to modify, but if you are, some of the default settings will get annoying.
You can use MySQL's full-text index, but you must configure the parser to be the n-gram parser.
If your data is English (as opposed to Chinese or similar), you ought to also increase the ngram_token_size parameter to the minimum search term length you are willing to support. Otherwise, the search will be unacceptably slow.
You will also want to set innodb_ft_enable_stopword=0; otherwise, an idiosyncrasy of how ngram stopword handling works will mean that many useful queries return no results.
To explain why you must also increase ngram_token_size, you may think of this index as the following schema. MySQL then does a series of joins to find the results which match the search term:
CREATE TABLE fulltext_index
(
docid int(11) NOT NULL,
term char(2) NOT NULL,
PRIMARY KEY (docid, term),
INDEX term_idx (term)
);
The n-gram (2) parser breaks each word in your query into two-character segments. The word segments, for example, becomes se, eg, gm, me, en, nt, ts. Each of these n-grams occurs in a great many English words, so the index doesn't help much: the search ends up iterating over nearly everything anyway. Meanwhile, you can see how Chinese 随机的 would split into the much more useful 随机 and 机的. With the n-gram size set to 4, the segments are segm, egme, gmen, ment, ents. These larger segments are much more likely to be unique, so each segment narrows down the search space significantly.
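The splitting described above can be reproduced with a few lines of Python. This is only a sketch of the sliding-window idea, not MySQL's actual tokenizer, and the ngrams helper is a name made up for illustration.

```python
def ngrams(word, n):
    """Return every run of n consecutive characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(ngrams("segments", 2))  # ['se', 'eg', 'gm', 'me', 'en', 'nt', 'ts']
print(ngrams("segments", 4))  # ['segm', 'egme', 'gmen', 'ment', 'ents']
print(ngrams("随机的", 2))      # ['随机', '机的']
```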
Disabling stopwords is also necessary because the ngram parser excludes all n-grams that contain any of the stopwords. For example, with an n-gram (4) parser, stopword will be parsed into stop, topw, opwo, pwor, and word:
- stop will be excluded because it contains "to"
- topw will be excluded because it contains "to"
- opwo will be kept
- pwor will be excluded because it contains "or"
- word will be excluded because it contains "or"
Because these tokens are excluded from the index, a search for MATCH(name) AGAINST('stop' IN BOOLEAN MODE) would not return anything unless the stopwords mechanism is disabled before creating the index.
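The exclusion rule can be modelled in Python as well. This is a sketch under assumptions: the STOPWORDS set holds only the two stopwords mentioned in this answer rather than the full default list, and both helper names are invented for the example.

```python
# Assumption: just the two stopwords discussed here, not MySQL's full default list.
STOPWORDS = {"to", "or"}


def ngrams(word, n):
    """Return every run of n consecutive characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def indexed_ngrams(word, n, stopwords=STOPWORDS):
    """Keep only the n-grams that do not contain any stopword as a substring."""
    return [gram for gram in ngrams(word, n)
            if not any(stop in gram for stop in stopwords)]


print(indexed_ngrams("stopword", 4))                   # ['opwo']
print(indexed_ngrams("stopword", 4, stopwords=set()))  # ['stop', 'topw', 'opwo', 'pwor', 'word']
```

With the stopword set emptied, which is the rough equivalent of disabling the mechanism, all five n-grams survive, so a search for stop has something to match.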
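To answer your question:
1. Set ngram_token_size to 3, 4, or whatever your minimum search term length is. This is a server startup option, so it belongs in the server configuration rather than in a SET statement.
2. Set innodb_ft_enable_stopword to 0 or OFF.
3. Create the index:
CREATE FULLTEXT INDEX companies_name_idx ON companies (name) WITH PARSER ngram;
4. Run the query:
SELECT * FROM companies WHERE MATCH(name) AGAINST('nited' IN BOOLEAN MODE);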
This will also return results such as nitedA, where the term is not at the end of the name, so you might want to filter the results further if your application requires it.
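If that extra filtering is done in application code, it could look something like the Python sketch below. The function name and sample rows are hypothetical, and the case-insensitive comparison is an assumption meant to mimic how LIKE '%nited' behaves under MySQL's default collations.

```python
def ends_with_term(names, term):
    """Keep only the names that end with term, ignoring case."""
    folded = term.casefold()
    return [name for name in names if name.casefold().endswith(folded)]


# Hypothetical rows as they might come back from the MATCH ... AGAINST query.
candidates = ["united", "nitedA", "United", "reunited front"]

print(ends_with_term(candidates, "nited"))  # ['united', 'United']
```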