Wildcard search in MySQL full-text search

Let's say we have the following query:

SELECT *
FROM companies
WHERE name LIKE '%nited'

It returns

name
united

How do I write a query using MySQL's full-text search that will provide similar results?

Cobnut answered 19/12, 2013 at 10:51

Unfortunately, you cannot do this with a MySQL full-text index. A search for '*nited states' cannot be answered from the index, because index entries are ordered by their leading characters, so a term whose beginning is unknown cannot be looked up. You can, however, search for 'United Sta*'.

-- the only wildcard that MySQL full-text search supports: a trailing * (prefix match)
WHERE MATCH(column) AGAINST ('United Sta*' IN BOOLEAN MODE)

MySQL's full-text search performs best when searching for whole words in sentences, and even that can perform poorly at times. Otherwise, I'd suggest an external full-text engine such as Solr or Sphinx. I think Sphinx allows both prefix and suffix wildcards; I'm not sure about the others.

You could fall back on MySQL's LIKE clause, but queries such as LIKE '%nited states' or LIKE '%nited Stat%' will also perform poorly, because an index cannot be used when the leading characters are unknown. 'United Sta%' and 'Unit%States' are fine, because the index can be used for the known leading characters.
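To see why the leading characters matter, here is a minimal Python sketch that models an index as a sorted list. It is only an analogy for how an ordered index behaves, not MySQL's actual implementation, and the names in the list are made up for illustration.

```python
from bisect import bisect_left

def prefix_search(sorted_names, prefix):
    """Return names starting with `prefix`, using binary search on a sorted list."""
    matches = []
    # Binary search jumps straight to the first candidate.
    for i in range(bisect_left(sorted_names, prefix), len(sorted_names)):
        if not sorted_names[i].startswith(prefix):
            break  # sorted order: once the prefix stops matching, we are done
        matches.append(sorted_names[i])
    return matches

def suffix_search(sorted_names, suffix):
    """Return names ending with `suffix`; the sort order does not help, so scan everything."""
    return [name for name in sorted_names if name.endswith(suffix)]

# Hypothetical data for illustration.
names = sorted(["ignited", "unite", "united", "united states", "universal", "zenith"])

print(prefix_search(names, "unit"))   # touches only the matching range
print(suffix_search(names, "nited"))  # has to look at every entry
```

The prefix lookup stops as soon as the sorted order guarantees no further matches, while the suffix lookup has nothing to narrow it down, which is roughly the difference between LIKE 'Unit%' and LIKE '%nited'.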

Another major caveat of MySQL's full-text indexing is the stop-word list and the minimum word length setting. For example, in a shared hosting environment you will typically be limited to words of at least 4 characters, so searching for 'Goo' to find 'Google' would fail. The stop-word list also excludes common words such as 'and', 'maybe' and 'outside'; in fact, there are 548 stop-words altogether! If you are not on shared hosting, these settings are relatively easy to change, but if you are, some of the defaults will get annoying.
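As a rough illustration of those two filters, the Python sketch below drops short words and stop-words before anything would reach the index. It is a simplified model, not MySQL's parser: the minimum length of 4 and the three stop-words are just the values mentioned above, and the function name is hypothetical.

```python
import re

# Assumed values for illustration; the real ones come from the server
# configuration and its full stop-word list.
MIN_WORD_LEN = 4
STOPWORDS = {"and", "maybe", "outside"}

def searchable_terms(text, min_len=MIN_WORD_LEN, stopwords=STOPWORDS):
    """Return the words that survive the length and stop-word filters."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stopwords]

print(searchable_terms("Google and Goo maybe outside Mountain View"))
print(searchable_terms("Goo"))  # nothing survives, so nothing can match
```

In this model the term 'Goo' is discarded outright, which is why such a search comes back empty.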

Carvalho answered 19/12, 2013 at 11:36

You can use MySQL's full-text index, but you must configure the parser to be the n-gram parser.

If your data is English (as opposed to Chinese or similar), you should also increase the ngram_token_size parameter (default 2) to the shortest search term you are willing to support. Otherwise, the search will be unacceptably slow.

You will also want to set innodb_ft_enable_stopword=0, otherwise an idiosyncrasy of how ngram stopword handling works will mean that many useful queries will return no results.

To see why you must also increase ngram_token_size, think of the index as a table with the following schema. MySQL then performs a series of joins on it to find the rows that match the search term:

CREATE TABLE fulltext_index
(
    docid int(11) NOT NULL,
    term  char(2) NOT NULL,
    PRIMARY KEY (docid, term),
    INDEX term_idx (term)
);
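To make the join idea concrete, here is a small in-memory Python model of that table. It is a sketch under simplifying assumptions (single-word documents, no stopwords, hypothetical function names), not how InnoDB actually stores or queries the index.

```python
from collections import defaultdict

def ngrams(text, n):
    """Return every run of n consecutive characters in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs, n):
    """Map each n-gram (the `term` column) to the set of docids containing it."""
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in ngrams(text.lower(), n):
            index[term].add(docid)
    return index

def candidates(index, query, n):
    """Intersect the docid sets of every n-gram in the query (the 'joins')."""
    terms = ngrams(query.lower(), n)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents for illustration.
docs = {1: "united", 2: "unity", 3: "ignite", 4: "edit"}
index = build_index(docs, 2)
print(sorted(candidates(index, "nited", 2)))
```

Each intersection plays the role of one join on docid. The bigger the docid set behind each term, the more work every step has to do.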

With the default n-gram size of 2, the parser breaks a word such as segments into se, eg, gm, me, en, nt, ts. Each of these n-grams occurs in a great many English words, so the index doesn't help much: the search ends up iterating over nearly everything anyway. Meanwhile, you can see how Chinese 随机的 would split into the much more useful 随机 and 机的. With the n-gram size set to 4, the segments are segm, egme, gmen, ment, ents. These larger segments are much more likely to be unique, so each one narrows down the search space significantly.
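The effect of the token size can be checked with a few lines of Python. This is only a sketch: the word list is made up, and counting substring matches stands in for counting index entries.

```python
def ngrams(word, n):
    """Return every run of n consecutive characters in `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def posting_sizes(words, query, n):
    """For each n-gram of `query`, count how many of `words` contain it."""
    return {g: sum(g in w for w in words) for g in ngrams(query, n)}

# Hypothetical vocabulary for illustration.
words = ["segments", "agents", "moment", "events", "seven", "tsunami", "legend"]

print(ngrams("segments", 2))
print(ngrams("segments", 4))
print(ngrams("随机的", 2))
print(posting_sizes(words, "segments", 2))  # short n-grams hit many words
print(posting_sizes(words, "segments", 4))  # longer n-grams hit far fewer
```

Even on this tiny list, a 2-character token like en shows up in most of the words, while the 4-character tokens are much more selective.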

Disabling stopwords is also necessary because the ngram parser excludes every n-gram that contains a stopword. For example, with an n-gram (4) parser, stopword will be parsed into stop, topw, opwo, pwor, and word:

  • stop will be excluded because it contains "to"
  • topw will be excluded because it contains "to"
  • opwo will be kept
  • pwor will be excluded because it contains "or"
  • word will be excluded because it contains "or"

Because these tokens are excluded from the index, a search for MATCH(name) AGAINST('stop' IN BOOLEAN MODE) would not return anything unless the stopwords mechanism is disabled before creating the index.
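The exclusion rule is easy to reproduce in Python. This sketch uses only the two stopwords named above ("to" and "or"); the server's real list has more entries, and the function names are hypothetical.

```python
def ngrams(word, n):
    """Return every run of n consecutive characters in `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def kept_ngrams(word, n, stopwords):
    """Drop every n-gram that contains any stopword as a substring."""
    return [g for g in ngrams(word, n) if not any(s in g for s in stopwords)]

# Only the two stopwords from the example; the real list is longer.
STOPWORDS = {"to", "or"}

print(ngrams("stopword", 4))                  # all five 4-grams
print(kept_ngrams("stopword", 4, STOPWORDS))  # ['opwo']
print(kept_ngrams("stop", 4, STOPWORDS))      # [] : nothing left to search for
```

With the filter on, the only token left for stopword is opwo, and the term stop produces no tokens at all.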


To answer your question,

  1. Set ngram_token_size to 3, 4, or whatever your minimum search term length is. It is a startup option, so set it in the server configuration and restart the server.
  2. Set innodb_ft_enable_stopword to 0 or OFF.
  3. Create the index with CREATE FULLTEXT INDEX companies_name_idx ON companies (name) WITH PARSER ngram;
  4. Run the query: SELECT * FROM companies WHERE MATCH(name) AGAINST('nited' IN BOOLEAN MODE);

This matches nited anywhere in the name, so it also returns rows such as nitedA; filter the results further if your application needs a strict suffix match.
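If you do that filtering in application code, it could look something like this Python sketch. The function name and the sample rows are hypothetical, and the comparison ignores case on the assumption that the column uses a case-insensitive collation.

```python
def ends_with_term(names, term):
    """Keep only names whose final characters equal `term`, ignoring case."""
    term = term.lower()
    return [name for name in names if name.lower().endswith(term)]

# Hypothetical rows, standing in for what the full-text query returned.
rows = ["united", "nitedA", "Ignited", "United States"]

print(ends_with_term(rows, "nited"))
```

The full-text index does the heavy lifting of narrowing the candidates, and the cheap suffix check then runs only on the rows that came back.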

Blinnie answered 29/12, 2022 at 23:15

Comment (1):
I'm using FULLTEXT search, and appending the wildcard * to the string makes no difference. It still only matches whole words. Could it be because of the n-gram parser? - Skeptical
