mysql - fulltext index - what is natural language mode

I have a question regarding this article: http://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html.

Here I found queries like

SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);

What I don't understand is: what exactly is natural language mode? I can't find an exact definition anywhere.

Can anyone provide a definition? How does it work?

Sharasharai answered 16/5, 2013 at 14:53 Comment(0)

MySQL's natural language full-text search matches a search string against a corpus and returns the most relevant rows first. So assume the search text is "I love pie" and the table holds three documents, d1, d2, and d3 (the rows of articles in your case). Documents d1 and d2 are about sports and religion respectively, and d3 is about food. Your query,

SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE);

will return d3 first, because it matches the search text best, followed by d2 and d1 in order of their relevance scores. Rows with a relevance of zero are not returned at all.

The underlying algorithm MySQL uses is probably a variant of tf-idf, where tf stands for term frequency and idf for inverse document frequency. tf is what the name says: the number of times a word w occurs in a given document. idf is based on how many documents the word occurs in; the more documents contain it, the lower its idf, so words that occur in many documents contribute little to deciding which document is most representative. The product tf*idf produces a score, and the higher the score, the better the word represents the document. So 'pie' occurs only in document d3 and gets a high idf there, which gives d3 a high score. 'the', on the other hand, has a high tf but a low idf, which evens out the tf and gives a low score.
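To make the tf*idf idea concrete, here is a minimal Python sketch of the textbook calculation. It is only an illustration under stated assumptions: the function name tf_idf_scores and the three sample documents are made up for this example, tokenization is a plain whitespace split, and MySQL's actual ranking formula differs in its details.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a textbook tf*idf sum.

    Hypothetical helper for illustration; not MySQL's ranking formula.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for words in tokenized:
        counts = Counter(words)  # tf: occurrences of each word in this document
        score = 0.0
        for term in query.lower().split():
            # df: number of documents that contain the term at least once
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term appears nowhere, so it cannot contribute
            idf = math.log10(n_docs / df)  # common in every document => idf 0
            score += counts[term] * idf
        scores.append(score)
    return scores


# Made-up sample documents standing in for d1, d2, d3.
docs = [
    "the team won the match",       # d1: sports
    "the church opened the doors",  # d2: religion
    "the baker sold the pie",       # d3: food
]
scores = tf_idf_scores("the pie", docs)
```

Here 'the' occurs in all three documents, so its idf is log10(3/3) = 0 and it adds nothing to any score, while 'pie' occurs only in d3 and gives it the only non-zero score.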

MySQL's natural language mode also comes with a set of stopwords (the, a, some, etc.) and ignores words shorter than a minimum length (four characters by default for MyISAM, three for InnoDB). This is described in the documentation you linked:

Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. You can control the cutoff by setting a configuration option before creating the index: innodb_ft_min_token_size configuration option for InnoDB search indexes, or ft_min_word_len for MyISAM.

Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list. The stopword lists and related configuration options are different for InnoDB search indexes and MyISAM ones. Stopword processing is controlled by the configuration options innodb_ft_enable_stopword, innodb_ft_server_stopword_table, and innodb_ft_user_stopword_table for InnoDB search indexes, and ft_stopword_file for MyISAM ones.
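The effect of those two rules can be sketched in a few lines of Python. This is an assumption-laden illustration, not MySQL's tokenizer: the function indexable_tokens, the regular expression, and the three-word stopword set are all hypothetical, and the real built-in stopword lists are different and longer.

```python
import re

# Illustrative subset only; MySQL's built-in stopword lists differ.
STOPWORDS = {"the", "a", "some"}
# 3 is the InnoDB default named in the quoted docs; MyISAM's default is 4.
MIN_TOKEN_SIZE = 3


def indexable_tokens(text, min_len=MIN_TOKEN_SIZE, stopwords=STOPWORDS):
    """Return the words that survive the length and stopword filters."""
    words = re.findall(r"[a-z0-9_']+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stopwords]


innodb_like = indexable_tokens("I love the pie")             # min length 3
myisam_like = indexable_tokens("I love the pie", min_len=4)  # min length 4
```

With a minimum length of 3, "love" and "pie" survive; with a minimum length of 4, "pie" is dropped as well, so searching for it would find nothing.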

Kingofarms answered 18/4, 2014 at 23:0 Comment(2)
How about a human explanation, like an example of the difference? I still don't understand what exactly it means or does differently from the other modifiers. - Nasturtium
The query in the answer doesn't seem to match the "I love pie" example of the answer; i.e., the query doesn't use "I love pie" at all. The query given (using "database") comes from the MySQL documentation example where they are searching for entries in the articles table that have the word "database" in the title or body columns. If you wanted to find articles with some relevance to "I love pie", presumably you would use that query, but with ...AGAINST ('I love pie' IN NATURAL LANGUAGE MODE); - Prang

And what is it for?

From what I can gather, the Full-Text index enables methods that can help provide more useful search results, including:

  • Results are ordered by relevance
  • Individual word matching: using OR conditions (this yields more results than a LIKE, which is OK because the more relevant results will be at the top).
  • Boolean mode: add operators to modify the query (e.g. + for AND and - for NOT)
  • Query expansion: yields further results by performing a second search, adding "the few most highly relevant documents from the first search."
  • Ignores smaller words: ignores words shorter than the minimum length (3 characters by default for InnoDB, 4 for MyISAM)
  • Ignores common words: these are configured in a "stopword" list.
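As a rough illustration of how the OR-style matching differs from boolean mode's + and - operators, here is a simplified Python sketch. The functions natural_match and boolean_match are hypothetical and model only whether a row matches; they ignore ranking, stopwords, minimum word length, and the other boolean operators MySQL supports.

```python
def natural_match(query, text):
    """OR semantics: the row matches if it contains any query word."""
    words = set(text.lower().split())
    return any(term in words for term in query.lower().split())


def boolean_match(query, text):
    """Simplified boolean mode: +word is required, -word is forbidden."""
    words = set(text.lower().split())
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            optional.append(token)
    if any(w in words for w in excluded):
        return False  # a forbidden word is present
    if required:
        return all(w in words for w in required)
    return any(w in words for w in optional)  # no + terms: behaves like OR


row_a = "mysql tutorial for beginners"
row_b = "mysql versus yoursql compared"

natural_hits = [natural_match("mysql yoursql", r) for r in (row_a, row_b)]
boolean_hits = [boolean_match("+mysql -yoursql", r) for r in (row_a, row_b)]
```

Both rows match the OR-style query, while the boolean query keeps only the row that contains "mysql" and does not contain "yoursql".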

It seems most relevant for user-facing search over larger bodies of text (e.g. articles), but it can also be useful for querying smaller fields (e.g. record names).

Reference: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

Ampulla answered 22/9, 2023 at 4:0 Comment(0)