How to implement an autocorrect/alternative spelling search system with PHP and MySQL fulltext boolean mode for an MVP
Asked Answered
S

0

9

NB:

  • I could have used a dictionary like Pspell, Aspell or Hunspell but this case does not apply properly to business names or cities. Furthermore, I don't want to query the DB for all the suggested corrections (especially with a typeahead firing every 300ms) (more issues about these dictionnaries)

  • I could have used complementary search engine such as Elasticsearch or Sphinx but I do not have the financial or human resources allocated for this MVP. As suggested in this answer, MySQL fulltext should be enough and much less complex.

Technologies available:

MySQL 5.7 InnoDB with fulltext index boolean mode on desired fields, PHP 7.0 with php-fpm, VPS with Centos 7, corejs-typeahead

The objective:

I want to return from MySQL the results of a user search, whether it is a correct search or a misspelled search.

Example of common issues:

HYPHEN

  • words with hyphens '-' is annoying to search in partial search.

Potential solution:

  • I would have to wrap the search query within "" to search for a phrase (see [enter link description here][examples from man]. Still, it would not find a business named '"le dé-k-lé"' due to ft_min_word_len=3 AND "de" and "le" are stopwords (too frequent in many languages)

  • I could, but I will not get into the following solutions because I am not enough skilled or this is inappropriate. As suggested by MySQL manual to Modify the MySQL source or Modify a character set file or Add a new collation. For example, if I want to use the minus (-) operator to filter out some words in the future, it will not be possible anymore.

APOSTROPHE / SINGLE QUOTE

  • words with apostrophe are often searched without the apostrophe (especially on mobiles). For example "A'trego" would be input as "atrego". It will definitely be missed by a the fulltext index as "A'trego" is considered as 2 words "a" and "trego"

DOUBLE LETTERS MISSED

  • words with double letters are often missed or misspelled by the user. For example, 'Cerrutti' could be misspelled 'Cerutti' or 'Cerruti', etc.

Potential solution:

  • I could use SOUNDEX() but it is mostly designed for english language
  • I could use the levenshtein function but it would be slow for large datasets (e.g. table with all european cities). It seems that it has to do a fullscan, coupled with a typeahead, it is definitely not the the way to go. Even though some suggestions are interesting here and here

EXONYMS and PLURAL FORMS

  • Exonyms can be difficult to handle in searches (from the user perspective). For example, the Italian city Firenze is named Florenz in German, Florence in French, etc. People often switch from the exonym to the local name when they are in the city itself. Exonyms will not be handled appropriately by the previous algorithms. Furthermore, it is not a good user experience to have a city name without its exonyms. It is not good either for i18n.

Potential solution:

  • A self-made dictionary using Pspell or other similar libraries would return the string that is stored and indexed in MySQL.

DIACRITICS - similar to exonyms, it can be difficult to handle by the user. The same for i18n. For example try to find a restaurant in Łódź in Poland using your usual keyboard. A Polish and an English person will definitely not approach this string the same way.

Potential solution: - The potential solution is already managed in the front-end by the mapping used by the corejs-typeahead library. The remaining is cleaned with PHP $strCleaned = iconv('UTF-8', 'utf-8//TRANSLIT', $str);

ABBREVIATIONS & ACRONYMS - Abbreviations are used interchangeably for company names and especially for blue chips. For example, LVMH, HP, GM, GE, BMW. The same goes for cities. Not returning a company or a city when searching with the abbreviations is a big fail in term of user experience.

Potential solution:

This list is not exhaustive in the issues nor the potential solutions.

MY SOLUTION

My solution is inspired and extrapolated from an answer here

Basically, before each search, the user input should be stripped of characters like apostrophe, hyphen; simplified to remove similar consecutive letters.

Those cleaned alternative words will be stored in a column indexed with a fulltext index.

This solutions is kind of simple and appropriately respond to my requirements. But my short experience tells me to be cautious as it should definitely suffers from drawbacks (that I have not identified yet).

Below is a simplified version of my code.

PHP

// Get input from the typeahead searched word
$query = (!empty($_GET['q'])) ? strtolower($_GET['q']) : null;

// end the script if empty query
if (!isset($query)) {
    die('Invalid query.');
}

// Clean and Strip input
$query = trim($query);
$query = str_replace("'","",$query);
$query = str_replace("-","",$query);
$query = preg_replace('{(.)\1+}','$1',$query);

// filter/sanitize query
if (!preg_match("/^([0-9 '@&\-\.\pL])+$/ui", $input[$field]) !== false) {exit;}
$query = mysqli_real_escape_string($conn, $query); // I will switch to PDO prepared statement soon as mysqli_real_escape_string do not offer enough protection

MySQL Query

SELECT DISTINCT
company.company_name,
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE) AS relevance

FROM company

WHERE 
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE)
AND relevance > 1

ORDER BY
CASE
WHEN company_name = '$query' THEN 0
WHEN company_name LIKE '$query%' THEN 1
WHEN company_name LIKE '%$query' THEN 2
ELSE 3
END

LIMIT 20

MySQL Table

As a reminder, I got a two column fulltext index on (company_name,company_alternative)

**company_name**    |   **company_alternative**
l'Attrego           |   lattrego latrego attrego atrego
le Dé-K-Lé          |   dekle dekale decale
General Electric    |   GE  

THE DRAWBACKS of my solution that I have identified

  • The alternative words will not contain the common spelling mistakes until I add it manually to the alternative_name column or a machine learning process is implemented. Thus, difficult to manage and not scalable (this drawback can be dropped with not too much difficulty with machine learning as I already collect all search queries).
  • I have to manage a dynamic and complex stopword list
  • I have to rebuild the indexes due to lowering ft_min_word_len to 2

So my question, How to implement an autocorrect/alternative spelling search system with PHP and MySQL fulltext boolean mode for an MVP ?, could be reworded to,

  • Is my solution the least scalable ?

  • Do you see drawbacks that I do not see ?

  • How could I improve this approach if it is a reasonable one ?

Solfatara answered 4/4, 2018 at 0:33 Comment(2)
Really well researched question. I would like to say that LIKE won't catch all sub-string cases eg. if DB has 'Attrego' and user enters 'AAttrego' or 'Attregos' ie. anything that exceeds the target string stored in DB that the subject string($query) supposed to match would be ignored in order by ie. Attrego would match A,At,Att,Attre,.....Attrego but as soon as you add any extra letter it won't ; Attrego LIKE '%Attregos%' would failDisport
@Viney Yes there is room for improvement in the CASE. Maybe, a levenshtein function could be used there. The processing time would be reasonable. ThanksSolfatara

© 2022 - 2024 — McMap. All rights reserved.