NB:
I could have used a dictionary like Pspell, Aspell or Hunspell but this case does not apply properly to business names or cities. Furthermore, I don't want to query the DB for all the suggested corrections (especially with a typeahead firing every 300ms) (more issues about these dictionnaries)
I could have used complementary search engine such as Elasticsearch or Sphinx but I do not have the financial or human resources allocated for this MVP. As suggested in this answer, MySQL fulltext should be enough and much less complex.
Technologies available:
MySQL 5.7 InnoDB with fulltext index boolean mode on desired fields, PHP 7.0 with php-fpm, VPS with Centos 7, corejs-typeahead
The objective:
I want to return from MySQL the results of a user search, whether it is a correct search or a misspelled search.
Example of common issues:
HYPHEN
- words with hyphens '-' is annoying to search in partial search.
Potential solution:
I would have to wrap the search query within "" to search for a phrase (see [enter link description here][examples from man]. Still, it would not find a business named '"le dé-k-lé"' due to
ft_min_word_len=3
AND "de" and "le" are stopwords (too frequent in many languages)I could, but I will not get into the following solutions because I am not enough skilled or this is inappropriate. As suggested by MySQL manual to Modify the MySQL source or Modify a character set file or Add a new collation. For example, if I want to use the minus (-) operator to filter out some words in the future, it will not be possible anymore.
APOSTROPHE / SINGLE QUOTE
- words with apostrophe are often searched without the apostrophe (especially on mobiles). For example "A'trego" would be input as "atrego". It will definitely be missed by a the fulltext index as "A'trego" is considered as 2 words "a" and "trego"
DOUBLE LETTERS MISSED
- words with double letters are often missed or misspelled by the user. For example, 'Cerrutti' could be misspelled 'Cerutti' or 'Cerruti', etc.
Potential solution:
- I could use SOUNDEX() but it is mostly designed for english language
- I could use the levenshtein function but it would be slow for large datasets (e.g. table with all european cities). It seems that it has to do a fullscan, coupled with a typeahead, it is definitely not the the way to go. Even though some suggestions are interesting here and here
EXONYMS and PLURAL FORMS
- Exonyms can be difficult to handle in searches (from the user perspective). For example, the Italian city Firenze is named Florenz in German, Florence in French, etc. People often switch from the exonym to the local name when they are in the city itself. Exonyms will not be handled appropriately by the previous algorithms. Furthermore, it is not a good user experience to have a city name without its exonyms. It is not good either for i18n.
Potential solution:
- A self-made dictionary using Pspell or other similar libraries would return the string that is stored and indexed in MySQL.
DIACRITICS - similar to exonyms, it can be difficult to handle by the user. The same for i18n. For example try to find a restaurant in Łódź in Poland using your usual keyboard. A Polish and an English person will definitely not approach this string the same way.
Potential solution:
- The potential solution is already managed in the front-end by the mapping used by the corejs-typeahead library. The remaining is cleaned with PHP $strCleaned = iconv('UTF-8', 'utf-8//TRANSLIT', $str);
ABBREVIATIONS & ACRONYMS - Abbreviations are used interchangeably for company names and especially for blue chips. For example, LVMH, HP, GM, GE, BMW. The same goes for cities. Not returning a company or a city when searching with the abbreviations is a big fail in term of user experience.
Potential solution:
- First,
ft_min_word_len
should be reduced to two characters. - Second, a stopword list should be implemented
- Third, the fulltext index rebuilt.
- I do not see any other sustainable alternative
This list is not exhaustive in the issues nor the potential solutions.
MY SOLUTION
My solution is inspired and extrapolated from an answer here
Basically, before each search, the user input should be stripped of characters like apostrophe, hyphen; simplified to remove similar consecutive letters.
Those cleaned alternative words will be stored in a column indexed with a fulltext index.
This solutions is kind of simple and appropriately respond to my requirements. But my short experience tells me to be cautious as it should definitely suffers from drawbacks (that I have not identified yet).
Below is a simplified version of my code.
PHP
// Get input from the typeahead searched word
$query = (!empty($_GET['q'])) ? strtolower($_GET['q']) : null;
// end the script if empty query
if (!isset($query)) {
die('Invalid query.');
}
// Clean and Strip input
$query = trim($query);
$query = str_replace("'","",$query);
$query = str_replace("-","",$query);
$query = preg_replace('{(.)\1+}','$1',$query);
// filter/sanitize query
if (!preg_match("/^([0-9 '@&\-\.\pL])+$/ui", $input[$field]) !== false) {exit;}
$query = mysqli_real_escape_string($conn, $query); // I will switch to PDO prepared statement soon as mysqli_real_escape_string do not offer enough protection
MySQL Query
SELECT DISTINCT
company.company_name,
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE) AS relevance
FROM company
WHERE
MATCH (company_name, company_alternative) AGAINST ('$query*' IN BOOLEAN MODE)
AND relevance > 1
ORDER BY
CASE
WHEN company_name = '$query' THEN 0
WHEN company_name LIKE '$query%' THEN 1
WHEN company_name LIKE '%$query' THEN 2
ELSE 3
END
LIMIT 20
MySQL Table
As a reminder, I got a two column fulltext index on (company_name,company_alternative)
**company_name** | **company_alternative**
l'Attrego | lattrego latrego attrego atrego
le Dé-K-Lé | dekle dekale decale
General Electric | GE
THE DRAWBACKS of my solution that I have identified
- The alternative words will not contain the common spelling mistakes until I add it manually to the
alternative_name
column or a machine learning process is implemented. Thus, difficult to manage and not scalable (this drawback can be dropped with not too much difficulty with machine learning as I already collect all search queries). - I have to manage a dynamic and complex stopword list
- I have to rebuild the indexes due to lowering
ft_min_word_len
to 2
So my question, How to implement an autocorrect/alternative spelling search system with PHP and MySQL fulltext boolean mode for an MVP ?, could be reworded to,
Is my solution the least scalable ?
Do you see drawbacks that I do not see ?
How could I improve this approach if it is a reasonable one ?
Attrego
would matchA
,At
,Att
,Attre
,.....Attrego
but as soon as you add any extra letter it won't ;Attrego LIKE '%Attregos%'
would fail – Disport