Escape string for use in MySQL fulltext search
Asked Answered
P

2

26

I am using Laravel 4 and have set up the following query:

if(Input::get('keyword')) {
    $keyword = Input::get('keyword');
    $search = DB::connection()->getPdo()->quote($keyword);
    $query->whereRaw("MATCH(resources.name, resources.description, resources.website, resources.additional_info) AGAINST(? IN BOOLEAN MODE)", 
        array($search)
    );
}

This query runs fine under normal use, however, if the user enters a string such as ++, an error is thrown. Looking at the MySQl docs, there are some keywords, such as + and - which have specific purposes. Is there a function which will escape these types of special characters from a string so it can be used in a fulltext search like above without throwing any errors?

Here is an example of an error which is thrown:

{"error":{"type":"Illuminate\\Database\\QueryException","message":"SQLSTATE[42000]: Syntax error or access violation: 1064 syntax error, unexpected '+' (SQL: select * from `resources` where `duplicate` = 0 and MATCH(resources.name, resources.description, resources.website, resources.additional_info) AGAINST('c++' IN BOOLEAN MODE))","file":"\/var\/www\/html\/[...]\/vendor\/laravel\/framework\/src\/Illuminate\/Database\/Connection.php","line":555}}

Solutions I've tried:

$search = str_ireplace(['+', '-'], ' ', $keyword);

$search = filter_var($keyword, FILTER_SANITIZE_STRING);

$search = DB::connection()->getPdo()->quote($keyword);

I'm assuming I will need to use regex. What's the best approach here?

Pestana answered 22/10, 2014 at 12:6 Comment(4)
You'll want to escape that instead of remove.Indemnify
Right, that would be better. Are there any functions that will escape special characters like + and -?Pestana
But if these special characters are escaped, will they be useful in the search?Pestana
If you want to look for some hyphenated word, - is very useful, for instance.Indemnify
L
73

Only the words and operators have meaning in Boolean search mode. Operators are: +, -, > <, ( ), ~, *, ", @distance. After some research I found what word characters are: Upper case, Lower case letters, Numeral (digit) and _. I think you can use one of two approaches:

  1. Replace all non word characters with spaces (I prefer this approach). This can be accomplished with regex:

    $search = preg_replace('/[^\p{L}\p{N}_]+/u', ' ', $keyword);
    
  2. Replace characters-operators with spaces:

    $search = preg_replace('/[+\-><\(\)~*\"@]+/', ' ', $keyword);
    

Only words are indexed by full text search engine and can be searched. Non word characters isn't indexed, so it does not make sense to leave them in the search string.

References:

Lui answered 23/10, 2014 at 21:5 Comment(2)
Sidenote: I wanted to 'optimize' the 2nd option with: $prohibitedOperators = array("+","-","@","<",">","(", ")", "~","*","\""); $escapedQueryValue = str_replace($prohibitedOperators, ' ', $queryValue); Result: preg_replace() is faster!Tristis
Yes, until you have to search for a label with value as <label or which contain any of the special characters. This is not a suitable answer.Waring
W
2

While the answer from Rimas is technically correct, it will suit you only if you do not want users to use the MATCH operators, because it will strip them all completely. For example, I do want to allow use of all of them except @distance in search forms on my site, thus I've come up with this:

#Trim first
$newValue = preg_replace('/^\p{Z}+|\p{Z}+$/u', '', string);
#Remove all symbols except allowed operators and space. @distance is not included, since it's unlikely a human will be using it through UI form
$newValue = preg_replace('/[^\p{L}\p{N}_+\-<>~()"* ]/u', '', $newValue);
#Remove all operators, that can only precede a text and that are not preceded by either beginning of string or space
$newValue = preg_replace('/(?<!^| )[+\-<>~]/u', '', $newValue);
#Remove all double quotes and asterisks, that are not preceded by either beginning of string, letter, number or space
$newValue = preg_replace('/(?<![\p{L}\p{N}_ ]|^)[*"]/u', '', $newValue);
#Remove all double quotes and asterisks, that are inside text
$newValue = preg_replace('/([\p{L}\p{N}_])([*"])([\p{L}\p{N}_])/u', '', $newValue);
#Remove all opening parenthesis which are not preceded by beginning of string or space
$newValue = preg_replace('/(?<!^| )\(/u', '', $newValue);
#Remove all closing parenthesis which are not preceded by beginning of string or space or are not followed by end of string or space
$newValue = preg_replace('/(?<![\p{L}\p{N}_])\)|\)(?! |$)/u', '', $newValue);
#Remove all double quotes if the count is not even
if (substr_count($newValue, '"') % 2 !== 0) {
    $newValue = preg_replace('/"/u', '', $newValue);
}
#Remove all parenthesis if count of closing does not match count of opening ones
if (substr_count($newValue, '(') !== substr_count($newValue, ')')) {
    $newValue = preg_replace('/[()]/u', '', $newValue);
}

Unfortunately I was not able to figure out a way to do this in 1 regex, thus doing multiple runs. It's also possible, that I am missing some edge cases, as well. Any additions or corrections are appreciated: either here or create an issue for https://github.com/Simbiat/database where I implement this.

Weft answered 12/9, 2021 at 13:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.