MySQL REGEXP word boundaries [[:<:]] [[:>:]] and double quotes
Asked Answered
B

4

23

I'm trying to match some whole-word-expressions with the MySQL REGEXP function. There is a problem, when there are double quotes involved.

The MySQL documentation says: "To use a literal instance of a special character in a regular expression, precede it by two backslash () characters."

But these queries all return 0:

SELECT '"word"' REGEXP '[[:<:]]"word"[[:>:]]';             -> 0
SELECT '"word"' REGEXP '[[:<:]]\"word\"[[:>:]]';           -> 0
SELECT '"word"' REGEXP '[[:<:]]\\"word\\"[[:>:]]';         -> 0
SELECT '"word"' REGEXP '[[:<:]] word [[:>:]]';             -> 0
SELECT '"word"' REGEXP '[[:<:]][[.".]]word[[.".]][[:>:]]'; -> 0

What else can I try to get a 1? Or is this impossible?

Bargain answered 19/9, 2013 at 17:52 Comment(4)
Might be something about character set,is it saved in db as "word"?Ingalls
Yes it is. All possible settings (known to me) are set to utf-8...Bargain
To clarify what I want to do: I need an expression, that matches for example '"This" is what I need' but NOT '"This" is what I need today'. I'm not sure if I make myself clear enough...?Bargain
I know I'm late to the discussion, but how about using a Fulltext index instead. It should take care of all these problems. As of MySQL 5.5, it is supported for INNODB tables too.Devastating
P
40

Let me quote the documentation first:

[[:<:]], [[:>:]]

These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).

From the documentation we can see the reason behind your problem and it is not caused by escaping whatsoever. The problem is that you are trying to match the word boundary [[:<:]] right at the beginning of the string which won't work because a word boundary as you can see from the documentation separates a word character from a non-word character, but in your case the first character is a " which isn't a word character so there is no word boundary, the same goes for the last " and [[:>:]].

In order for this to work, you need to change your expression a bit to this one:

"[[:<:]]word[[:>:]]"
 ^^^^^^^    ^^^^^^^

Notice how the word boundary separates a non-word character " from a word character w in the beginning and a " from d at the end of the string.

EDIT: If you always want to use a word boundary at the start and end of the string without knowing if there will be an actual boundary then you might use the following expression:

([[:<:]]|^)"word"([[:>:]]|$)

This will either match a word boundary at the beginning or the start-of-string ^ and the same for the end of the word boundary or end-of-string. I really advise you to study the data you are trying to match and look for common patterns and don't use regular expressions if they are not the right tool for the job.

SQL Fiddle Demo

Pensioner answered 19/9, 2013 at 18:41 Comment(4)
Thanks for your explanation, but read further: "To use a literal instance of a special character in a regular expression, precede it by two backslash () characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one." ->SELECT '1+2' REGEXP '1\\+2'; -> 1 But you're right: It does not work at the beginning or end of the phrase, right where the boundaries are...Bargain
Unfortunately your suggestion will not work for me, because I have to match longer expressions like REGEXP '[[:<:]]"This is" what I need[[:>:]]' ... The non-word character has to be able to be anywhere in the expression, also right at the beginning or end. These strings are in the db, so I have no influence whatsoever on them.Bargain
@Heiko I know about escaping but " is not considered a special character in your situation because your are using ' and ' to delimit the string as you can see from the demo link i posted. On the other hand I have modified the answer a bit to cover your second comment so please check edit.Pensioner
I don't think I'm the first to wonder why it's such a pain to type mysql's regex boundary out...seriously? 7 characters for a simple boundary?Collaborate
K
11

In MySQL up from 8.0.4 use: \\bword\\b

ref. https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-compatibility

Katharinekatharsis answered 14/11, 2019 at 10:10 Comment(0)
L
5

In MySQL 8 and above

Adding to Oleksiy Muzalyev's answer

https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-compatibility

In MySQL 8.04 and above, you have to use:

\bword\b

Where \b represents the ICU variant for word boundary. The previous Spencer library uses [[:<:]] to represent a word boundary.

When actually using this as part of a query, I've had to escape the escape character \ so my query actually looked like

SELECT * FROM table WHERE field RLIKE '\\bterm\\b'

When querying from PHP, use SINGLE quotes to do the same thing

$sql = 'SELECT * FROM table WHERE field RLIKE ?';
$args = ['\\bterm\\b'];
...
Laroy answered 6/2, 2020 at 23:18 Comment(0)
B
2

You need to be a little more sophisticated:

SELECT '"word"' REGEXP '"word"';                                      --> 1
SELECT '"This is" what I need' REGEXP '"This is" what I need[[:>:]]'; --> 1

That is,

If the test string begins/ends with a 'letter', the precede/follow the string with [[:<:]]/[[:>:]].

This is as opposed to blindly tacking those onto the string. After all, you are already inspecting the search string for special regexp characters to escape them. This is just another task in that vein. The definition of 'letter' should match whatever the word-boundary tokens look for.

Bellaude answered 1/11, 2018 at 17:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.