Regex match entire words only

8

152

I have a regular expression that I'm using to find all the words in a given block of content, case-insensitively, that are contained in a glossary stored in a database. Here's my pattern:

/($word)/i

The problem is, if I use /(Foo)/i then words like Food get matched. There needs to be whitespace or a word boundary on both sides of the word.

How can I modify my expression to match only the word Foo when it is a word at the beginning, middle, or end of a sentence?

Tajo answered 17/11, 2009 at 19:49 Comment(1)
Most of the answers do not address hyphenated words. - Newsom
204

Use word boundaries:

/\b($word)\b/i

Or if you're searching for "S.P.E.C.T.R.E." like in Sinan Ünür's example:

/(?:\W|^)(\Q$word\E)(?:\W|$)/i
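
A minimal sketch of both patterns in JavaScript (an assumption, since the question does not name a language). JavaScript has no \Q...\E, so a hypothetical escapeRegExp helper stands in for it; the function names wholeWord and delimited are illustrative only.

```javascript
// Hypothetical helper: escape regex metacharacters so the term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// \b version: suitable for terms that start and end with a word character.
function wholeWord(word) {
  return new RegExp('\\b(' + escapeRegExp(word) + ')\\b', 'i');
}

// Second version: require a non-word character or a string edge on each side.
function delimited(word) {
  return new RegExp('(?:\\W|^)(' + escapeRegExp(word) + ')(?:\\W|$)', 'i');
}

const term = 'S.P.E.C.T.R.E.';
const sentence = 'S.P.E.C.T.R.E. (Special Executive)';

console.log(wholeWord('Foo').test('I like food'));    // false
console.log(wholeWord('Foo').test('A foo walks in')); // true
console.log(wholeWord(term).test(sentence));          // false
console.log(delimited(term).test(sentence));          // true
```

The \b version fails for the dotted term because the final "." and the following space are both non-word characters, so there is no word boundary between them.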
Hedge answered 17/11, 2009 at 19:51 Comment(8)
I was just typing up the long-hand version of this answer when you posted. :) - Ashely
@RichardSimoes \b(<|>=)\b doesn't match >= - Editheditha
@RichardSimoes and \b[-|+][0-9]+\b matches +10 in 43E+10. I don't want either behavior. - Editheditha
What if I want to search for a word that is not appended to or contained in any other word? Then this logic won't work. - Wanigan
How would someone get the mathematical comparison operators >= and <=? - Baba
Upvoted! What is the difference between \bword\b and \b(word)\b? I am getting different results, hence asking: the first also matches 'someword', the second doesn't. - Pye
Richard Simões, @Tajo this regex doesn't work with hyphenated words. - Newsom
A simpler way for the last case is to nest the groups: /(\b(regular|words)\b|a\.m\.|p\.m\.|a\.i\.)/. Basically, keep the \b around the words that should not match inside other words, and leave it off terms that either a) do not work with \b, such as terms containing ., or b) are unique enough or are already composite terms containing a space, hyphen, etc. - Safar
86

To match any whole word you would use the pattern (\w+)

Assuming you are using PCRE or something similar:

The screenshot that originally appeared here was taken from this live example: https://regex101.com/r/FGheKd/1

Matching any whole word on the command line with (\w+)

I'll use the phpsh interactive shell on Ubuntu 12.10 to demonstrate the PCRE regex engine through PHP's preg_match function.

Start phpsh, put some content into variables, and match against the pattern.

el@apollo:~/foo$ phpsh

php> $content1 = 'badger';
php> $content2 = '1234';
php> $content3 = '$%^&';

php> echo preg_match('(\w+)', $content1);
1

php> echo preg_match('(\w+)', $content2);
1

php> echo preg_match('(\w+)', $content3);
0

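The preg_match function used the PCRE engine within PHP to test the variables $content1, $content2 and $content3 against the (\w+) pattern. Note that in '(\w+)' the outer parentheses act as PHP's pattern delimiters, not as a capture group.

$content1 and $content2 contain at least one word character (letter, digit, or underscore); $content3 does not.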
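
The same check can be sketched in JavaScript (used here only for illustration; the answer itself uses PHP). The sample strings are taken from the session above, plus one hypothetical sentence to show extraction with the g flag.

```javascript
// \w matches letters, digits and underscore, so "1234" counts as a "word" here.
const samples = ['badger', '1234', '$%^&'];
const hasWordChars = samples.map((s) => /\w+/.test(s));
console.log(hasWordChars); // [ true, true, false ]

// With the g flag, match() returns every run of word characters.
const words = 'badger 1234 $%^& snake_case'.match(/\w+/g);
console.log(words); // [ 'badger', '1234', 'snake_case' ]
```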

Match a number of literal words on the commandline with (dart|fart)

el@apollo:~/foo$ phpsh

php> $gun1 = 'dart gun';
php> $gun2 = 'fart gun';
php> $gun3 = 'farty gun';
php> $gun4 = 'unicorn gun';

php> echo preg_match('(dart|fart)', $gun1);
1

php> echo preg_match('(dart|fart)', $gun2);
1

php> echo preg_match('(dart|fart)', $gun3);
1

php> echo preg_match('(dart|fart)', $gun4);
0

Variables $gun1 and $gun2 contain the string dart or fart; $gun4 does not. However, searching for the word fart also matches farty in $gun3, which may be a problem. To fix this, enforce word boundaries in the regex.

Match literal words on the commandline with word boundaries.

el@apollo:~/foo$ phpsh

php> $gun1 = 'dart gun';
php> $gun2 = 'fart gun';
php> $gun3 = 'farty gun';
php> $gun4 = 'unicorn gun';

php> echo preg_match('(\bdart\b|\bfart\b)', $gun1);
1

php> echo preg_match('(\bdart\b|\bfart\b)', $gun2);
1

php> echo preg_match('(\bdart\b|\bfart\b)', $gun3);
0

php> echo preg_match('(\bdart\b|\bfart\b)', $gun4);
0

This is the same as the previous example, except that farty no longer matches: with \b word boundaries on both sides, fart must appear as a whole word, and in farty the t is followed by another word character.
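
As a rough JavaScript equivalent of the two sessions above (a sketch, not the answer's own code), the boundaries can be factored out of the alternation with a non-capturing group. The variable names are illustrative.

```javascript
const guns = ['dart gun', 'fart gun', 'farty gun', 'unicorn gun'];

const loose = /(dart|fart)/;          // matches anywhere, even inside "farty"
const strict = /\b(?:dart|fart)\b/;   // both alternatives share one pair of boundaries

const looseHits = guns.filter((s) => loose.test(s));
const strictHits = guns.filter((s) => strict.test(s));

console.log(looseHits);  // [ 'dart gun', 'fart gun', 'farty gun' ]
console.log(strictHits); // [ 'dart gun', 'fart gun' ]
```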

Erna answered 6/1, 2014 at 17:51 Comment(1)
a.m., p.m. ain't words? - Forbore
11

Using \b can yield surprising results. You would be better off figuring out what separates a word from its definition and incorporating that information into your pattern.

#!/usr/bin/perl

use strict; use warnings;

use re 'debug';

my $str = 'S.P.E.C.T.R.E. (Special Executive for Counter-intelligence,
Terrorism, Revenge and Extortion) is a fictional global terrorist
organisation';

my $word = 'S.P.E.C.T.R.E.';

if ( $str =~ /\b(\Q$word\E)\b/ ) {
    print $1, "\n";
}

Output:

Compiling REx "\b(S\.P\.E\.C\.T\.R\.E\.)\b"
Final program:
   1: BOUND (2)
   2: OPEN1 (4)
   4:   EXACT  (9)
   9: CLOSE1 (11)
  11: BOUND (12)
  12: END (0)
anchored "S.P.E.C.T.R.E." at 0 (checking anchored) stclass BOUND minlen 14
Guessing start of match in sv for REx "\b(S\.P\.E\.C\.T\.R\.E\.)\b" against "S.P
.E.C.T.R.E. (Special Executive for Counter-intelligence,"...
Found anchored substr "S.P.E.C.T.R.E." at offset 0...
start_shift: 0 check_at: 0 s: 0 endpos: 1
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\b(S\.P\.E\.C\.T\.R\.E\.)\b" against "S.P.E.C.T.R.E. (Special Exec
utive for Counter-intelligence,"...
   0           |  1:BOUND(2)
   0           |  2:OPEN1(4)
   0           |  4:EXACT (9)
  14      |  9:CLOSE1(11)
  14      | 11:BOUND(12)
                                  failed...
Match failed
Freeing REx: "\b(S\.P\.E\.C\.T\.R\.E\.)\b"
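
One possible way to encode "what separates a word" explicitly is with lookarounds instead of \b. This is a swapped-in technique, sketched in JavaScript rather than Perl, and it assumes an engine with lookbehind support. The helper names and the sample text are hypothetical.

```javascript
// Hypothetical helper: escape regex metacharacters so the term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Match the term only when it is not touching a word character on either side.
// Lookarounds assert the context without consuming it.
function termPattern(term) {
  return new RegExp('(?<!\\w)' + escapeRegExp(term) + '(?!\\w)', 'g');
}

const text = 'S.P.E.C.T.R.E. (Special Executive) is fictional. S.P.E.C.T.R.E. again';

const withBoundary = text.match(/\bS\.P\.E\.C\.T\.R\.E\.\b/g);
const withLookaround = text.match(termPattern('S.P.E.C.T.R.E.'));

console.log(withBoundary);          // null
console.log(withLookaround.length); // 2

// A consuming delimiter such as (?:\W|^) ... (?:\W|$) can skip adjacent repeats:
const consumed = [...'foo foo'.matchAll(/(?:\W|^)(foo)(?:\W|$)/g)].length;
const asserted = 'foo foo'.match(termPattern('foo')).length;
console.log(consumed, asserted);    // 1 2
```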
Pires answered 17/11, 2009 at 20:3 Comment(1)
I think a word will typically be a \w word, but interesting point. - Grimbal
6

For those who want to validate an enum value in their code, you can follow this guide.

In a regex, ^ anchors the match at the start of the string and $ at the end. Combining them with | may be what you want:

^(Male)$|^(Female)$

It matches only when the entire input is exactly Male or Female.
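
A short JavaScript sketch of this kind of validation, with the anchors written once around a non-capturing group. The function name isGender is hypothetical.

```javascript
// Anchors apply to the whole alternation, so only the exact values pass.
const isGender = (value) => /^(?:Male|Female)$/.test(value);

console.log(isGender('Male'));    // true
console.log(isGender('Female'));  // true
console.log(isGender('Females')); // false
console.log(isGender('male'));    // false (no i flag, so case matters)
```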

Phytobiology answered 5/8, 2020 at 10:20 Comment(2)
^ and $ match the beginning and the end of a line respectively, so your example matches only if one of those words is the only thing on the line. - Brinna
And this is exactly what I want when validating an enum! What is the problem? - Phytobiology
3

If you are doing it in Notepad++

[\w]+ 

Would give you the entire word, and you can add parentheses to capture it as a group. Example: conv1 = Conv2D(64, (3, 3), activation=LeakyReLU(alpha=a), padding='valid', kernel_initializer='he_normal')(inputs). I would like to move LeakyReLU onto its own line as a comment and replace the current activation. In Notepad++ this can be done using the following find pattern:

([\w]+)( = .+)(LeakyReLU.alpha=a.)(.+)

and the replace command becomes:

\1\2'relu'\4 \n    # \1 = LeakyReLU\(alpha=a\)\(\1\)

The spaces are there to keep the right formatting in my code. :)

Mcleroy answered 11/6, 2019 at 10:55 Comment(0)
2

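Use word boundaries (\b).

The following (using four backslashes) works in my environment: Mac, Safari Version 10.0.3 (12602.4.8). In a plain JavaScript string literal, two backslashes ('\\b') already produce \b; four are needed only when the code passes through another layer of escaping.

var myReg = new RegExp('\\\\b' + variable + '\\\\b', 'g')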
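
How many backslashes actually reach the regex engine can be checked through the .source property. The sketch below assumes a plain JavaScript file with no extra templating layer, and uses a hypothetical search term.

```javascript
const variable = 'cat'; // hypothetical search term

// In a string literal, '\\b' is the two characters \b, i.e. a word boundary.
const twoBackslashes = new RegExp('\\b' + variable + '\\b', 'g');
// '\\\\b' is the three characters \\b, i.e. a literal backslash followed by "b".
const fourBackslashes = new RegExp('\\\\b' + variable + '\\\\b', 'g');

console.log(twoBackslashes.source);  // \bcat\b
console.log(fourBackslashes.source); // \\bcat\\b

const twoResult = 'concat cat'.match(twoBackslashes);
const fourResult = 'concat cat'.match(fourBackslashes);
console.log(twoResult);  // [ 'cat' ]
console.log(fourResult); // null
```

If the search term can contain regex metacharacters, it would also need escaping before being concatenated into the pattern.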
Stryker answered 7/6, 2018 at 18:11 Comment(0)
0

Get all "words" in a string

/([^\s]+)/g

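Basically, [^\s] matches any non-whitespace character, so the pattern breaks on spaces (it matches runs of non-spaces).
Don't forget the g flag, which stands for global: it finds every match rather than just the first.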

Try it:

"Not the answer you're looking for? Browse other questions tagged regex word-boundary or ask your own question.".match(/([^\s]+)/g)

→ (17) ['Not', 'the', 'answer', "you're", 'looking', 'for?', 'Browse', 'other', 'questions', 'tagged', 'regex', 'word-boundary', 'or', 'ask', 'your', 'own', 'question.']
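
If trailing punctuation should not be part of a word, one alternative is to list the characters that count as word characters. The sketch below assumes that apostrophes and hyphens belong to words, which is a judgment call rather than a rule; the sample sentence is shortened from the one above.

```javascript
const sentence = "Not the answer you're looking for? Browse other questions tagged regex word-boundary.";

// Runs of non-whitespace: punctuation stays attached ("for?", "word-boundary.").
const nonSpace = sentence.match(/[^\s]+/g);

// Runs of word characters, apostrophes and hyphens: punctuation is dropped.
const wordLike = sentence.match(/[\w'-]+/g);

console.log(nonSpace.includes('for?'));          // true
console.log(wordLike.includes('for'));           // true
console.log(wordLike.includes("you're"));        // true
console.log(wordLike.includes('word-boundary')); // true
```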

Wildlife answered 17/3, 2020 at 21:14 Comment(2)
If you disagree, explain and give a better solution. - Wildlife
Punctuation isn't a word, obviously! - Newsom
0

/(\s|^)TheWord(\s|$)/

console.log(/(\s|^)TheWord(\s|$)/i.test(' TheWord '));
console.log(/(\s|^)TheWord(\s|$)/i.test(' TheWord'));
console.log(/(\s|^)TheWord(\s|$)/i.test('TheWord'));
console.log(/(\s|^)TheWord(\s|$)/i.test(' this is TheWord '));
console.log(/(\s|^)TheWord(\s|$)/i.test(' this is TheWord'));
console.log(/(\s|^)TheWord(\s|$)/i.test('this is TheWord'));
console.log(/(\s|^)TheWord(\s|$)/i.test(' anything '));
console.log(/(\s|^)TheWord(\s|$)/i.test(' anything'));
console.log(/(\s|^)TheWord(\s|$)/i.test('anything'));
console.log(/(\s|^)TheWord(\s|$)/i.test(' this is anything '));
console.log(/(\s|^)TheWord(\s|$)/i.test(' this is anything'));
console.log(/(\s|^)TheWord(\s|$)/i.test('this is anything'));
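
One caveat worth checking: a whitespace-only delimiter rejects words that touch punctuation. The sketch below compares it with a \b version on a few hypothetical inputs.

```javascript
const spaceDelimited = /(\s|^)TheWord(\s|$)/i;
const boundaryDelimited = /\bTheWord\b/i;

const inputs = ['this is TheWord', 'this is TheWord.', '(TheWord)', 'TheWords'];

const spaceResults = inputs.map((s) => spaceDelimited.test(s));
const boundaryResults = inputs.map((s) => boundaryDelimited.test(s));

console.log(spaceResults);    // [ true, false, false, false ]
console.log(boundaryResults); // [ true, true, true, false ]
```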
Opinion answered 6/3 at 17:31 Comment(0)
