Python regex word boundary with unexpected results

import re
sstring = "ON Any ON Any"
regex1 = re.compile(r''' \bON\bANY\b''', re.VERBOSE)
regex2 = re.compile(r'''\b(ON)?\b(Any)?''', re.VERBOSE)
regex3 = re.compile(r'''\b(?:ON)?\b(?:Any)?''', re.VERBOSE)
for a in regex1.findall(sstring): print(a)
print("----------")
for a in regex2.findall(sstring): print(a)
print("----------")
for a in regex3.findall(sstring): print(a)
print("----------")

Output:

----------
('ON', '')
('', '')
('', 'Any')
('', '')
('ON', '')
('', '')
('', 'Any')
('', '')
----------
ON

Any

ON

Any

----------


Having read many articles on the internet and on S.O., I think I still don't understand the regex word boundary, \b.

The first regex doesn't give me the expected result. I think it should give me "ON Any ON Any", but it prints nothing.

The second regex gives me tuples, and I don't understand why, or what ('', '') means.

The third regex prints the results on separate lines, with empty lines in between.

Could you please help me understand this?

Plasterboard answered 5/10, 2016 at 13:41 Comment(0)

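Note that to match ON ANY you need to add a space between ON and ANY, and it must be escaped because you are using the re.VERBOSE flag, which ignores unescaped whitespace in the pattern. The \b word boundary is a zero-width assertion: it does not consume any text, it only asserts a position between specific characters. That is why your first approach, re.compile(r''' \bON\bANY\b''', re.VERBOSE), fails: there can be no word boundary between N and A when they are directly adjacent. Also note that ANY only matches "Any" if the match is case-insensitive.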

Use

rx = re.compile(r''' \bON\ ANY\b ''', re.VERBOSE|re.IGNORECASE)

See the Python demo
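
A minimal sketch comparing the original pattern with the fixed one on the string from the question (the names `broken` and `fixed` are only illustrative):

```python
import re

sstring = "ON Any ON Any"

# With re.VERBOSE the leading space is ignored, so this is \bON\bANY\b.
# \b consumes nothing, so the pattern requires "A" right after "N",
# and there is never a word boundary between two word characters.
broken = re.compile(r''' \bON\bANY\b''', re.VERBOSE)

# "\ " is a literal space even in VERBOSE mode;
# re.IGNORECASE lets ANY match "Any".
fixed = re.compile(r''' \bON\ ANY\b ''', re.VERBOSE | re.IGNORECASE)

broken_matches = broken.findall(sstring)
fixed_matches = fixed.findall(sstring)

print(broken_matches)  # []
print(fixed_matches)   # ['ON Any', 'ON Any']
```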

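With re.compile(r'''\b(ON)?\b(Any)?''', re.VERBOSE), findall returns tuples because you defined two (...) capturing groups in the pattern: when a pattern contains more than one group, findall returns one tuple of group values per match instead of the matched text. An empty string in a tuple means that group did not match anything, so ('', '') is a match where neither ON nor Any was found, only a word boundary.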
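
To illustrate how the number of capturing groups changes what findall returns, here is a small sketch using the question's string (variable names are only illustrative):

```python
import re

sstring = "ON Any ON Any"

# Two capturing groups: one tuple (group 1, group 2) per match.
two_groups = re.findall(r'\b(ON)?\b(Any)?', sstring)

# One capturing group: a list of that group's values.
one_group = re.findall(r'\b(ON)\b', sstring)

# No capturing groups: a list of the whole matches.
no_groups = re.findall(r'\bON\b', sstring)

print(two_groups)
# [('ON', ''), ('', ''), ('', 'Any'), ('', ''),
#  ('ON', ''), ('', ''), ('', 'Any'), ('', '')]
print(one_group)  # ['ON', 'ON']
print(no_groups)  # ['ON', 'ON']
```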

The re.compile(r'''\b(?:ON)?\b(?:Any)?''', re.VERBOSE) matches optional sequences, either ON or Any, so you get those words as values. You get empty values as well because this regex can match just a word boundary (all other subpatterns are optional).
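
To see where the empty values come from, one option is to use finditer and look at the span of each match. This is only a sketch, and filtering out empty strings at the end is just one possible way to drop them:

```python
import re

sstring = "ON Any ON Any"
regex3 = re.compile(r'''\b(?:ON)?\b(?:Any)?''', re.VERBOSE)

# (start, end, text) for every match; start == end means an empty match,
# i.e. only the word boundary itself matched at that position.
spans = [(m.start(), m.end(), m.group()) for m in regex3.finditer(sstring)]
for span in spans:
    print(span)
# (0, 2, 'ON')
# (2, 2, '')
# (3, 6, 'Any')
# (6, 6, '')
# (7, 9, 'ON')
# (9, 9, '')
# (10, 13, 'Any')
# (13, 13, '')

# Dropping the empty matches afterwards:
non_empty = [text for text in regex3.findall(sstring) if text]
print(non_empty)  # ['ON', 'Any', 'ON', 'Any']
```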


Acanthoid answered 5/10, 2016 at 14:4 Comment(3)
There are a lot of cases when \b is helpful. Surely it is not a universal remedy, and re.UNICODE is sometimes necessary, or even (?<!\S)/(?!\S) lookarounds can turn out to be better alternatives, but the main point is that it does not move the regex index. If you have a space, match it. - Kelci
Thanks @Wiktor. I think my misunderstanding was that I couldn't imagine a use for \b if I have to write the whitespace between "ON" and "Any" anyway. With an example like this I can now see that it may be useful sometimes: the regex is \bON\b and the search strings are ON#Any or ON!Any. But in my first regex, "\bON\bAny", I have to write the whitespace, since \b is zero-width. - Plasterboard
Sorry, I'm new to Stack Overflow and still learning its formatting, so I deleted the comment more than once to write it correctly. Thank you for your much-appreciated support. - Plasterboard
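
The cases mentioned in the comments above (ON#Any, ON!Any, and the (?<!\S)/(?!\S) lookarounds) can be compared with a short sketch. The sample strings and variable names here are illustrative, not from the original answer:

```python
import re

# \b only requires a word/non-word transition on each side of ON.
word_boundary = re.compile(r'\bON\b')

# The lookarounds require whitespace (or the string edge) on each side.
whitespace_bounded = re.compile(r'(?<!\S)ON(?!\S)')

samples = ["ON Any", "ON#Any", "ON!Any", "ONAny"]

wb_results = [bool(word_boundary.search(s)) for s in samples]
ws_results = [bool(whitespace_bounded.search(s)) for s in samples]

print(wb_results)  # [True, True, True, False]
print(ws_results)  # [True, False, False, False]
```

So \bON\b is useful when ON may be followed by punctuation, while the lookaround version only accepts ON as a whitespace-delimited token.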
