Regex lookahead, lookbehind and atomic groups

5

579

I found these things in my regex body but I haven't got a clue what I can use them for. Does somebody have examples so I can try to understand how they work?

(?=) - positive lookahead
(?!) - negative lookahead
(?<=) - positive lookbehind
(?<!) - negative lookbehind

(?>) - atomic group
Stephanotis answered 4/6, 2010 at 10:56 Comment(3)
Why doesn't the regex website have a simple table like this? Instead they only have blocks of explanatory text. regular-expressions.info/lookaround.html - Headforemost
@Headforemost Try: regex101.com regexr.com - Potage
Try this regex tester. It provides explanations and regex visualization. - Broadleaved
1658

Examples

Given the string foobarbarfoo:

bar(?=bar)     finds the 1st bar ("bar" which has "bar" after it)
bar(?!bar)     finds the 2nd bar ("bar" which does not have "bar" after it)
(?<=foo)bar    finds the 1st bar ("bar" which has "foo" before it)
(?<!foo)bar    finds the 2nd bar ("bar" which does not have "foo" before it)

You can also combine them:

(?<=foo)bar(?=bar)    finds the 1st bar ("bar" with "foo" before it and "bar" after it)
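The examples above can be checked with a short script. This is a minimal sketch using Python's `re` module; the helper name `match_starts` is made up for illustration. It reports where each pattern matches in foobarbarfoo (index 3 is the 1st bar, index 6 is the 2nd bar).

```python
import re

TEXT = "foobarbarfoo"

# Patterns from the examples, paired with a short description.
PATTERNS = {
    r"bar(?=bar)": "bar followed by bar",
    r"bar(?!bar)": "bar not followed by bar",
    r"(?<=foo)bar": "bar preceded by foo",
    r"(?<!foo)bar": "bar not preceded by foo",
    r"(?<=foo)bar(?=bar)": "bar between foo and bar",
}


def match_starts(pattern, text=TEXT):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


for pattern, description in PATTERNS.items():
    print(f"{pattern:<22} {description:<26} {match_starts(pattern)}")
```

In every case the matched text is just "bar"; the lookaround only decides which bar is accepted.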

Definitions

Look ahead positive (?=)

Find expression A where expression B follows:

A(?=B)

Look ahead negative (?!)

Find expression A where expression B does not follow:

A(?!B)

Look behind positive (?<=)

Find expression A where expression B precedes:

(?<=B)A

Look behind negative (?<!)

Find expression A where expression B does not precede:

(?<!B)A
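How A(?=B) differs from simply writing AB is easiest to see in a replacement. The sketch below (Python's `re` module, with sample strings chosen only for illustration) shows that the lookahead checks B without making it part of the match.

```python
import re

# The lookahead checks for "bar" but leaves it out of the match,
# so only "foo" is replaced.
with_lookahead = re.sub(r"foo(?=bar)", "X", "foobar")

# A plain sequence consumes "bar" as well, so everything is replaced.
without_lookahead = re.sub(r"foobar", "X", "foobar")

# Since the assertion consumes nothing, the text it inspected is still
# available for the next match.
zero_width = re.findall(r"a(?=a)", "aaa")
consuming = re.findall(r"aa", "aaa")

print(with_lookahead)      # Xbar
print(without_lookahead)   # X
print(zero_width)          # ['a', 'a']
print(consuming)           # ['aa']
```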

Atomic groups (?>)

Once the pattern inside an atomic group has matched, the regex engine exits the group and throws away the alternatives it has not tried yet (backtracking into the group is disabled).

  • (?>foo|foot)s applied to foots will match its 1st alternative foo, then fail as s does not immediately follow, and stop as backtracking is disabled

A non-atomic group will allow backtracking; if subsequent matching ahead fails, it will backtrack and use alternative patterns until a match for the entire expression is found or all possibilities are exhausted.

  • (foo|foot)s applied to foots will:

    1. match its 1st alternative foo, then fail as s does not immediately follow in foots, and backtrack to its 2nd alternative;
    2. match its 2nd alternative foot, then succeed as s immediately follows in foots, and stop.
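The two behaviours can be compared in Python. Note that the `(?>...)` syntax is only accepted by the standard `re` module from Python 3.11 onwards, so this sketch also uses a commonly cited emulation: a lookahead is itself atomic, so capturing inside it and replaying the capture with `\1` should behave like the atomic group. Treat the emulation as an illustration, not as part of the original answer.

```python
import re
import sys

TEXT = "foots"

# Non-atomic group: the engine backtracks from "foo" to "foot".
plain = re.search(r"(foo|foot)s", TEXT)

# Emulated atomic group: the lookahead commits to "foo", \1 consumes it,
# and "s" then fails against "t" with no alternative left to try.
emulated = re.search(r"(?=(foo|foot))\1s", TEXT)

# Native atomic group, only compiled where the interpreter supports it.
if sys.version_info >= (3, 11):
    native = re.search(r"(?>foo|foot)s", TEXT)
else:
    native = None

print(plain.group() if plain else None)  # foots
print(emulated)                          # None
print(native)                            # None
```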

Shurlocke answered 4/6, 2010 at 10:56 Comment(11)
What do you mean by the "finds the second bar" part? There is only one bar in the expression/string. Thanks - Tgroup
@Tgroup The string being tested is "foobarbarfoo". As you can see, there are two foo and two bar in the string. - Shurlocke
@Tgroup Try going to pythex.org and playing with it a little; you will understand it completely. - Inventive
Place two bars side by side, like barbar, in the text on which these regexes will be tried. - Brummell
Can someone explain when one may need an atomic group? If I only need to match with the first alternative, why would I want to give multiple alternatives? - Cachucha
@Shurlocke or anyone on here: I can see that the "(?<=B)A" lookbehind always comes before the actual lookup. Does that mean it must always come before? Can this also be done as "A(?<=B)"? As the names suggest, one looks "behind" and the other looks "ahead". Thank you if anyone can explain. - Divisible
There is a better explanation of atomic groups at this answer. Can someone edit here to complete this didactic answer? - Spirit
Just a note that this answer was essential when I ended up on a project that required serious regex chops. This is an excellent, concise explanation of look-arounds. - Warplane
Why are lookaheads/lookbehinds needed? Why can't it be solved by specifying that expression A must be followed by expression B, such as (A)(B)? - Acrobat
Sorry, but your regex doesn't work for the string foobarfoobarfoo. Check this demo. What is the issue? - Broadleaved
@AmineKOUIS The issue is that the string you provided doesn't have "barbar", that is, a bar IMMEDIATELY followed by another bar. It does have a second bar, but further away. If you use the same regex with "foobarbarfoo" instead, it will work because that string does contain "barbar". - Peon
251

Lookarounds are zero-width assertions. They check for a regex (to the right or left of the current position, depending on whether it is a lookahead or a lookbehind), succeed or fail when a match is found (depending on whether they are positive or negative), and discard the matched portion. They don't consume any characters: the matching for the regex following them (if any) will start at the same cursor position.

Read regular-expressions.info for more details.

  • Positive lookahead:

Syntax:

(?=REGEX_1)REGEX_2

Match only if REGEX_1 matches; after matching REGEX_1, the match is discarded and searching for REGEX_2 starts at the same position.

example:

(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}

REGEX_1 is [a-z0-9]{4}$ which matches four lowercase letters or digits followed by the end of the line.
REGEX_2 is [a-z]{1,2}[0-9]{2,3} which matches one or two letters followed by two or three digits.

REGEX_1 makes sure that the length of the string is indeed 4, but doesn't consume any characters, so the search for REGEX_2 starts at the same location. REGEX_2 then makes sure that the string matches some other rules. Without the look-ahead, REGEX_2 alone would also match strings of length three or five. (This assumes the pattern is applied to the whole string; the regex as written is not anchored at the start.)
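A quick way to see the effect of the look-ahead is to run the pattern with and without it. The sketch below uses Python's `re.fullmatch` so that the whole string has to match, which is an assumption about how the pattern is meant to be used; the helper name `accepts` and the sample strings are made up for illustration.

```python
import re

WITH_LOOKAHEAD = re.compile(r"(?=[a-z0-9]{4}$)[a-z]{1,2}[0-9]{2,3}")
WITHOUT_LOOKAHEAD = re.compile(r"[a-z]{1,2}[0-9]{2,3}")


def accepts(regex, candidate):
    """True if the whole candidate string matches the compiled regex."""
    return regex.fullmatch(candidate) is not None


for candidate in ["a123", "ab12", "a12", "ab123", "abcd"]:
    print(
        candidate,
        accepts(WITH_LOOKAHEAD, candidate),
        accepts(WITHOUT_LOOKAHEAD, candidate),
    )
```

Only a123 and ab12 pass both checks; a12 and ab123 pass only without the look-ahead, and abcd fails both because it has no digits.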

  • Negative lookahead

Syntax:

(?!REGEX_1)REGEX_2

Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.

example:

(?!.*\bFWORD\b)\w{10,30}$

The look-ahead part checks for FWORD in the string and fails if it finds it. If it doesn't find FWORD, the look-ahead succeeds and the following part verifies that the string's length is between 10 and 30 and that it contains only word characters (a-z, A-Z, 0-9 and _).
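The same idea can be tried out with a concrete forbidden word. The sketch below is a hypothetical variant, not the pattern above verbatim: the forbidden word is "admin", the \b anchors are dropped so the word is rejected anywhere in the string, and `re.fullmatch` is used so the whole string is validated. The function name `is_valid` and the sample names are made up.

```python
import re

# Hypothetical variant: reject "admin" anywhere, then require 10 to 30
# word characters.
USERNAME = re.compile(r"(?!.*admin)\w{10,30}$")


def is_valid(name):
    """True if name is 10-30 word characters and does not contain 'admin'."""
    return USERNAME.fullmatch(name) is not None


print(is_valid("super_user_42"))   # True: long enough, no forbidden word
print(is_valid("site_admin_01"))   # False: contains "admin"
print(is_valid("short_one"))       # False: fewer than 10 characters
print(is_valid("has space 123"))   # False: space is not a word character
```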

Look-behind is similar to look-ahead: it just looks behind the current cursor position. Some regex flavors do not support look-behind assertions (JavaScript, for example, only gained them in ES2018). And most flavors that support them (PHP, Python, etc.) require the look-behind portion to have a fixed length.
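The fixed-length restriction is easy to observe in Python, whose standard `re` module rejects a variable-width look-behind at compile time. A minimal sketch (the helper `compiles` and the sample text are made up for illustration):

```python
import re


def compiles(pattern):
    """True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-width look-behind: digits preceded by a dollar sign.
amounts = re.findall(r"(?<=\$)\d+", "cost: $42, qty: 7")
print(amounts)                      # ['42']

print(compiles(r"(?<=\$)\d+"))      # True: look-behind is one character wide
print(compiles(r"(?<=\$+)\d+"))     # False: look-behind has variable width
```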

  • An atomic group basically discards (forgets) the remaining alternatives inside the group once one of them has matched. Check this page for examples of atomic groups.
Dismissive answered 4/6, 2010 at 10:56 Comment(5)
Following your explanation, this does not seem to work in JavaScript: /(?=source)hello/.exec("source...hummhellosource") = null. Is your explanation correct? - Sym
@HelinWang That explanation is correct. Your regex expects a string that is both source and hello at the same time! - Dismissive
@jddxf Care to elaborate? - Dismissive
@Dismissive I agree with "They check for a regex (to the right or left of the current position, depending on whether it is a lookahead or a lookbehind), succeed or fail when a match is found (depending on whether they are positive or negative), and discard the matched portion." So lookahead should check for a regex to the right of the current position, and the syntax of positive lookahead should be x(?=y). - Chester
@Dismissive Would (?=REGEX_1)REGEX_2 only match if REGEX_2 comes after REGEX_1? - Waxplant
2

Why? Suppose you are playing Wordle and you've entered "ant". (Yes, a three-letter word; it's only an example, chill.)

The answer comes back as blank, yellow, green, and you have a list of three-letter words that you want to search with a regex. How would you do it?

To begin with, you could require the t in the third position:

[a-z]{2}t

We could improve on this by noting that there is no a:

[b-z]{2}t

We could improve further by requiring that the match has an n in it:

(?=.*n)[b-z]{2}t

Or, to break it down:

(?=.*n) - look ahead and check that the match has an n in it, with zero or more characters before that n;

[b-z]{2} - two letters other than 'a' in the first two positions;

t - literally a 't' in the third position.
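Applied to a word list, the final pattern can be used as a filter. This is a minimal sketch in Python; the word list is hypothetical, and `re.fullmatch` is used on the assumption that each entry is a single lowercase three-letter word.

```python
import re

# Hypothetical word list; any list of lowercase three-letter words would do.
WORDS = ["ant", "bat", "cot", "net", "nit", "nut", "pit"]

# An n somewhere in the word, no a in the first two positions, t in third.
PATTERN = re.compile(r"(?=.*n)[b-z]{2}t")

candidates = [word for word in WORDS if PATTERN.fullmatch(word)]
print(candidates)  # ['net', 'nit', 'nut']
```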

Matthei answered 4/6, 2010 at 10:56 Comment(0)
0

Grokking lookaround rapidly.
How do you distinguish lookahead from lookbehind? Take a two-minute tour with me:

(?=) - positive lookahead
(?<=) - positive lookbehind

Suppose

    A  B  C #in a line

Now we ask B: where are you?
B has two ways to declare its location:

One: B has A ahead and has C behind.
Two: B is ahead (lookahead) of C and behind (lookbehind) A.

As we can see, "behind" and "ahead" are opposite in the two solutions.
Regex uses solution Two.

Nicely answered 4/6, 2010 at 10:56 Comment(1)
I think you got it backwards: B is ahead of A and B is behind C. Alternatively, C is ahead of B and A is behind B. Or did I miss something? - Value
-1

I used a lookbehind to find the schema and a negative lookahead to find tables that are missing with(nolock):

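import re

# Hypothetical sample query; replace it with the SQL text you want to check.
sql = "select * from DB.dbo.Orders o join DB.dbo.Items i with(nolock) on o.id = i.id"
# Raw string, so the backslashes reach the regex engine unchanged.
expression = r"(?<=DB\.dbo\.)\w+\s+\w+\s+(?!with\(nolock\))"

matches = re.findall(expression, sql)
for match in matches:
    print(match)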
Polygynist answered 4/6, 2010 at 10:56 Comment(0)
