Pattern in lookbehind

3

3

My question is about lookbehinds. I want to find the first number after the word "this" on each line. I have the following data:

188282 this is an example of a number 12345 and 54321
188282 this is an example of a number 1234556
this is an example of a number 1234556
187293 this is another example of a number 74893 and 83978

Pattern:

this is an example of a number \d+

Output:

188282 this is an example of a number 12345 and 54321
188282 this is an example of a number 1234556
this is an example of a number 1234556
187293 this is another example of a number 74893 and 83978

To match all of them, I used a more generic approach, since I know I want the first number after the word “this”:

Pattern:

this[^\d]+\d+

Output:

188282 this is an example of a number 12345 and 54321
188282 this is an example of a number 1234556
this is an example of a number 1234556
187293 this is another example of a number 74893 and 83978

I’m trying to use lookbehinds now, as I don’t want to include part of the pattern in the results. Following my first approach:

Pattern:

(?<=this is an example of a number )\d+

Output:

188282 this is an example of a number 12345 and 54321
188282 this is an example of a number 1234556
this is an example of a number 1234556
187293 this is another example of a number 74893 and 83978

Looks like I’m getting there. I want to cover the last case as before, so I tried my second approach.

Pattern:

(?<=this[^\d]+)\d+

Output:

188282 this is an example of a number 12345 and 54321
188282 this is an example of a number 1234556
this is an example of a number 1234556
187293 this is another example of a number 74893 and 83978

It doesn’t match anything.
Is it possible to have patterns inside lookbehinds? Am I taking the wrong approach to this problem? It’s a bit long, but I wanted to show what I have tried so far instead of just asking the question.

Thanks in advance

Torrefy answered 8/1, 2014 at 11:20 Comment(0)
B
1

The thing with lookbehinds is that not all languages support variable-width lookbehinds (that is, lookbehinds whose contents can match a variable number of characters).

What you can do instead is use a lookahead and a capture group:

(?=this[^\d]+(\d+))

regex101 demo
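
As a rough illustration, here is how the lookahead-plus-capture-group pattern might be used in Python. The choice of Python and the names DATA and first_numbers_after_this are assumptions made for this sketch, not something from the question. Because the lookahead consumes no characters, the overall match is empty and the digits are read from group 1.

```python
import re

# Sample lines copied from the question.
DATA = """\
188282 this is an example of a number 12345 and 54321
188282 this is an example of a number 1234556
this is an example of a number 1234556
187293 this is another example of a number 74893 and 83978
"""

# The lookahead itself is zero-width; group 1 captures the digits.
LOOKAHEAD = re.compile(r"(?=this[^\d]+(\d+))")


def first_numbers_after_this(text):
    """Return the first number following each occurrence of 'this'."""
    # With exactly one capture group, findall returns that group's text.
    return LOOKAHEAD.findall(text)


if __name__ == "__main__":
    print(first_numbers_after_this(DATA))
```

Note that [^\d]+ can also run across line breaks, so on input where a line has "this" but no number, the capture could come from a later line.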

Or maybe use the \K escape, which resets the start of the reported match (if your regex engine supports it):

this[^\d]+\K\d+

regex101 demo

Bartlet answered 8/1, 2014 at 11:27 Comment(4)
Thanks for the alternative approaches. - Torrefy
Funnily enough, .NET doesn't support \K (you mentioned "if your engine supports it"), but it does support variable-width lookbehinds. - Torrefy
@JoaoRaposo Yup! That's true. Go figure why some languages implement some stuff while others don't! JavaScript doesn't support either! If your language/regex engine doesn't support either (might be rare, but who knows), I'd say simply use this[^\d]+(\d+) and take only the first capture group (ignoring the main match). - Bartlet
I'm a .NET dev, so I guess I'll be OK regarding this issue, but I'm definitely going to have a look at the differences. To be honest, I wasn't aware of this matter. Thanks again for the advice. - Torrefy
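
The plain capture-group fallback suggested in the comments could look like the following in Python. This is a minimal sketch; the function name match_and_number is made up for illustration. The full match still contains the "this ..." prefix, so only group 1 is kept as the result.

```python
import re

# No lookaround at all: match the prefix, capture only the digits.
CAPTURE = re.compile(r"this[^\d]+(\d+)")


def match_and_number(line):
    """Return (full match, captured number) for the first hit, or None."""
    m = CAPTURE.search(line)
    if m is None:
        return None
    # group(0) is the whole match including the prefix; group(1) is the number.
    return m.group(0), m.group(1)


if __name__ == "__main__":
    line = "187293 this is another example of a number 74893 and 83978"
    print(match_and_number(line))
```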
B
2

Yes, you can use patterns inside lookbehinds, but what you can't do in most regex flavors is have a variable-length lookbehind. In other words, you can't use a variable quantifier inside a lookbehind (a fixed quantifier like {n} is allowed). Some regex flavors, however, allow the alternation | or a bounded quantifier such as {1,n} (as in Java).

In .NET languages, variable-length lookbehinds are allowed.

Blab answered 8/1, 2014 at 11:24 Comment(4)
This answer has been added to the Stack Overflow Regular Expression FAQ, under "Lookarounds". - Neptunian
@aliteralmind: Cool, I will try to improve it as soon as possible. (I am currently editing several posts with the same mistake.) - Blab
Looking forward to it. - Neptunian
This is experimentally allowed in Perl since 5.30: perldoc.pl/… - Lolland
U
0

It depends on your implementation of regex. You'll have to do some testing for sure.

I know that some implementations don't like this:

(?<=\d{1,5}) or (?<=\w*)

But they will work fine with this:

(?<=\d{5}) or (?<=\w{1000})

In other words, no repetition or flexible lengths.
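
One way to do that testing, sketched here for Python's built-in re module (one implementation among many; other engines behave differently): try to compile each pattern and see whether the engine rejects it. The helper name compiles is an assumption for this example. Python's re rejects variable-width lookbehinds at compile time with re.error.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised, for example, when a lookbehind is not fixed-width.
        return False


if __name__ == "__main__":
    patterns = (
        r"(?<=\d{5})",                                # fixed width
        r"(?<=this is an example of a number )\d+",   # fixed width
        r"(?<=\d{1,5})",                              # variable width
        r"(?<=\w*)",                                  # variable width
        r"(?<=this[^\d]+)\d+",                        # variable width
    )
    for p in patterns:
        print(p, "->", "accepted" if compiles(p) else "rejected")
```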

Undressed answered 8/1, 2014 at 11:27 Comment(0)
