Positive lookbehind vs non-capturing group: different behaviour

Asked 4/2, 2013 at 17:46 Answered 10/8, 2017 at 13:49

Solved python regex lookbehind capturing-group

I use Python regular expressions (the re module) in my code and noticed different behaviour in these cases:

re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']

and

re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']

What I need to get is just ['xyz', 'abc']. Why do the examples behave differently, and how do I get the desired result?

Decarbonate answered 4/2, 2013 at 17:46 Comment(0)

The reason a and b are included in the second case is that (?<=[a-z]\)) would first find a), and since lookarounds don't consume any characters, you are back at the start of the string. Now [^.)]+ matches a.

Now you are at ). Since you have made (?<=[a-z]\)) optional, [^.)]+ matches xyz.

The same thing is repeated with b) abc.

Remove the ? from the second case and you would get the expected result, i.e. ['xyz', 'abc'].
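As a rough check of where each match in the second case starts and ends, the sketch below lists the match spans for the question's lookbehind pattern and compares them with the same pattern with the lookbehind deleted. The helper name spans is hypothetical and only used for illustration; the patterns and the input string are the ones from the question.

```python
import re

text = 'a) xyz. b) abc.'

# Pattern from the question: the trailing '?' makes the lookbehind optional.
optional_lookbehind = r'\s*(?<=[a-z]\))?[^.)]+'
# The same pattern with the lookbehind deleted entirely.
no_lookbehind = r'\s*[^.)]+'


def spans(pattern, string):
    """Hypothetical helper: (start, end, text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, string)]


with_optional = spans(optional_lookbehind, text)
without = spans(no_lookbehind, text)

for start, end, matched in with_optional:
    print(start, end, repr(matched))
# 0 1 'a'
# 2 6 ' xyz'
# 7 9 ' b'
# 10 14 ' abc'

# An optional lookbehind can never reject a position, so both lists are equal.
print(with_optional == without)  # True
```

The matches never overlap: each one stops at a ) or a . and the next one starts after it, which is why a and b show up as separate results.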

Stultify answered 4/2, 2013 at 17:53 Comment(12)

The non-capturing group in the first case is optional, too (if no a) in text, then match the whole text). – Decarbonate 4/2, 2013 at 17:53

@chersanya that's why i had said second case not first case..there is difference between them – Stultify 4/2, 2013 at 17:54

@chersanya also, lookarounds check for the specified pattern but don't eat any characters..hence the result – Stultify 4/2, 2013 at 17:56

Oh, I've got it) The real issue is that lookarounds don't consume anything, so findall finds a in a) too. – Decarbonate 4/2, 2013 at 18:0

Would you add the reason to your answer? – Decarbonate 4/2, 2013 at 18:1

@chersanya: The "not consume anything" is not a good explanation. The text which is skipped can be considered consumed. The reason your original regex fails is plainly because of the ?. – Catalepsy 4/2, 2013 at 18:3

@nhahtdh: are you sure? Lookbehind doesn't consume text, so the occurrences of a in a) and abc in abc are non-overlapping. If it consumed, there would be no difference from the first case I provided. – Decarbonate 4/2, 2013 at 18:9

@chersanya: Look-behind doesn't consume text is correct. But since you make the look-behind optional, the regex is effectively \s*[^.)]+. Making look-behind optional seems to be only supported in Python and I don't know why they allow it - it doesn't make sense to do such thing, though. – Catalepsy 4/2, 2013 at 18:12

@nhahtdh: but if it consumed the text, the regex with lookbehind would (my 2nd case) be equivalent to the first case, which obviously differs from \s*[^.)]+? Or no (why)? – Decarbonate 4/2, 2013 at 18:15

@Catalepsy lookarounds can be optional..it's allowed in .net..but i agree that it really doesn't make sense – Stultify 4/2, 2013 at 18:21

@chersanya: What I meant is that, due to ?, the regex is made equivalent to \s*[^.)]+, since the result of the look-behind (whether true or false) doesn't stop the match. – Catalepsy 4/2, 2013 at 18:27

@chersanya: The "not consume anything" argument may come to play in some other case, but not this one. – Catalepsy 4/2, 2013 at 18:40

The regex you are looking for is:

re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')

I believe the currently accepted answer by Anirudha explains the differences between your use of positive lookbehind and non-capturing group well. However, the suggestion of removing the ? from after the positive lookbehind actually results in [' xyz', ' abc'] (note the included spaces).

The spaces appear because the positive lookbehind does not cover the space after the parenthesis, and the main negated character class [^.)] does not exclude spaces either.
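A short sketch to compare the variants side by side, using the input string from the question. The last pattern, with a capturing group instead of a lookbehind, is not from this thread; it is an alternative added here as an assumption about what may also fit the use case.

```python
import re

text = 'a) xyz. b) abc.'

# Mandatory lookbehind (only the '?' removed). \s* backs off to match nothing
# so the lookbehind can succeed right after ')', and [^.)]+ then takes the space.
without_question_mark = re.findall(r'\s*(?<=[a-z]\))[^.)]+', text)
print(without_question_mark)  # [' xyz', ' abc']

# Lookbehind that also covers the space, plus a class that excludes spaces.
with_space = re.findall(r'(?<=[a-z]\) )[^) .]+', text)
print(with_space)  # ['xyz', 'abc']

# Alternative (assumption, not from the thread): consume the label and any
# whitespace, and let findall return only the capturing group.
captured = re.findall(r'[a-z]\)\s*([^.)]+)', text)
print(captured)  # ['xyz', 'abc']
```

Because [^) .]+ excludes spaces, the lookbehind version stops at the first space inside an item, so 'a) foo bar.' would give ['foo']. The capturing-group variant would give ['foo bar'] for that input, which may or may not be what is wanted.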

Superordinate answered 10/8, 2017 at 13:49 Comment(0)
