Positive lookbehind vs non-capturing group: different behaviour

I use Python regular expressions (the re module) in my code and noticed different behaviour in these cases:

re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']

and

re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']

What I need to get is just ['xyz', 'abc']. Why do the examples behave differently, and how do I get the desired result?

Decarbonate answered 4/2, 2013 at 17:46 Comment(0)

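The reason a and b are included in the second case is that lookarounds don't consume any characters, and you have made (?<=[a-z]\)) optional. At the start of the string there is nothing behind the current position for the lookbehind to match, but because it is optional the engine skips it, and [^.)]+ matches a.

Now you are at ). [^.)]+ cannot match there, so the next attempt starts at the space: \s* matches it, the optional (?<=[a-z]\)) is skipped again, and [^.)]+ matches xyz, giving ' xyz'.

The same thing is repeated with b) abc.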

Remove the ? from the second case and you would get the expected result, i.e. ['xyz', 'abc'].
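The walk through the string described above can be reproduced by printing where each match starts and ends. This is only a small sketch using re.finditer on the string from the question; the variable names are made up for illustration.

```python
import re

text = 'a) xyz. b) abc.'
optional_lookbehind = r'\s*(?<=[a-z]\))?[^.)]+'

# Collect (start, end, matched text) for every match of the question's second regex.
trace = [(m.start(), m.end(), m.group())
         for m in re.finditer(optional_lookbehind, text)]

for start, end, matched in trace:
    print(start, end, repr(matched))
# 0 1 'a'
# 2 6 ' xyz'
# 7 9 ' b'
# 10 14 ' abc'
```

No match ever covers a ")": the character class excludes it, and the optional lookbehind never forces the engine to start after one.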

Stultify answered 4/2, 2013 at 17:53 Comment(12)
The non-capturing group in the first case is optional, too (if there is no a) in the text, then it matches the whole text). - Decarbonate
@chersanya That's why I said second case, not first case. There is a difference between them. - Stultify
@chersanya Also, a lookaround checks for the specified pattern but doesn't eat any characters, hence the result. - Stultify
Oh, I've got it) The real issue is that lookarounds don't consume anything, so findall finds a in a) too. - Decarbonate
Would you add the reason to your answer? - Decarbonate
@chersanya: "Doesn't consume anything" is not a good explanation. The text that is skipped can be considered consumed. The reason your original regex fails is plainly the ?. - Catalepsy
@nhahtdh: Are you sure? Lookbehind doesn't consume text, so the occurrences of a in a) and abc in abc are non-overlapping. If it consumed text, there would be no difference from the first case I provided. - Decarbonate
@chersanya: That look-behind doesn't consume text is correct. But since you made the look-behind optional, the regex is effectively \s*[^.)]+. Making a look-behind optional seems to be supported only in Python and I don't know why they allow it. It doesn't make sense to do such a thing, though. - Catalepsy
@nhahtdh: But if it consumed the text, the regex with lookbehind (my 2nd case) would be equivalent to the first case, which obviously differs from \s*[^.)]+? Or not (why)? - Decarbonate
@Catalepsy Lookarounds can be optional. It's allowed in .NET, but I agree that it really doesn't make sense. - Stultify
@chersanya: What I meant is that, due to the ?, the regex is made equivalent to \s*[^.)]+, since the result of the look-behind (whether true or false) doesn't stop the match. - Catalepsy
@chersanya: The "doesn't consume anything" argument may come into play in some other case, but not in this one. - Catalepsy
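The equivalence claimed in the comments (an optional lookbehind behaves like no lookbehind at all) can be checked with a quick comparison. This is a rough sketch; the sample strings other than the one from the question are invented for illustration.

```python
import re

with_optional_lookbehind = re.compile(r'\s*(?<=[a-z]\))?[^.)]+')
without_lookbehind = re.compile(r'\s*[^.)]+')

# The first sample is from the question; the others are made-up test inputs.
samples = ['a) xyz. b) abc.', 'no markers here.', 'x) y) z.']

results = {
    s: (with_optional_lookbehind.findall(s), without_lookbehind.findall(s))
    for s in samples
}

for s, (optional, plain) in results.items():
    print(repr(s), optional, optional == plain)
```

Both patterns return the same list for every sample, because a zero-width assertion followed by ? can either succeed or be skipped, and neither outcome moves the match position.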

The regex you are looking for is:

re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')

I believe the currently accepted answer by Anirudha explains the differences between your use of positive lookbehind and a non-capturing group well. However, the suggestion of removing the ? after the positive lookbehind actually results in [' xyz', ' abc'] (note the included spaces).

This is because the positive lookbehind there does not cover the space after the ), and the main character class [^.)] does not exclude spaces, so the space becomes part of each match.
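To compare the variants side by side, here is a short sketch run against the string from the question. The third pattern, which uses a capturing group, is not from this thread; it is just one possible alternative.

```python
import re

text = 'a) xyz. b) abc.'

# Lookbehind made mandatory (the ? removed): the space after ")" is
# still matched by [^.)]+, so it ends up in the result.
required = re.findall(r'\s*(?<=[a-z]\))[^.)]+', text)

# Space moved into the lookbehind and excluded from the character class.
strict = re.findall(r'(?<=[a-z]\) )[^) .]+', text)

# Alternative (an assumption, not from the thread): consume the marker
# and any whitespace, and capture only the text that follows.
captured = re.findall(r'[a-z]\)\s*([^.)]+)', text)

print(required)  # [' xyz', ' abc']
print(strict)    # ['xyz', 'abc']
print(captured)  # ['xyz', 'abc']
```

Note that the strict pattern stops at the first space, so it only returns the first word after each marker; the capturing-group variant keeps multi-word items but expects them to end at a . or ).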

Superordinate answered 10/8, 2017 at 13:49 Comment(0)
