Parsing multiple names - Lookbehind in the middle of regex doesn't work

I am having trouble getting this regex to work and none of the canned ones I have found work reliably.

The desired result:

Produce the following via regex matches:

"Person One"
"Person Two"
"Person Three"

Out of these example lines:

By Person One, Person Two and Person Three
By Person One, Person Two
By Person One
By Person Two and Person Three

Here is what I have. Note that if I break the pattern into sections, each section matches on its own, but something about the lookbehind throws off the combined pattern. Also, is there a simpler but still reliable way to pull out all the "Persons", whether one, two, or three are given with an "and"? It does not have to support more than three, but I would think that as long as the "and" comes last, the number of "Persons" can vary without affecting the regex.

Saved current attempt (it matches one name, but if you split off my "and" lookbehind and run it by itself, it does match all the "and" lines):

(?<=by )((\w+) (\w+))(?:,\s*)?((\w+) (\w+))?(?:\s*(?<=and ))((\w+) (\w+))

https://regex101.com/r/z3Y9TQ/1

Freon answered 11/5, 2018 at 0:10 Comment(4)
You need to make the group with the "and" lookbehind optional. - Barrett
Why do you have capture groups around \w+? Do you need to capture the first name and last name separately? - Barrett
How about this `` (\w+ \w+)[,\s]`` ? regex101.com/r/BJ3eca/1 If I misunderstand your question, I'm sorry. - Obelia
@Obelia I love simplest first. However, if you change the "and" to "but", yours still matches. Sorry I wasn't clearer: the "and" is part of the sentence form (comma, comma, and ...), so it can't be optional. - Freon

Instead of using a Lookbehind to check for "and", you can use a non-capturing group, like you did with the comma:

(?<=by )(\w+ \w+)(?:,\s*)?(\w+ \w+)?(?:\sand\s)?(\w+ \w+)?

Note that you don't need to add each \w+ in a group.
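As a rough sketch of how the pattern above behaves on the four example lines, here it is in Python. The re.IGNORECASE flag is my assumption (so that "by" matches "By"), and extract_names is just a hypothetical helper name, not something from the question:

```python
import re

# Pattern from the answer above. IGNORECASE is an assumption so "by" matches "By".
PATTERN = re.compile(
    r"(?<=by )(\w+ \w+)(?:,\s*)?(\w+ \w+)?(?:\sand\s)?(\w+ \w+)?",
    re.IGNORECASE,
)

def extract_names(line):
    """Return the names captured from a single byline, skipping unused groups."""
    match = PATTERN.search(line)
    if match is None:
        return []
    # Optional groups that did not participate in the match come back as None.
    return [name for name in match.groups() if name]

for line in (
    "By Person One, Person Two and Person Three",
    "By Person One, Person Two",
    "By Person One",
    "By Person Two and Person Three",
):
    print(extract_names(line))
```

Filtering out the None entries matters because the second and third groups are optional and stay empty on the shorter lines.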

Try it online.


Lookbehind in the middle of regex:

The reason the Lookbehind doesn't work here is that you have it in the middle of your pattern, after parts that have already consumed text. A Lookbehind is a zero-width assertion: it consumes nothing and only checks that the text immediately before the current position matches what's inside it. In (?<=prior)subsequent, the engine checks that prior ends right at the current position and then matches subsequent from there. When something comes before the Lookbehind in the pattern, both conditions apply to the same position: the text ending there must have been matched by the preceding part of the pattern, and it must also end with prior. Nothing in your pattern consumes the "and " itself, so the engine never stands at a position that is preceded by "and ". See where the problem comes from?

Therefore, in your example, the only way to match the full sentence with the Lookbehind in the middle is to also include the and in the pattern which makes the Lookbehind redundant.

To illustrate, take a look at this demo. As you can see, the pattern (?<=and )Person matches Person when it comes after and. Now let's change it to Two (?<=and )Person. You'd probably think it'll work, but it actually finds no matches: after matching "Two ", the engine is at a position preceded by "Two ", not "and ", so the Lookbehind fails right there (and the text that follows is "and", not "Person", anyway).

The only way to make the Lookbehind work in this case, is to also include the and right after the Two like this: Two and (?<=and )Person, which makes the Lookbehind redundant as explained above.
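The three cases can be reproduced with a short Python sketch. The sample string is an assumption (one of the lines from the question), and the variable names are only for illustration:

```python
import re

text = "By Person Two and Person Three"

# Lookbehind at the start of the pattern: only the "Person" after "and " qualifies.
at_start = re.findall(r"(?<=and )Person", text)

# Lookbehind in the middle: right after "Two " the preceding text is "Two ",
# not "and ", so the assertion fails and there is no match.
in_middle = re.search(r"Two (?<=and )Person", text)

# Consuming "and " in the pattern makes the lookbehind succeed, and also
# makes it redundant, since "and " has already been matched.
redundant = re.search(r"Two and (?<=and )Person", text)

print(at_start)            # ['Person']
print(in_middle)           # None
print(redundant.group(0))  # Two and Person
```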

A similar behavior, but for Lookaheads (i.e., when a Lookahead comes in the middle), is very well explained in this awesome answer by revo.

Hope that helps.

Imidazole answered 11/5, 2018 at 0:34 Comment(4)
You went with my original go-to. Do you happen to know what I did wrong, or can you get the "and" lookbehind approach to work? TIA! - Freon
I should have said I had the \w+ grouped so I could optionally also capture the first/last names separately, but you're absolutely right. What I can't figure out is why the darn lookbehind works fine if I literally cut it and paste it by itself, so I know it's syntactically correct, but I'm missing something else. - Freon
I updated my answer to explain why Lookbehind doesn't work in this case. - Imidazole
@CollinChaffin I updated the answer again because the previous explanation was not accurate (just in case you have read it already). Hope this explanation makes sense to you. - Imidazole

I can't seem to get the lookbehind for and working, but this works with a non-capturing group:

(?<=by )(\w+ \w+)(?:, *)?(\w+ \w+)?(?: *)(?:and (\w+ \w+))?

I changed \s to space in the regexp so it won't match the newlines.
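Here is a small Python sketch of why that matters when all the lines are in one string. The re.IGNORECASE flag and the helper name names_per_byline are assumptions for illustration, and the last two searches are a stripped-down contrast between \s and a literal space rather than the full pattern:

```python
import re

# Pattern from this answer. IGNORECASE is an assumption so "by" matches "By".
PATTERN = re.compile(
    r"(?<=by )(\w+ \w+)(?:, *)?(\w+ \w+)?(?: *)(?:and (\w+ \w+))?",
    re.IGNORECASE,
)

def names_per_byline(text):
    """Return one list of names for each match found in a multi-line string."""
    return [[g for g in m.groups() if g] for m in PATTERN.finditer(text)]

sample = (
    "By Person One, Person Two and Person Three\n"
    "By Person One, Person Two\n"
    "By Person One\n"
    "By Person Two and Person Three"
)
print(names_per_byline(sample))

# \s also matches "\n", so it can run on into the next line; a literal space cannot.
crosses = re.search(r"One,\s*(\w+)", "By Person One,\nBy Person Two")
stays = re.search(r"One, *(\w+)", "By Person One,\nBy Person Two")
print(crosses.group(1))  # By
print(stays)             # None
```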

DEMO

Barrett answered 11/5, 2018 at 0:34 Comment(1)
Yours is very similar to @ahmed's above, but yours is also capturing the extra trailing space. I too could not get the lookbehind working and still can't figure out why, but these are all awesome contribs, so thanks! - Freon
