I set up a complex regex to extract data from a page of text. For some reason the order of the alternation is not what I expect. A simple example would be:
((13th |(Executive |Residential )|((\w+) ){1,3})Floor)
Put simply, I am trying to get either a floor number or a known named floor and, as a backup, capture 1-3 unknown words followed by Floor so I can review them later (I in fact use a group name to identify this, but didn't want to confuse the issue).
The issue is if the string is
on the 13th Floor
I don't get 13th Floor
I get on the 13th Floor
which seems to indicate it is matching the 3rd alternation. I'd have expected it to match 13th Floor. I set this up specifically (or so I thought) to prioritize the types of matches and leave the vague ones for last only if the others are missed. I guess they weren't kidding when they said Regex is greedy but I am unclear how to set this up to be 'greedy' and behave the way I want.
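For anyone who wants to reproduce this, here is a minimal sketch in Python's re module (my choice of engine; the question does not name one). It assumes each specific alternative is followed by a space before Floor, since otherwise "13th Floor" could never match through the first alternative.

```python
import re

# Pattern from the question. The trailing spaces after "13th" and
# "Residential" are an assumption so those alternatives can reach "Floor".
pattern = re.compile(r"((13th |(Executive |Residential )|((\w+) ){1,3})Floor)")

text = "on the 13th Floor"
match = pattern.search(text)

print(match.group(1))  # on the 13th Floor
print(match.start())   # 0

# With nothing in front of it, the first alternative wins as intended.
print(pattern.search("13th Floor").group(1))  # 13th Floor
```

The match starts at index 0, which is why the third alternative is the one that succeeds here.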
The \w+ (or the {1,3}) quantifiers are not the problem. It's the fact that an NFA regex engine returns the leftmost match: it tries every alternative at the earliest starting position before moving on to the next position. As long as the third option can match from an earlier position (any word preceding 13th Floor will do), the other two options will never get a chance to match, regardless of the greediness/laziness of any of the quantifiers. – Rainproof
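To illustrate the comment, the sketch below (Python's re module, which is an assumption about the engine) first shows that making the quantifier lazy changes nothing. It then shows one possible way to get the intended priority: search for the specific floors first and run the catch-all only if that finds nothing. The names SPECIFIC, FALLBACK and find_floor are hypothetical, and the two-pass approach is just one option, not necessarily what the commenter had in mind.

```python
import re

text = "on the 13th Floor"

# A lazy {1,3}? still matches from index 0: the start position is fixed
# before the quantifier decides how many words to take.
lazy = re.compile(r"((13th |(Executive |Residential )|((\w+) ){1,3}?)Floor)")
print(lazy.search(text).group(1))  # on the 13th Floor

# Hypothetical two-pass approach: specific patterns first, catch-all second.
SPECIFIC = re.compile(r"\b(13th|Executive|Residential) Floor\b")
FALLBACK = re.compile(r"\b((?:\w+ ){1,3})Floor\b")


def find_floor(s):
    """Return (kind, matched_text), or None if no floor is mentioned."""
    m = SPECIFIC.search(s)
    if m:
        return ("known", m.group(0))
    m = FALLBACK.search(s)
    if m:
        return ("review", m.group(0))
    return None


print(find_floor(text))  # ('known', '13th Floor')
```

Because the specific pattern scans the whole string before the catch-all is ever tried, a known floor wins no matter how many words come before it.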