regex, problem with backreference in pattern with preg_match_all

4

2

I wonder what the problem with the backreference is here:

preg_match_all('/__\((\'|")([^\1]+)\1/', "__('match this') . 'not this'", $matches);

It is expected to match the string between __('') but it actually returns:

match this') . 'not this

Any ideas?

Kaykaya answered 18/5, 2011 at 20:9 Comment(3)
Do back refs really work in char classes?Wroth
Sorry, there was a missing \; I have corrected the pattern.Kaykaya
On to the next solution.Cigar
-1

Make your regex ungreedy:

preg_match_all('/__\((\'|")([^\1]+)\1/U', "__('match this') . 'not this'", $matches);
Frayne answered 18/5, 2011 at 20:16 Comment(3)
Don't use back refs in char classes.Wroth
And what would you do if a string to match contains an escaped quote? Like this: __('match that\'s') . 'not this' :|Outlast
[^\1] matches any char but a char with octal value 1Groningen
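
On the escaped-quote question raised in the comments: one possible approach, sketched here as an assumption rather than something proposed in this thread, is to let the pattern consume a backslash together with the character after it before testing for the closing quote. The variable names are illustrative only.

```php
<?php
// Sketch: tolerate backslash-escaped quotes inside the __('...') argument.
// Group 1 is the opening quote; group 2 is the raw content, escapes included.
// \\\\ in the PHP string becomes \\ in the regex, i.e. one literal backslash.
$pattern = '/__\(([\'"])((?:\\\\.|(?!\1).)*)\1\)/s';

// The actual subject text is: __('match that\'s') . 'not this'
$subject = "__('match that\\'s') . 'not this'";

preg_match_all($pattern, $subject, $matches);

echo $matches[2][0], PHP_EOL; // match that\'s
```

The escape sequence is tried first at every position, so an escaped quote is consumed as content and never mistaken for the closing delimiter.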
6

You can't use a backreference inside a character class because a character class matches exactly one character, and a backreference can potentially match any number of characters, or none.

What you're trying to do requires a negative lookahead, not a negated character class:

preg_match_all('/__\(([\'"])(?:(?!\1).)+\1\)/',
    "__('match this') . 'not this'", $matches);

I also changed your alternation - \'|" - to a character class - [\'"] - because it's much more efficient, and I escaped the outer parentheses to make them match literal parentheses.
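
As a quick check of the lookahead approach, here is a minimal sketch. It adds one extra capturing group (group 2, not part of the pattern above) so the quoted text can be read directly, and the second __() call in the subject is made up for illustration.

```php
<?php
// (?!\1). consumes one character only if it is not the quote held in group 1.
// Group 2 is an addition for this sketch; it captures the text between the quotes.
$pattern = '/__\(([\'"])((?:(?!\1).)+)\1\)/';

$subject = "__('match this') . 'not this' . __(\"it's fine\")";

preg_match_all($pattern, $subject, $matches);

echo $matches[0][0], PHP_EOL; // __('match this')
echo $matches[2][0], PHP_EOL; // match this
echo $matches[2][1], PHP_EOL; // it's fine
```

Because the lookahead tests against whichever quote opened the string, the other kind of quote can appear freely inside it.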


EDIT: I guess I need to expand that "more efficient" remark. I took the example Friedl used to demonstrate this point and tested it in RegexBuddy.

Applied to target text abababdedfg,
^[a-g]+$ reports success after three steps, while
^(?:a|b|c|d|e|f|g)+$ takes 55 steps.

And that's for a successful match. When I try it on abababdedfz,
^[a-g]+$ reports failure after 21 steps;
^(?:a|b|c|d|e|f|g)+$ takes 99 steps.

In this particular case the impact on performance is so trivial it's not even worth mentioning. I'm just saying whenever you find yourself choosing between a character class and an alternation that both match the same things, you should almost always go with the character class. Just a rule of thumb.
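
The step counts above come from RegexBuddy's debugger. For a rough comparison in PHP itself, a timing loop along these lines could be used. This is only a sketch: the subject length and iteration count are arbitrary choices, and the numbers will vary with the PCRE version and JIT settings, to the point that the gap may be small or absent.

```php
<?php
// Rough timing sketch; absolute numbers depend on the PCRE version and JIT settings.
$subject = str_repeat('abababdedfg', 10);

$patterns = [
    'class'       => '/^[a-g]+$/',
    'alternation' => '/^(?:a|b|c|d|e|f|g)+$/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $start = hrtime(true);                       // nanoseconds, PHP 7.3+
    for ($i = 0; $i < 100000; $i++) {
        $results[$label] = preg_match($pattern, $subject);
    }
    $elapsedMs = (hrtime(true) - $start) / 1e6;  // convert to milliseconds
    printf("%-12s %.2f ms\n", $label, $elapsedMs);
}
```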

Kwabena answered 18/5, 2011 at 21:14 Comment(3)
"Much more", really? How much more?Wroth
Yes, alternations are very slow, a plague that should be avoided. However, assertions are even slower.Cigar
@sln: That depends on how well the assertion is written and how it's used (just like regexes themselves). Anyway, the flexibility they provide is well worth the performance hit in most cases. But there's no excuse for using something like (a|b|c) when you can use [abc] instead.Kwabena
2

I'm surprised it didn't give you an unbalanced parenthesis error message.

 /
   __
   (
       (\'|")
       ([^\1]+)
       \1
 /

This [^\1] will not take the contents of capture buffer 1 and put it into a character
class. It is the same as all characters that are NOT '1'.

Try this:

/__\(('|").*?\1\).*/

You can add an inner capturing group to capture just what's between the quotes:
/__\(('|")(.*?)\1\).*/

Edit: If no inner delimiter is allowed, use Qtax's regex.
Since ('|").*?\1, even though non-greedy, will still match all the way up to the trailing anchor (in this case __('all'this'will"match')), it's better to use ('[^']*'|"[^"]*") as
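
The difference can be seen with a short sketch. The second pattern here is a per-quote negated-class pattern of the kind suggested elsewhere in this thread, and the variable names are illustrative.

```php
<?php
// The actual subject text is: __('all'this'will"match')
$subject = "__('all'this'will\"match')";

// Lazy .*? still runs on until it reaches a quote that is followed by ")".
preg_match('/__\((\'|")(.*?)\1\)/', $subject, $lazy);
echo $lazy[2], PHP_EOL; // all'this'will"match

// Per-quote negated classes cannot cross a quote of the same kind,
// so this subject is not matched at all.
$found = preg_match('/__\(("[^"]+"|\'[^\']+\')\)/', $subject);
var_dump($found); // int(0)
```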

Cigar answered 18/5, 2011 at 20:34 Comment(1)
Technically I think \1 will be interpreted in a character class as the character with an octal value of 1, but the point is essentially the same nonetheless (i.e. [^\1] is not doing what the OP thinks it is). Example.Cacodyl
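
The octal interpretation can be checked with a small sketch like the following, which assumes PHP's PCRE-based preg functions; the anchored one-character pattern is made up for the test.

```php
<?php
// Inside a character class, \1 is an octal escape for the character with code 1,
// so [^\1] excludes only that control character.
$pattern = '/^[^\1]$/';

$quote   = preg_match($pattern, "'");     // a quote is allowed
$digit   = preg_match($pattern, "1");     // the digit 1 is allowed too
$control = preg_match($pattern, "\x01");  // only chr(1) is rejected

var_dump($quote, $digit, $control); // int(1) int(1) int(0)
```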
1

You can use something like: /__\(("[^"]+"|'[^']+')\)/

Wroth answered 18/5, 2011 at 20:31 Comment(6)
This would be the preferred method if no inner delimiter is allowed.Cigar
Just to note that the drawback with this method is that it's impossible to capture the inner data without the delimiter included.Cigar
@sln: Sure you can capture it; just use a different group for each subpattern: ~__\(("(?<DQ>[^"']+)"|'(?<SQ>[^"']+)')\)~Kwabena
@sln @Alan, or you can use (?|...) if your flavor supports it. Eg: /__\((?|"([^"]+)"|'([^']+)')\)/. Alan, I wouldn't use [^"'], the whole point of different quoting chars is that you can use one inside the other. ;)Wroth
@Alan - I should have said, there's no clean way to get an inner capture without post-processing logic on the independent capture buffers to determine which one captured.Cigar
@Wroth - Branch reset (if available) is an alternative; however, there are idiosyncrasies across engines that make me wary of it. When I originally said it's impossible, I meant that it requires some platform-dependent post-processing, i.e. possible caveats.Cigar
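
For reference, the branch-reset variant from the comments can be tried in PHP, whose PCRE flavor supports (?|...). This is a minimal sketch, and the second __() call in the subject is made up for illustration.

```php
<?php
// Branch reset (?|...) numbers the groups in each alternative from the same
// starting point, so the quoted text lands in group 1 whichever branch matches.
$pattern = '/__\((?|"([^"]+)"|\'([^\']+)\')\)/';

$subject = "__('match this') . 'not this' . __(\"it's here\")";

preg_match_all($pattern, $subject, $matches);

print_r($matches[1]); // match this, it's here
```

With this form no post-processing is needed to find out which alternative captured, at the cost of relying on a feature that not every regex flavor provides.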