RegEx: Look-behind to avoid odd number of consecutive backslashes

I have user input where some tags are allowed inside square brackets. I've already written the regex pattern to find and validate what's inside the brackets.

In the user input field, an opening bracket ([) can be escaped with a backslash, and a backslash can itself be escaped with another backslash (\\). I need a look-behind sub-pattern that rejects an odd number of consecutive backslashes before the opening bracket.

At the moment I must deal with something like this:

(?<!\\)(?:\\\\)*\[(?<inside_brackets>.*?)]

It works fine, but the problem is that this pattern still consumes any pairs of consecutive backslashes in front of the bracket as part of the match (even though they are not captured), and the look-behind only checks whether another single backslash precedes those pairs (or the opening bracket directly). I'd like to keep all of them inside the look-behind group, if possible.

Example:

my [test] string is ok
my \[test] string is wrong
my \\[test] string is ok
my \\\[test] string is wrong
my \\\\[test] string is ok
my \\\\\[test] string is wrong
...
etc
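
For reference, the behaviour described above can be reproduced with a small sketch. It is written in Python rather than PHP, so the named group uses Python's (?P<name>...) spelling, and the group name is assumed to be inside_brackets (a group name cannot contain a space). The look-behind is the same fixed-width (?<!\\). The helper name find_tags is hypothetical.

```python
import re

# The pattern from the question, in Python syntax.
# (?<!\\)     - no backslash directly before the match start
# (?:\\\\)*   - any number of backslash pairs (consumed, not captured)
# \[ ... ]    - the tag itself
TAG = re.compile(r'(?<!\\)(?:\\\\)*\[(?P<inside_brackets>.*?)]')

def find_tags(text):
    """Return (whole match, tag content) for every unescaped tag."""
    return [(m.group(0), m.group('inside_brackets')) for m in TAG.finditer(text)]

if __name__ == '__main__':
    samples = [
        r'my [test] string is ok',
        r'my \[test] string is wrong',
        r'my \\[test] string is ok',
        r'my \\\[test] string is wrong',
        r'my \\\\[test] string is ok',
    ]
    for s in samples:
        print(s, '->', find_tags(s))
```

Note that for an even number of backslashes the whole match includes the backslash pairs, which is exactly the side effect the question is trying to avoid.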

I'm working with PHP (PCRE).

Boss answered 8/3, 2012 at 6:0 Comment(2)
Is there a finite limit to how many odd ones? Would 1,3,5,and 7 be enough to avoid? I assume you will let through 2,4,6,8 though?Slosberg
@Slosberg Unfortunately no, it's almost infinite. I found some examples in my database with 40+ consecutive slashes. Some guys are using them to make ASCII 'drawings' then use tags to color some elements or make hyperlinks.Boss
M
12

Last time I checked, PHP did not support variable-length lookbehinds. That is why you cannot use the trivial solution (?<![^\\](?:\\\\)*\\).
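
The restriction is easy to observe in an engine with a similar rule. The sketch below uses Python's re module, not PHP, as a stand-in: it also refuses look-behinds whose length is not fixed. PCRE is slightly more lenient (it accepts top-level alternatives of different fixed lengths), but an unbounded repetition such as (?:\\\\)* inside a look-behind is a problem in both. The helper name compiles is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

if __name__ == '__main__':
    # Fixed-width look-behind: a single backslash.
    print(compiles(r'(?<!\\)\['))
    # Variable-length look-behind: any number of backslash pairs plus one.
    print(compiles(r'(?<![^\\](?:\\\\)*\\)\['))
```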

The simplest workaround would be to match the entire thing, not just the brackets part:

(?<!\\)((?:\\\\)*)\[(?<inside_brackets>.*?)]

The difference is that now, if you're using that regex in a preg_replace, you have to remember to prefix the replacement string with $1 to put the matched backslashes back.
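
A minimal sketch of that workaround, written in Python instead of PHP (so re.sub stands in for preg_replace, group 1 for $1, and the named group uses (?P<name>...)). The function name render_tags and the <tag:...> output format are made up for illustration only.

```python
import re

# Group 1 captures the backslash pairs so they can be written back.
TAG = re.compile(r'(?<!\\)((?:\\\\)*)\[(?P<inside_brackets>.*?)]')

def render_tags(text):
    """Replace each unescaped [tag] and keep the preceding backslash pairs."""
    def replacement(m):
        # Prefix with group 1 (the "$1" of the answer) to restore the backslashes.
        return m.group(1) + '<tag:' + m.group('inside_brackets') + '>'
    return TAG.sub(replacement, text)

if __name__ == '__main__':
    print(render_tags(r'my \\[test] string'))
    print(render_tags(r'my \[test] string'))
    print(render_tags('This [is] my [test][string]'))
```

Dropping the group 1 prefix from the replacement would silently delete the backslash pairs in front of every replaced tag.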

Meryl answered 8/3, 2012 at 6:42 Comment(5)
+1 I found in the manual that there are some limitations inside a look-behind sub-pattern, so I guess you're right about variable length. Matching the entire string and pulling out just what's inside the brackets is not a problem. I'm doing that at the moment. Some regex flavors, such as .NET, allow a full pattern in look-behinds, but I was wondering whether it is possible in PCRE. Btw, I'm using that pattern in preg_match_all(). However, thanks for your answer.Boss
No, it is not possible in PCRE; the whole-string-matching thing is simply a workaround for that. It provides the same functionality, at the cost of having to re-add those characters yourself, and excluding the extra matched region from the possible matches. This is not a problem here since the part of the string in question can only contain backslashes, so there cannot be a brackets match there.Meryl
@Wh1T3h4Ck5: The regex you accepted (?<![^\\])<etc...> was incorrect. It was doing a negative lookbehind for a negated character class (containing a backslash), thereby making it a positive lookbehind for a backslash. You need to use (?<!\\) instead! I took the liberty to edit this answer.Outstation
@TimPietzcker - yes, I saw that earlier, but I accepted this answer because there's no solution for my problem in PCRE and opening sentence of this answer explains why.Boss
@Tim, (?<![^\\]) is not equivalent to (?<=\\). The former will match at the beginning of the string if there's a match to be had there, while the latter requires the presence of at least one intervening character (i.e., a backslash). And yes, I know you're actually using (?<!\\) and not (?<=\\) (and correctly so, IMHO), but I couldn't let that remark go unchallenged. ;)Carlotacarlotta
M
0

You could do it without any look-behinds at all (the (\\\\|[^\\]) alternation eats anything but a single back-slash):

^(\\\\|[^\\])*\[(?<brackets>.*?)\] 
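
A quick way to see what this pattern accepts is the sketch below. It is in Python rather than PHP, so the named group is spelled (?P<brackets>...); the helper names anchored_tag and match_count are hypothetical. Because of the ^ anchor, the pattern can match at most once per string when multiline mode is off.

```python
import re

# Look-behind-free variant: the prefix is built only from backslash pairs
# and non-backslash characters, so a lone backslash can never precede "[".
ANCHORED = re.compile(r'^(\\\\|[^\\])*\[(?P<brackets>.*?)\]')

def anchored_tag(text):
    """Return the content of the tag the anchored pattern finds, or None."""
    m = ANCHORED.search(text)
    return m.group('brackets') if m else None

def match_count(text):
    """Number of matches the anchored pattern yields for one string."""
    return sum(1 for _ in ANCHORED.finditer(text))

if __name__ == '__main__':
    for s in (r'my [test] string is ok', r'my \[test] string is wrong',
              r'my \\[test] string is ok', r'my \\\[test] string is wrong'):
        print(s, '->', anchored_tag(s))
    print(match_count('This [is] my [test][string]'))
```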
Mandi answered 8/3, 2012 at 10:29 Comment(7)
I need the backslashes as part of the look-behind group. I already have plenty of solutions without look-behinds; one of them, which works perfectly, is posted in the question above. I don't need alternative ways to do the same job. Etienne Perot says in his answer that what I'm looking for is impossible with PCRE, so my options are to believe he's wrong (which I highly doubt) or to rewrite the entire project using .NET, because so far .NET is the only one with a regex flavor that supports a full pattern in look-behind.Boss
Btw, your example has two huge mistakes: 1. the ^ anchor only searches at the beginning of the string; 2. the group (\\\\|[^\\]) requires at least one character before the opening bracket, which doesn't work if the document starts with a tag.Boss
@Wh1T3h4Ck5: Change the + to an * asterisk, and it works at the start of the string too. Pretty obvious.Mandi
@Wh1T3h4Ck5: And the answer posted above DOES have look-behinds in it, what do you think this is: (?<!\\) ?Mandi
Yes mate, it's one of the reasons why I accepted that answer. Btw, the pattern from that answer is an exact copy of the one I originally posted in the question. Look at this example, "This [is] my [test][string]", and tell me whether your pattern matches all the tags: is, test and string. Also, my question says "I need to avoid them all (backslashes) inside look-behind group if possible" and your answer just doesn't do that. Given the original question, I expected an answer like "Yes, that's possible", followed by a look-behind pattern, or "No, it's not possible". Simple as that.Boss
My pattern would've matched the word "is" if run on that test string, which is exactly what it tries to do - I can only read what you actually wrote down, not what's in your head. And another thing: my answer is not a critique of Etienne's answer in any way, nor a proposal that you should not use look-behinds, but merely an attempt to offer a different perspective.Mandi
