Nested regex lookahead and lookbehind
T

5

9

I am having problems with nested positive/negative lookahead and lookbehind in regex.

Let's say that I want to replace each '*' in a string with '%', and that '\' escapes the next character. (I am turning a regex into a SQL LIKE pattern ^^.)

So the string

  • '*test*' should be changed to '%test%',
  • '\\*test\\*' -> '\\%test\\%', but
  • '\*test\*' and '\\\*test\\\*' should stay the same.

I tried:

(?<!\\)(?=\\\\)*\*      but this doesn't work
(?<!\\)((?=\\\\)*\*)    ...
(?<!\\(?=\\\\)*)\*      ...
(?=(?<!\\)(?=\\\\)*)\*  ...

What is the correct regex that will match the '*'s in examples given above?

What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\*? Or, if both are essentially wrong, what is the difference between regexes that have this kind of visual construction?

Tompkins answered 23/10, 2011 at 15:45 Comment(1)
What language do you use? And do you really expect that \*test\* stays the same and is not turned into *test*?Antalya
F
11

To find an unescaped character, you look for a character that is preceded by an even number of escape characters (zero included). This is relatively straightforward.

(?<=(?<!\\)(?:\\\\)*)\*        # this is explained in Tim Pietzcker's answer

Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:

(?=(?<!\\)(?:\\\\)*\*)(\\*)\*  # also look at ridgerunner's improved version

Replace this with the contents of group 1 and a % sign.

Explanation

(?=           # start look-ahead
  (?<!\\)     #   a position not preceded by a backslash (via look-behind)
  (?:\\\\)*   #   an even number of backslashes (don't capture them)
  \*          #   a star
)             # end look-ahead. If found,
(             # start group 1
  \\*         #   match any number of backslashes in front of the star
)             # end group 1
\*            # match the star itself

The look-ahead makes sure that only an even number of backslashes (zero included) can precede the star. The backslashes still have to be matched into a group, because the look-ahead does not advance the position in the string.
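For anyone who wants to try the look-ahead version, here is a minimal JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later, for example a recent Node.js), which was not available when this answer was written. The helper name starsToPercentLookahead is made up for the example.

```javascript
// Sketch only: requires an engine that accepts (?<!...) lookbehind.
// The look-ahead checks that the star is unescaped; group 1 then consumes
// the even run of backslashes so the replacement can put them back.
const unescapedStar = /(?=(?<!\\)(?:\\\\)*\*)(\\*)\*/g;

// Hypothetical helper name, not part of the original answer.
function starsToPercentLookahead(text) {
  return text.replace(unescapedStar, '$1%');
}

// String.raw keeps backslashes literal, so the inputs read like the question's.
console.log(starsToPercentLookahead(String.raw`*test*`));      // %test%
console.log(starsToPercentLookahead(String.raw`\\*test\\*`));  // \\%test\\%
console.log(starsToPercentLookahead(String.raw`\*test\*`));    // \*test\*
```

The replacement string is group 1 plus a % sign, exactly as described above.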

Familiar answered 23/10, 2011 at 16:16 Comment(1)
Good point (also @ridgerunner) about indefinite-length lookbehind. Not everyone is using .NET or JGSoft regex engines.Rood
U
9

Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:

Replace: ((?<!\\)(?:\\\\)*)\* with $1%

Here it is in the form of a commented PHP snippet:

// Replace all non-escaped asterisks with "%".
$re = '%             # Match non-escaped asterisks.
    (                # $1: Any/all preceding escaped backslashes.
      (?<!\\\\)      # At a position not preceded by a backslash,
      (?:\\\\\\\\)*  # Match zero or more escaped backslashes.
    )                # End $1: Any preceding escaped backslashes.
    \*               # Unescaped literal asterisk.
    %x';
$text = preg_replace($re, '$1%', $text);

Addendum: Non-lookaround JavaScript Solution

The above solution does require lookbehind, so it will not work in JavaScript. The following JavaScript solution does not use lookbehind:

text = text.replace(/(\\[\S\s])|\*/g,
    function(m0, m1) {
        return m1 ? m1 : '%';
    });

This solution replaces each backslash-plus-any-character pair with itself, and each remaining * with a % sign.
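As a quick check, the callback can be wrapped in a function and run against the strings from the question plus the **test** case. The wrapper name escapeAwareReplace and the sample list are only there for illustration.

```javascript
// Hypothetical wrapper around the callback-based replace shown above.
function escapeAwareReplace(text) {
  // Alternative 1 consumes a backslash together with the character it escapes;
  // alternative 2 can therefore only ever see an unescaped asterisk.
  return text.replace(/(\\[\S\s])|\*/g, (m0, m1) => (m1 ? m1 : '%'));
}

// String.raw keeps the backslashes literal.
const samples = [
  String.raw`*test*`,
  String.raw`\\*test\\*`,
  String.raw`\*test\*`,
  String.raw`\\\*test\\\*`,
  '**test**',
];

for (const s of samples) {
  console.log(s, '->', escapeAwareReplace(s));
}
```

Only the first, second and last samples should change; the two with escaped asterisks come back untouched.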

Edit 2011-10-24: Fixed the JavaScript version to correctly handle cases such as **text**. (Thanks to Alan Moore for pointing out the error in the previous version.)

Ugric answered 23/10, 2011 at 16:46 Comment(2)
+1 for simplifying @Tim's regex, but your JavaScript-safe version fails on **test**. :-/ I don't think this is doable in a single JS replace operation.Lathrop
@Alan Moore - quite right. Thanks for the keen eye! However, this can be done with one replace() that uses a callback function. See latest incarnation.Ugric
L
5

Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution (demo here):

s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;

The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.

The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.

I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
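For what it's worth, JavaScript engines that implement the ES2015 sticky flag can approximate \G: with both g and y set, replace() only accepts matches that start exactly where the previous one ended. The sketch below rests on that assumption. It also uses [\s\S] instead of the dot so that an escaped newline counts as a pair, which is a small deviation from the Perl one-liner, and the helper name starsToPercentSticky is made up.

```javascript
// Sketch: g + y makes replace() take only back-to-back matches, which plays
// the role of \G. Group 1 is the unrolled loop: runs of ordinary characters
// interleaved with backslash-plus-character pairs.
const upToUnescapedStar = /([^*\\]*(?:\\[\s\S][^*\\]*)*)\*/gy;

// Hypothetical helper name, not part of the original answer.
function starsToPercentSticky(text) {
  return text.replace(upToUnescapedStar, '$1%');
}

console.log(starsToPercentSticky(String.raw`*test*`));          // %test%
console.log(starsToPercentSticky(String.raw`\\*test\\*`));      // \\%test\\%
console.log(starsToPercentSticky(String.raw`\\\*test\\\*`));    // \\\*test\\\*
```

Once no further unescaped asterisk can be reached from the current position, the sticky match fails and the rest of the string is left alone, which is the same behaviour \G gives in Perl.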

Lathrop answered 23/10, 2011 at 23:39 Comment(0)
R
4

So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?

Search for

(?<=(?<!\\)(?:\\\\)*)\*

and replace with %.

Explanation:

(?<=       # Assert that it's possible to match before the current position...
 (?<!\\)   # (unless there are more backslashes before that)
 (?:\\\\)* # an even number of backslashes
)          # End of lookbehind
\*         # Then match an asterisk
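A small JavaScript sketch of this search-and-replace follows. It assumes an engine whose lookbehind may be of variable length, which is what ES2018 specifies and what recent Node.js versions provide; in engines with fixed-width lookbehind only, the pattern will be rejected. The helper name starsToPercentLookbehind is invented for the example.

```javascript
// Sketch only: relies on variable-length lookbehind support.
// The star matches only if an even run of backslashes (possibly empty),
// itself not preceded by another backslash, ends right before it.
const starAfterEvenBackslashes = /(?<=(?<!\\)(?:\\\\)*)\*/g;

// Hypothetical helper name, not part of the original answer.
function starsToPercentLookbehind(text) {
  // Only the star itself is matched, so the replacement is just '%'.
  return text.replace(starAfterEvenBackslashes, '%');
}

console.log(starsToPercentLookbehind(String.raw`*test*`));      // %test%
console.log(starsToPercentLookbehind(String.raw`\\*test\\*`));  // \\%test\\%
console.log(starsToPercentLookbehind(String.raw`\*test\*`));    // \*test\*
```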
Rood answered 23/10, 2011 at 16:1 Comment(3)
Close, but (as you know), very few regex engines support variable length lookbehind. Change the lookbehind to capture group $1 and the replace string to: $1%, and then it should work for most (but still not js).Ugric
Hm, true. Let's hope he's using .NET, then :)Rood
Now, since bliof has specified that (s)he's using Perl, I would normally have retracted my answer since it's not working in Perl because of the restrictions mentioned above. But since other answers are referencing this one, I'll be leaving it here.Rood
B
0

The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:

  • Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).

  • Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).

The result is a short, simple pattern that doesn't need lookarounds or callbacks, and it's shorter than anything else I see so far.

/(?<!\\)((?:\\.)*)\*/g

And the replacement string:

"$1%"

This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:

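// $1 is the character before the chain (or the start of the string);
// $2 is the whole chain of escaped characters, captured as one group
// so that no pair is lost when it is written back.
var regx = /(^|[^\\])((?:\\.)*)\*/g;
// Repeat until no unescaped asterisk is left: a single pass can miss a star
// whose preceding character was consumed by the previous match.
while (input.match(regx)) {
    input = input.replace(regx, '$1$2%');
}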

There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.

Battle answered 16/10, 2012 at 20:38 Comment(2)
Unfortunately there is a problem with this regex. When you try to match something like *\\*, it will fail (you will get the first star with the ^\* but the second one will go to [^\\] and there won't be anything for the \* to match)Tompkins
@Tompkins - You're right. And I can't think of any way to solve this in JS-style regex without a callback or a loop. In other variations of this, where you're doing something other than replacing * (such as counting escaped backslashes or something), this would still work. I don't think a one-liner can do this in JavaScript, but I'll edit with something that will. Thanks for correcting me, I had a strong feeling it was too good to be true.Battle
