Extending regular expression syntax to say 'does not contain text XYZ'

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?

My app is written in C# .NET 3.5.

My plan (before I got the awesome answers to this question...)

Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that must not match anywhere in the text being tested.

So I might use some regexes like this (contrived) example:

on (this|that|these) day(s)?¬(every|all) day(s)?

Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.

In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:

    // Requires: using System.Text.RegularExpressions;
    public bool IsMatchExtended(string textToTest, string extendedRegex)
    {
        int notPosition = extendedRegex.IndexOf('¬');

        // Just a normal regex:
        if (notPosition == -1)
            return Regex.IsMatch(textToTest, extendedRegex);

        // Use a positive (normal) regex and a negative one
        string positiveRegex = extendedRegex.Substring(0, notPosition);
        string negativeRegex = extendedRegex.Substring(notPosition + 1);

        return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
    }

Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
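One way the escape-aware split could look is sketched below. It is only an illustration under assumptions that are not in the question: `\¬` is taken to mean a literal ¬, only the first unescaped ¬ separates the two halves, and the names `ExtendedRegex` and `Split` are made up. The sample pattern is the contrived one from above, with the negative half written as `(every|all) day(s)?`.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ExtendedRegex
{
    private const char Not = '¬';

    // Splits "positive¬negative" at the first unescaped ¬.
    // Assumption: "\¬" means a literal ¬. Every other backslash pair
    // (\\, \d, \. and so on) is copied through untouched for the regex engine.
    public static void Split(string extendedRegex, out string positive, out string negative)
    {
        StringBuilder current = new StringBuilder();
        positive = null;
        negative = null;

        for (int i = 0; i < extendedRegex.Length; i++)
        {
            char c = extendedRegex[i];
            if (c == '\\' && i + 1 < extendedRegex.Length)
            {
                char next = extendedRegex[i + 1];
                if (next == Not)
                    current.Append(Not);             // \¬ becomes a literal ¬
                else
                    current.Append(c).Append(next);  // keep ordinary regex escapes intact
                i++;                                 // skip the escaped character
            }
            else if (c == Not && positive == null)
            {
                positive = current.ToString();       // first unescaped ¬ ends the positive half
                current.Length = 0;
            }
            else
            {
                current.Append(c);                   // includes any later ¬, kept as a literal
            }
        }

        if (positive == null)
            positive = current.ToString();           // no separator: plain regex
        else
            negative = current.ToString();
    }

    public static bool IsMatchExtended(string textToTest, string extendedRegex)
    {
        string positive, negative;
        Split(extendedRegex, out positive, out negative);
        return Regex.IsMatch(textToTest, positive)
            && (negative == null || !Regex.IsMatch(textToTest, negative));
    }

    public static void Main()
    {
        string pattern = "on (this|that|these) day(s)?¬(every|all) day(s)?";
        Console.WriteLine(IsMatchExtended("on this day the man said...", pattern));                       // True
        Console.WriteLine(IsMatchExtended("on this day and every day after there will be ...", pattern)); // False
    }
}
```

Walking the string character by character, rather than calling IndexOf, is what lets `\\¬` (an escaped backslash followed by a real separator) and `\¬` (an escaped separator) be told apart.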

Alternative plan

In writing this question I also came across this answer which suggests using something like this:

^(?=(?:(?!negative pattern).)*$).*?positive pattern

So instead of my original plan, I could just advise people to use a pattern like this when they want to NOT match certain text.

Would that do the equivalent of my original plan? I think it's quite an expensive way to do it performance-wise, and since I'm sometimes parsing large HTML documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
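For the 'try both and measure them' part, a rough harness might look like the sketch below. Several things here are assumptions rather than anything from the question: the class and method names are invented, the filler HTML is made up, both halves are wrapped in `(?:...)` so that top-level alternation survives concatenation, and `RegexOptions.Singleline` is passed to both approaches so that `.` can cross line breaks in multi-line HTML. A single Stopwatch reading is very crude (the first call also pays for regex construction), so treat the numbers as indicative at best.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class NotMatchComparison
{
    // Original plan: two separate passes over the text.
    public static bool TwoRegexes(string text, string positive, string negative)
    {
        return Regex.IsMatch(text, positive, RegexOptions.Singleline)
            && !Regex.IsMatch(text, negative, RegexOptions.Singleline);
    }

    // Alternative plan: one pattern of the form
    // ^(?=(?:(?!negative).)*$).*?positive
    public static bool SingleRegex(string text, string positive, string negative)
    {
        string combined = "^(?=(?:(?!(?:" + negative + ")).)*$).*?(?:" + positive + ")";
        return Regex.IsMatch(text, combined, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string positive = "on (this|that|these) day(s)?";
        string negative = "(every|all) day(s)?";

        // Hypothetical input: filler rows followed by the interesting sentence.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++)
            sb.Append("<tr><td>filler row ").Append(i).Append("</td></tr>\n");
        sb.Append("<p>on this day the man said...</p>");
        string text = sb.ToString();

        Stopwatch sw = Stopwatch.StartNew();
        bool two = TwoRegexes(text, positive, negative);
        sw.Stop();
        Console.WriteLine("two regexes:  " + two + " in " + sw.Elapsed.TotalMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        bool one = SingleRegex(text, positive, negative);
        sw.Stop();
        Console.WriteLine("single regex: " + one + " in " + sw.Elapsed.TotalMilliseconds + " ms");
    }
}
```

For a fairer comparison you would repeat each call many times and use documents of the size you actually parse.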

Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.

Why!?

I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.

I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain pattern'.

Heuristic answered 3/5, 2011 at 11:4 Comment(7)
I will say that as a casual user of regular expressions, nothing irritates me more than having to learn another particular flavor, with features that I can't use in other regex engines. (Though admittedly your users aren't me, so perhaps they have other preferences.)Turkoman
Just in case this is ASP.NET, you may want to be careful of DoS attacks. Regexes can be constructed to cause very high server load.Roughcast
Isn't using the ¬ character effectively the same as providing two regular expressions with a separator? You might as well provide any number of regular expressions in the same configuration: ☺pattern1☺pattern2☺pattern3☹negative1☹negative2. Of course, most flavors have strong enough patterns for that, as the answers show.Mediant
@Justin - Keep in mind, the "users" are the administrators. Mostly, they are not considered to be hostile.Mediant
@Mediant - That's fair, although it's certainly possible to do it by accident. If they're not fluent enough in regex to write one themselves, they're unlikely to be able to performance-tune it. Either way, just something to keep in mind. I like to err on the side of caution.Roughcast
By the way, @Heuristic - Is there any reason you can't provide two textboxes instead of making a new regex flavor? E.g. one box labeled "Match this pattern" and one labeled "But not this pattern"?Roughcast
@Justin - I can't easily provide two textboxes because the complexity of the UI and the datastructures where the regexes are stored would mean lots of changes. It's definitely possible but cost/benefit-wise a dirty hack like my proposal is a better solution in this case. Sadly.Heuristic

You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.

You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.

Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.

Also, the complicated regex you found on the other SO question can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds:
?! ?<! ?= ?<=
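As an illustration, the contrived 'on this day' example from the question could be written as one ordinary .NET regex with a negative lookahead, roughly as follows. This is a sketch, not the only way to write it: the class name is invented, the negative part is written as `(every|all) day(s)?`, and `RegexOptions.Singleline` is my choice so that `.` also crosses line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadExample
{
    // ^(?!.*(?:negative))  fails if the negative pattern occurs anywhere in the text;
    // .*(?:positive)       then requires the positive pattern somewhere in the text.
    private static readonly Regex OnThisDay = new Regex(
        @"^(?!.*(?:(every|all) day(s)?)).*(?:on (this|that|these) day(s)?)",
        RegexOptions.Singleline);

    public static bool IsMatch(string text)
    {
        return OnThisDay.IsMatch(text);
    }

    public static void Main()
    {
        Console.WriteLine(IsMatch("on this day the man said..."));                       // True
        Console.WriteLine(IsMatch("on this day and every day after there will be ...")); // False
    }
}
```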


Some examples

Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>

Given the following regexes, these are the results you will see:

  1. tr - match
  2. td - match
  3. ^td - no match
  4. ^tr - no match
  5. ^<tr - match
  6. ^<tr>.*</tr> - no match
  7. ^<tr.*>.*</tr> - match
  8. ^<tr.*>.*</tr>(?<=tr>) - match
  9. ^<tr.*>.*</tr>(?<!tr>) - no match
  10. ^<tr.*>.*</tr>(?<!Albatross) - match
  11. ^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
  12. ^(?!.*Albatross.*)<tr.*>.*</tr> - no match

Explanations

The first two match because the regex can apply anywhere in the sample (or test) string. The next two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr - it starts with a left angle bracket.

The fifth example matches because the test string starts with <tr. The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.

The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<=. It says: match the entire regex only if what immediately precedes the cursor in the test string matches what's in the parens, following the ?<=. In this case, what follows that is tr>. After evaluating ^<tr.*>.*</tr>, the cursor in the test string is positioned at the end of the test string. Therefore, the tr> is matched against the end of the test string, which evaluates to TRUE. Therefore the positive lookbehind evaluates to true, therefore the overall regex matches.

The ninth example shows how to insert a negative lookbehind assertion, using ?<! . Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what follows ?<! in the parens", which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. The pattern tr> does match the end of the string, but this is a negative assertion, so it evaluates to FALSE, which means the 9th example is NOT a match.

The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what's in the parens", in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, the negative lookbehind evaluates to TRUE, which means the 10th example is a match.

The 11th example extends the negative lookbehind to include wildcards; in English the result of the negative lookbehind is "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, the negative lookbehind evaluates to FALSE, and the 11th regex does not match.

The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width - they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which means the overall regex fails to match, which means evaluation of the regex against the test string stops there.

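For this test string, example 12 evaluates to the same boolean value as example 11, but it behaves differently at runtime. In ex 12, the negative check is performed first, and evaluation stops immediately. In ex 11, the full regex is applied, and evaluates to TRUE, before the lookbehind assertion is checked. The two are not interchangeable for every input, though: the lookbehind in ex 11 only inspects text up to the end of the match, while the lookahead in ex 12 scans forward from the start of the string, so they can disagree when Albatross appears only after the closing </tr>. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.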
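If you would rather check these yourself than take my word for it, a small console program along these lines should print the match / no match result for each of the twelve patterns. It is a sketch: the class and member names are invented, and pattern 8 is written with the positive lookbehind `(?<=tr>)` that its explanation describes. It also tries patterns 11 and 12 on a second, hypothetical string in which Albatross appears only after the closing tag, a case where a lookbehind at the end of the match and a lookahead at the start inspect different portions of the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundExamples
{
    public const string Sample = "<tr valign='top'><td>Albatross</td></tr>";

    // The twelve patterns, in the same order as the numbered list.
    public static readonly string[] Patterns =
    {
        "tr",
        "td",
        "^td",
        "^tr",
        "^<tr",
        "^<tr>.*</tr>",
        "^<tr.*>.*</tr>",
        "^<tr.*>.*</tr>(?<=tr>)",
        "^<tr.*>.*</tr>(?<!tr>)",
        "^<tr.*>.*</tr>(?<!Albatross)",
        "^<tr.*>.*</tr>(?<!.*Albatross.*)",
        "^(?!.*Albatross.*)<tr.*>.*</tr>"
    };

    // Runs every pattern against the given text and returns one flag per pattern.
    public static bool[] Results(string text)
    {
        bool[] results = new bool[Patterns.Length];
        for (int i = 0; i < Patterns.Length; i++)
            results[i] = Regex.IsMatch(text, Patterns[i]);
        return results;
    }

    public static void Main()
    {
        bool[] results = Results(Sample);
        for (int i = 0; i < Patterns.Length; i++)
            Console.WriteLine((i + 1) + ". " + Patterns[i] + " - " + (results[i] ? "match" : "no match"));

        // Hypothetical second input: the word appears only after the closing tag.
        string trailing = "<tr valign='top'><td>Penguin</td></tr> Albatross";
        Console.WriteLine("11 on trailing text: " + Regex.IsMatch(trailing, Patterns[10]));
        Console.WriteLine("12 on trailing text: " + Regex.IsMatch(trailing, Patterns[11]));
    }
}
```

Note that pattern 11 relies on variable-length lookbehind, which the .NET engine supports but many other regex flavors do not.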

For more on this stuff, read up at http://www.regular-expressions.info/

Or get a regex evaluator tool and try out some tests.


Spare answered 3/5, 2011 at 12:43 Comment(6)
Thanks for the massive post! When I try the 11th example, which according to your description is what I want, as you say it does not match <tr valign='top'><td>Albatross</td></tr> BUT also it does not match <tr valign='top'><td>Albat XXX ross</td></tr>. I guess this is because the negative lookahead only looks for 'Albatross' after the text that matched, not anywhere within the sample text? If so, how could I change the regex so the description is "only match if the sample text matches ^<tr.*>.*</tr> and the entire sample text does not include the word Albatross"Heuristic
My earlier comment was before I noticed example 12. This seems to do what I want. I think 12 gives different results to example 11 if 'Albatross' is found within the pattern that's not part of the lookbehind.Heuristic
This is great. I'm so glad I proposed such a poxy solution to elicit these great answers - many thanks!Heuristic
Glad to help, Rory. Regarding your comment for ex11, contrary to what you report, I get a positive match with the regex from 11, using test string <tr valign='top'><td>Albat XXX ross</td></tr> . This is what I would expect, since the regex for 11 has a negative lookbehind that asserts that "Albatross" is NOT in the preceding text, anywhere. This assertion is true, therefore the regex matches. Sounds like you're reporting something different. I'd check your code on that. You mentioned a "negative lookahead", but ex11 doesn't have that. It has a negative lookbehind, and ... (see next)Spare
...to answer your question, NO, the negative lookbehind checks backwards, even through the matched (and possibly captured) text. So, something else is askew in your test. The regexes for ex11 and ex12 will give the same results for test strings like these, where nothing follows the closing </tr>.Spare
Ah yes, lookbehind indeed. I was using gskinner.com/RegExr (and not thinking for myself) which reports lookbehinds as lookaheads :-/ Using RegExr it behaves as I said - different to your results - but using .net (e.g. tinyurl.com/3gcocd) it behaves as per your results, i.e. pattern ^<tr.*>.*</tr>(?<!.*Albatross.*) does match text <tr valign='top'><td>Albat XXX ross</td></tr>. Either I'm really fat-fingered on RegExr or it's a difference with Flash's regular expression engine or RegExr's use of it.Heuristic

You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string that contains "cat" AND "lion" AND "tiger", but does NOT contain "dog" OR "wolf" OR "hyena":

if (Regex.IsMatch(text, @"
    # Match string containing all of one set of words but none of another.
    ^                # anchor to start of string.
    # Positive look ahead assertions for required substrings.
    (?=.*?  cat   )  # Assert string has: 'cat'.
    (?=.*?  lion  )  # Assert string has: 'lion'.
    (?=.*?  tiger )  # Assert string has: 'tiger'.
    # Negative look ahead assertions for not-allowed substrings.
    (?!.*?  dog   )  # Assert string does not have: 'dog'.
    (?!.*?  wolf  )  # Assert string does not have: 'wolf'.
    (?!.*?  hyena )  # Assert string does not have: 'hyena'.
    ",
    RegexOptions.Singleline | RegexOptions.IgnoreCase |
    RegexOptions.IgnorePatternWhitespace)) {
    // Successful match
} else {
    // Match attempt failed
} 

You can see the needed pattern. When assembling the regex, be sure to run each of the user-provided substrings through the Regex.Escape() method to escape any metacharacters they may contain (e.g. (, ), | etc). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode, otherwise any unescaped whitespace within the user substrings would be ignored.

You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.

Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:

(?!.*?(?:dog|wolf|hyena))
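Putting those pieces together, assembling the pattern from two lists of literal substrings might look something like this sketch. The `WordSetRegex` class and `Build` method are made-up names; it uses Regex.Escape on every substring, the combined negative alternation shown above, and no free-spacing mode. The optional \b word boundaries are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class WordSetRegex
{
    // Builds a pattern of the shape
    //   ^(?=.*?cat)(?=.*?lion)(?!.*?(?:dog|wolf))
    // from literal (non-regex) substrings.
    public static Regex Build(IEnumerable<string> required, IEnumerable<string> forbidden)
    {
        StringBuilder pattern = new StringBuilder("^");

        // One positive lookahead per required substring.
        foreach (string word in required)
            pattern.Append("(?=.*?").Append(Regex.Escape(word)).Append(")");

        // A single negative lookahead covering all forbidden substrings.
        List<string> escaped = new List<string>();
        foreach (string word in forbidden)
            escaped.Add(Regex.Escape(word));

        if (escaped.Count > 0)
            pattern.Append("(?!.*?(?:").Append(string.Join("|", escaped.ToArray())).Append("))");

        return new Regex(pattern.ToString(), RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex regex = Build(new[] { "cat", "lion", "tiger" }, new[] { "dog", "wolf", "hyena" });
        Console.WriteLine(regex.IsMatch("A cat, a lion and a tiger."));          // True
        Console.WriteLine(regex.IsMatch("A cat, a lion, a tiger and a wolf."));  // False
        Console.WriteLine(regex.IsMatch("A cat and a lion."));                   // False
    }
}
```

If the administrators supply regex fragments rather than literal text, the Regex.Escape calls would have to go, and each fragment would need wrapping in (?:...) instead.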

Vaccinia answered 3/5, 2011 at 15:43 Comment(2)
This is great. I'm so glad I proposed such a poxy solution to elicit these great answers - many thanks! During my quick testing with gskinner.com/RegExr I needed to add something to the pattern to match at least one character, e.g. putting . at the end of the pattern. I haven't yet checked if this is needed with .net regex.Heuristic
@Heuristic - RegExr has issues displaying zero-width patterns (it shows them in red). You don't generally need that dot on any flavor; you will get a positive match either way.Mediant
