Help with regex include and exclude

I would like some help with regex.

I'm trying to create an expression that will include certain strings and exclude certain strings.

For example:

I would like to include any URL containing mobility, such as http://www.something.com/mobility/

However, I would like to exclude any URL containing store, such as http://www.something.com/store/mobility/

FYI, I have many keywords that I'm using to include. Currently I am including them like this: /mobility|enterprise|products/i. However, I can't find a way to exclude links that contain other keywords.

Thank you in advance for any help and insight you can provide.


Te answered 15/3, 2011 at 15:21 Comment(1)
You must specify the language you are using regexes from. - Romans

It's possible to do all this in one regex, but you don't really need to. I think you'll have a better time if you run two separate tests: one for your include rules and one for your exclude rules. Not sure what language you're using, so I'll use JavaScript for the example:

function validate(str) {
    var required = /\b(mobility|enterprise|products)\b/i;
    var blocked = /\b(store|foo|bar)\b/i;

    return required.test(str) && !blocked.test(str);
}

If you really want to do it in one pattern, try something like this:

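/^(?=.*\b(mobility|enterprise|products)\b)(?!.*\b(store|foo|bar)\b)(.+)/i

The ^ anchors both lookaheads to the start of the string. Without it, the engine can retry the match from a later position, past a blocked word, and still succeed. The i at the end means case-insensitive, so use your language's equivalent if you're not using JavaScript.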
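
As a quick check of the single-pattern approach, here is a minimal sketch that runs a ^-anchored form of the pattern against the two URLs from the question plus one made-up URL. The helper name matchesFilter and the third URL are illustrative assumptions, not part of the original question.

```javascript
// Single-pattern filter, anchored at the start of the string so the
// lookaheads are evaluated only once, from position 0.
var filter = /^(?=.*\b(?:mobility|enterprise|products)\b)(?!.*\b(?:store|foo|bar)\b).+/i;

// Hypothetical helper: true when the URL contains an include keyword
// and none of the exclude keywords.
function matchesFilter(url) {
    return filter.test(url);
}

console.log(matchesFilter('http://www.something.com/mobility/'));       // true
console.log(matchesFilter('http://www.something.com/store/mobility/')); // false
console.log(matchesFilter('http://www.something.com/about/'));          // false
```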

All that being said, based on your description of the problem, I think what you REALLY want for this is string manipulation. Here's an example, again using JS:

function validate(str) {
    var required = ['mobility','enterprise','products'];
    var blocked = ['store','foo','bar'];
    var lowercaseStr = str.toLowerCase(); //or just use str if you want case sensitivity

    // Reject the string if it contains any blocked keyword
    for (var j = 0; j < blocked.length; j++) {
        if (lowercaseStr.indexOf(blocked[j]) !== -1) {
            return false;
        }
    }

    // Accept the string if it contains at least one required keyword
    for (var i = 0; i < required.length; i++) {
        if (lowercaseStr.indexOf(required[i]) !== -1) {
            return true;
        }
    }

    // No required keyword was found
    return false;
}
Nigrify answered 15/3, 2011 at 15:33 Comment(2)
Thank you for the assistance, but I actually need to use this in Google Analytics to create a filter, which doesn't give me a language for string manipulation, at least none that I can get to. - Te
Nice. The most recent version of your single expression seems to be doing the trick. Thank you very much for your help. - Te

To match a string that must contain a word from a set of words, you can use a positive lookahead:

^(?=.*(?:inc1|inc2|...))

To reject a string that contains a word from a list of stop words, you can use a negative lookahead:

^(?!.*(?:ex1|ex2|...))

You can combine the two requirements in a single regex:

^(?=.*(?:inc1|inc2|...))(?!.*(?:ex1|ex2|...))REGEX_TO_MATCH_URL$
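
If the keyword lists are long, the combined pattern can be assembled from arrays instead of typed by hand. The sketch below is one way to do that in JavaScript. The helper names escapeRegex and buildFilter are hypothetical, .* stands in for REGEX_TO_MATCH_URL, and the i flag is an assumption carried over from the question.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal keyword.
function escapeRegex(word) {
    return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build ^(?=.*(?:inc...))(?!.*(?:ex...)).*$ from two lists.
// Both lists are assumed to be non-empty.
function buildFilter(includes, excludes) {
    var inc = includes.map(escapeRegex).join('|');
    var exc = excludes.map(escapeRegex).join('|');
    return new RegExp('^(?=.*(?:' + inc + '))(?!.*(?:' + exc + ')).*$', 'i');
}

var filter = buildFilter(['mobility', 'enterprise', 'products'], ['store']);

console.log(filter.source); // the generated pattern, ready to paste elsewhere
console.log(filter.test('http://www.something.com/mobility/'));       // true
console.log(filter.test('http://www.something.com/store/mobility/')); // false
```

Printing filter.source is handy when the final pattern has to be pasted into a tool that only accepts a regex string, provided that tool's regex engine supports lookaheads.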

Syndesis answered 15/3, 2011 at 15:29 Comment(0)

Make two regexes, one for good and one for bad, and check both (first the bad, then the good). You can do it with a single regex, but KISS is always a good rule (http://en.wikipedia.org/wiki/KISS_principle).

I'll add that you need to consider the "ass" principle: .*ass matches ambassador and cassette, so you'll probably want to have a separator ([./\\]) before and after each word. See "Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?"
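
To make the separator idea concrete, here is a rough JavaScript sketch comparing a bare keyword match with one that requires a separator (or the start or end of the string) on both sides. The helper name segmentPattern and the /restore/ URL are made up for illustration, and the keywords are assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: match a keyword only when it is delimited by
// ".", "/", "\" or the start/end of the string on both sides.
function segmentPattern(words) {
    var alternatives = words.join('|');
    return new RegExp('(?:^|[./\\\\])(?:' + alternatives + ')(?=[./\\\\]|$)', 'i');
}

var loose = /store/i;                    // bare substring match
var strict = segmentPattern(['store']);  // keyword must be a whole segment

console.log(loose.test('http://www.something.com/restore/mobility/'));  // true (false positive)
console.log(strict.test('http://www.something.com/restore/mobility/')); // false
console.log(strict.test('http://www.something.com/store/mobility/'));   // true
```

The trailing separator is checked with a lookahead rather than consumed, so two keywords that share a single separator between them can both still be found.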

Romans answered 15/3, 2011 at 15:29 Comment(0)
