Exclude characters from a character class

5

30

Is there a simple way to match all characters in a class except a certain set of them? For example, in a language where I can use \w to match the set of all Unicode word characters, is there a way to exclude just one character, such as the underscore "_", from that match?

The only idea that came to mind was to use a negative lookahead/lookbehind around each character, but that seems more complex than necessary when I effectively just want to match a character against a positive match AND a negative match. For example, if & were an AND operator, I could do this...

^(\w&[^_])+$
Limoges answered 26/6, 2013 at 18:28 Comment(6)
Which flavor of regex are you using? (e.g. Perl, Java, etc.)Tolentino
What regex flavor/language? https://mcmap.net/q/357816/-character-class-subtraction-converting-from-java-syntax-to-regexbuddy/139010Borak
In .NET you could use [\w-[_]] to exclude the underscore.Tara
The regex engine I use most frequently is Java-based, though an old implementation (whatever CF8 uses under the hood). However, I also have this need in JavaScript and Python.Limoges
You mean ColdFusion? That runs on Java, but its \w only recognizes the ASCII word characters ([A-Za-z0-9_]), not the full Unicode set. Same goes for Python's built-in re flavor.Sufi
Perl solutions are found here.Alceste
32

It really depends on your regex flavor.

.NET

... provides only one simple character class set operation: subtraction. This is enough for your example, so you can simply use

[\w-[_]]

If a - is followed by a nested character class, it's subtracted. Simple as that...

Java

... provides a much richer set of character class set operations. In particular you can get the intersection of two sets like [[abc]&&[cde]] (which would give c in this case). Intersection and negation together give you subtraction:

[\w&&[^_]]
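
To see why intersection plus negation amounts to subtraction, here is a minimal sketch that models character classes as plain Python sets. It does not run Java's regex engine; the variable names and the small ASCII "universe" are assumptions made purely for illustration.

```python
import string

# Model character classes as plain Python sets of characters.
abc = set('abc')
cde = set('cde')

# Intersection, in the spirit of [[abc]&&[cde]].
common = abc & cde
print(common)

# Subtraction via intersection with a negated class, in the spirit of [\w&&[^_]].
# A small ASCII universe stands in for "all characters" here (an assumption).
universe = set(string.printable)
word = set(string.ascii_letters + string.digits + '_')  # stand-in for ASCII \w
not_underscore = universe - {'_'}                       # stand-in for [^_]

word_no_underscore = word & not_underscore
print('_' in word_no_underscore)
print(word_no_underscore == word - {'_'})
```

The last line checks that intersecting with the complement of a set gives the same result as subtracting that set directly.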

Perl

... supports set operations on extended character classes as an experimental feature (available since Perl 5.18). In particular, you can directly subtract arbitrary character classes:

(?[ \w - [_] ])

All other flavors

... (that support lookaheads) allow you to mimic the subtraction by using a negative lookahead:

(?!_)\w

This first checks that the next character is not a _ and then matches any \w (which can't be _ due to the negative lookahead).

Note that each of these approaches is completely general, in that you can subtract one arbitrarily complex character class from another.
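
As a rough illustration of the lookahead approach, here is a minimal sketch using Python's built-in re module. The helper name is_all and the sample strings are made up for the example; the second pattern only shows that the same trick carries over to another pair of classes.

```python
import re

# \w minus the underscore, emulated with a negative lookahead.
WORD_NO_UNDERSCORE = re.compile(r'(?:(?!_)\w)+')

# The same trick with a different pair of classes:
# lowercase ASCII letters minus the vowels.
CONSONANTS = re.compile(r'(?:(?![aeiou])[a-z])+')


def is_all(pattern, text):
    """Return True if the whole string consists of characters from the class."""
    return pattern.fullmatch(text) is not None


print(is_all(WORD_NO_UNDERSCORE, 'abc123'))   # True
print(is_all(WORD_NO_UNDERSCORE, 'abc_123'))  # False
print(is_all(CONSONANTS, 'rhythm'))           # True
print(is_all(CONSONANTS, 'rhyme'))            # False
```

The lookahead is evaluated before every single character, which is why it sits inside the repeated group rather than in front of it.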

Harappa answered 26/6, 2013 at 18:48 Comment(0)
14

You can negate the \w class (giving \W) and exclude it, together with the underscore, in a negated character class:

^([^\W_]+)$
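
A quick way to try this out is a small sketch with Python's built-in re module, assuming Python 3, where \w and \W are Unicode-aware for str patterns by default. The sample strings are invented for the example.

```python
import re

# [^\W_] reads as "not (a non-word character or an underscore)",
# i.e. a word character other than the underscore.
pattern = re.compile(r'^[^\W_]+$')

for candidate in ['abc123', 'abc_123', 'naïve', 'two words']:
    print(repr(candidate), bool(pattern.match(candidate)))
```

Only 'abc123' and 'naïve' should be accepted: the second string contains an underscore and the fourth contains a space, which is matched by \W.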
Picardi answered 26/6, 2013 at 18:38 Comment(9)
Creative, but I don't think the OP expected this kind of answer, he wants to exclude a character in a general case. Nice idea thoughTara
@CasimiretHippolyte I should have thought of this. HamZa is right that I was looking for a more general case, but woah... \p... thank you for pointing that out as I have never used it.Limoges
@CasimiretHippolyte not all cases. This cannot be used to exclude a character from a range ;).Harappa
Not all RE engines support that.Scholl
@DonalFellows what do you mean by "that"? Negated character classes?Harappa
This works great, but only with a single class except some characters (e.g. \w without _), not with multiple classes except some characters (e.g. \w and \p{P} without _).Nickynico
@caw: your example is out of the scope of the question, and except for regex flavors that allow operations inside character classes (intersections, subtractions), I doubt there's a miraculous solution (short of building it by hand with ranges). However, for your particular example, you can do that with PCRE in Unicode mode: [[:alnum:]\pP] or [\p{Xan}\pP]. In other words, you have to find the best solution for each case with the predefined classes available.Picardi
@CasimiretHippolyte This was not criticism of your answer. On the contrary, I upvoted it and agree that it’s the perfect answer for this specific question. My comment was just intended as advice for people with adjacent problems.Nickynico
@caw: sorry if my answer looks rude, I am not ~totally~fluent~ in english. Your comments are welcome and the critics too. Thanks for "the perfect answer", other answers are useful too.Picardi
11

A negative lookahead is the correct way to go insofar as I understand your question:

^((?!_)\w)+$
Flange answered 26/6, 2013 at 18:30 Comment(0)
8

This can be done in Python with the regex module. Something like:

import regex as re
pattern = re.compile(r'[\W_--[ ]]+')
cleanString = pattern.sub('', rawString)

You'd typically install the regex module with pip:

pip install regex

EDIT:

The regex module has two behaviours, version 0 and version 1. Set subtraction (as above) is a version 1 behaviour. The PyPI docs claim version 1 is the default behaviour, but you may find this is not the case. You can check with

import regex
if regex.DEFAULT_VERSION == regex.VERSION1:
  print("version 1")

To set it to version 1:

regex.DEFAULT_VERSION = regex.VERSION1

or to use version 1 in a single expression:

pattern = re.compile(r'(?V1)[\W_--[ ]]+')
Skirret answered 18/8, 2016 at 18:4 Comment(1)
Lifesaver on the VERSION1 bit. I would have gone crazy otherwise.Aestivate
6

Try using subtraction (written as an intersection with a negated class):

[\w&&[^_]]+

Note: This will work in Java, but might not in some other regex engines.

Hutton answered 26/6, 2013 at 18:30 Comment(0)