RegEx for matching all chars except some special chars and ":)"

I'm trying to remove all symbols from a string except for #, @, :) and :(. Example:

this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)

should result in (after removing the matched results):

this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)

I tried:

(?! |#|@|:\)|:\()\W

It works, but in the case of :) and :( the : is still being matched. I know it matches because the regex checks every character together with the previous ones, e.g. for :) it matches only :, but for :)) it matches :).

Brunner answered 11/5, 2019 at 15:18 Comment(3)
Can you provide an example string from which you want to remove/to keep certain characters? - Kelseykelsi
You could just extract those sequences instead of selecting everything else. - Atherosclerosis
You do not actually need to use lookarounds in case you know exactly your exceptions. Use capturing mechanism, see this answer showing how. - Privatdocent

This is a tricky question, because you want to remove all symbols except for a certain whitelist. In addition, some of the symbols on the whitelist actually consist of two characters:

:)
:(

To handle this, we can first spare both colon : and parentheses, then selectively remove either one should it not be part of a smiley or frown face:

import re

text = "this is, a (placeholder text). I wanna remove symbols like: ! and ? but keep @ & # & :)"
output = re.sub(r'[^\w\s:()@&#]|:(?![()])|(?<!:)[()]', '', text)
print(output)

this is a placeholder text I wanna remove symbols like  and  but keep @ & # & :)

The regex character class I used was:

[^\w\s:()@&#]

This matches any character that is neither a word character nor whitespace, while sparing the whitelisted symbols (:, (, ), @, & and #) from the replacement. The other two parts of the alternation then override this logic by removing a colon or parenthesis that is not part of a smiley or frown face.
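To make the three parts of the alternation easier to see, here is a minimal sketch that builds the same pattern from named pieces. The names NOT_ALLOWED, LONE_COLON, LONE_PAREN and strip_symbols are hypothetical and only there for illustration:

```python
import re

# The three alternatives of the pattern above, named for readability.
NOT_ALLOWED = r'[^\w\s:()@&#]'  # any character outside the spared set
LONE_COLON = r':(?![()])'       # a colon NOT followed by ( or )
LONE_PAREN = r'(?<!:)[()]'      # a parenthesis NOT preceded by a colon

PATTERN = re.compile('|'.join([NOT_ALLOWED, LONE_COLON, LONE_PAREN]))

def strip_symbols(text):
    """Remove everything the combined pattern matches."""
    return PATTERN.sub('', text)

print(strip_symbols("a: b (c) :) :( ok!"))  # a b c :) :( ok
print(strip_symbols(":))"))                 # :)
```

The lookahead and lookbehind only inspect the neighbouring character, so a colon or parenthesis survives only when it sits directly next to its partner.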

Birchard answered 11/5, 2019 at 15:37 Comment(0)

As others have shown, it is possible to write a regex that will succeed the way you have framed the problem. But this is a case where it's much simpler to write a regex to match what you want to keep. Then just join those parts together.

import re

rgx = re.compile(r'\w|\s|@|&|#|:\)|:\(')
orig = 'Blah!! Blah.... ### .... #@:):):) @@ Blah! Blah??? :):)#'
new = ''.join(rgx.findall(orig))
print(new)
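Applied to the sentence from the question, the same keep-and-join idea might look like the sketch below. The function name keep_only is hypothetical, and & is left out of the keep-list on the assumption that the question's expected output (which drops &) is the goal:

```python
import re

# Keep word characters, whitespace, @, #, and the two-character faces.
KEEP = re.compile(r'[\w\s@#]|:[()]')

def keep_only(text):
    """Join every kept piece; anything the pattern skips is dropped."""
    return ''.join(KEEP.findall(text))

sample = "this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)"
print(keep_only(sample))
# this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)
```

Because the pattern has no capturing group, findall returns the full matches, so the join reconstructs the string without the skipped characters.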
Shoelace answered 11/5, 2019 at 15:53 Comment(0)

You can try the following regex (for Python).

(\w|:\)|:\(|#|@| )

With this fake sentence:

"I want to remove certain characters but want to keep certain ones like #random, and :) and :( and something like @.

If it is found in another sentence, :), do search it :( "

It matches all the characters you mentioned in the question, along with word characters and spaces. You can use it to find the parts of the string you want to keep and then write rules to carefully remove the other punctuation from the string.
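One way to use this pattern, sketched under the assumption that the matched pieces are simply joined back together, is shown below. The variable names and the shortened sentence are made up for illustration. Note that the pattern lists a literal space rather than \s, so line breaks are dropped as well:

```python
import re

keep = re.compile(r'(\w|:\)|:\(|#|@| )')

# Shortened, hypothetical variant of the fake sentence, with one line break.
sentence = "like #random, and :) and :( and something like @.\nIf found, :), do search it :( "

# With a single capturing group, findall returns the text of that group.
kept = ''.join(keep.findall(sentence))
print(repr(kept))
```

If line breaks should survive, replacing the literal space in the pattern with \s would be one option.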

Kelseykelsi answered 11/5, 2019 at 15:27 Comment(0)

You may also use a simpler approach: match and capture what you want to keep, match without capturing what you want to remove, and then use a backreference to the capture group value as the replacement:

re.sub(r'([#@\s]|:[)(])|\W', r'\1', s)
#        ^---Group 1--^->->->->^^         

See the regex demo. Here, ([#@\s]|:[)(]) matches and captures into Group 1 a #, a @, a whitespace char, or a :) or :( substring, and \W matches any non-word char without capturing it.

See Python demo:

import re
s="this is, a placeholder text. I wanna remove symbols like ! and ? but keep @ & # & :)"
print(re.sub(r'([#@\s]|:[)(])|\W', r'\1', s))
# => this is a placeholder text I wanna remove symbols like  and  but keep @  #  :)

In Python versions before 3.5, use a lambda expression as the replacement argument (due to a bug: a backreference to a group that did not participate in the match raises an error there):

re.sub(r'([#@\s]|:[)(])|\W', lambda x: x.group(1) if x.group(1) else '', s)
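The same replacement can be written as a named function, which makes it easier to see what happens for each match. This is a sketch only; the name keep_group_one and the sample string are hypothetical:

```python
import re

pattern = re.compile(r'([#@\s]|:[)(])|\W')

def keep_group_one(match):
    # group(1) is None when the \W alternative matched,
    # i.e. when the character is one that should be removed.
    return match.group(1) or ''

s = "hi! @you :) (ok?) #tag"
print(pattern.sub(keep_group_one, s))  # hi @you :) ok #tag
```

Word characters are never matched by either alternative, so they are left untouched; only the non-word characters go through the replacement.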
Cobham answered 11/5, 2019 at 18:42 Comment(2)
so r'\1' chooses group 1? - Brunner
@MaStErNeWbIe \1 string in the replacement pattern replaces the whole match with the contents of Group 1. - Privatdocent
