Regex to remove special/invisible characters

The problem is to remove some strange characters from a domain name while keeping special Unicode characters such as accented letters (German, Danish or Polish). For example: radis­son-blu.es. You can't see it, but there is an additional character between the two s's. (Try copying it into Notepad to see it.)

I've seen many posts about similar problems, but each solution either fails to remove that special character or removes it along with other special characters I need to keep.

Rhody answered 16/7, 2012 at 13:48 Comment(0)
3

Replace every match of the regex [^\w\s.,!@#$%^&*()=+~`-] with an empty string.
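A minimal sketch of how this pattern might be applied in C#. The names `DomainCleaner` and `Clean` are hypothetical, chosen only for illustration. It relies on .NET's `\w` matching Unicode letters, so accented characters should survive, while the soft hyphen is not a word character and is removed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DomainCleaner
{
    // Matches anything that is NOT a word character, whitespace,
    // or one of the explicitly listed punctuation marks.
    private static readonly Regex Unwanted =
        new Regex(@"[^\w\s.,!@#$%^&*()=+~`-]");

    // Hypothetical helper: returns the input with all unwanted characters removed.
    public static string Clean(string input)
    {
        return Unwanted.Replace(input, "");
    }
}
```

Calling `DomainCleaner.Clean("radis\u00ADson-blu.es")` should give `radisson-blu.es`, and a name such as `müller.de` should pass through unchanged.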

Overfly answered 16/7, 2012 at 13:51 Comment(2)
You edited it after my comment; I wasn't wrong to begin with. - Interdental
I checked the modified version; it seems to work as I wanted. Thank you very much. - Rhody
2

The character you're (not) seeing there is U+00AD Soft Hyphen. You can reference it in a regular expression using \u00ad, e.g.:

str = Regex.Replace(str, @"\u00ad", "");

For a single-character replacement, though, you could just as well use string.Replace.
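As a rough illustration of the `string.Replace` alternative, the sketch below strips U+00AD without a regex. The class and method names (`SoftHyphen`, `Strip`) are made up for this example.

```csharp
using System;

public static class SoftHyphen
{
    // Hypothetical helper: removes every U+00AD (soft hyphen) from the input.
    public static string Strip(string input)
    {
        // Strings are immutable, so Replace returns a new string;
        // an empty replacement deletes each occurrence.
        return input.Replace("\u00AD", "");
    }
}
```

This only handles the one character, so it does not help if other invisible characters can appear.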

Interdental answered 16/7, 2012 at 13:51 Comment(1)
I know, but the point is to deal not just with this one character but with the whole class of such characters. - Rhody
0

'\xAD' is a soft hyphen (the codepoint's name is "SOFT HYPHEN").

According to the Unicode codepoint database, its category is "Cf" (or "Format"), so it can be matched with the regex @"\p{Cf}".

Strangely, Microsoft Visual C# 2010 Express says that it doesn't match @"\p{Cf}", but instead matches @"\p{Pd}" ("Dash Punctuation"), the same category as the normal hyphen.
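Since the category reported for U+00AD seems to depend on the runtime, as the observation above suggests, one cautious option is to list the soft hyphen explicitly alongside `\p{Cf}`. This is only a sketch; `InvisibleChars`, `Strip` and `CategoryOfSoftHyphen` are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvisibleChars
{
    // \p{Cf} covers "Format" characters; \u00AD is added explicitly in case
    // the runtime classifies the soft hyphen differently (e.g. as Pd).
    private static readonly Regex FormatChars = new Regex(@"[\p{Cf}\u00AD]");

    // Hypothetical helper: removes format characters and soft hyphens.
    public static string Strip(string input)
    {
        return FormatChars.Replace(input, "");
    }

    // Reports which category this runtime assigns to U+00AD.
    public static UnicodeCategory CategoryOfSoftHyphen()
    {
        return char.GetUnicodeCategory('\u00AD');
    }
}
```

Printing `InvisibleChars.CategoryOfSoftHyphen()` on the target runtime shows whether it reports `Format` or `DashPunctuation`; the ordinary hyphen (U+002D) is left alone either way because it is not in the character class.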

Recall answered 16/7, 2012 at 19:46 Comment(0)
0

Here is a much simpler version: [^\x00-\x7F]

Test it on Regex101: https://regex101.com/r/jHVEb5/1
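One caveat worth checking before using this pattern: it matches every non-ASCII character, so accented letters are removed along with the soft hyphen. The sketch below, with the hypothetical names `AsciiOnly` and `Strip`, makes that visible.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiOnly
{
    // Matches any character outside the 7-bit ASCII range.
    private static readonly Regex NonAscii = new Regex(@"[^\x00-\x7F]");

    // Hypothetical helper: removes every non-ASCII character from the input.
    public static string Strip(string input)
    {
        return NonAscii.Replace(input, "");
    }
}
```

`AsciiOnly.Strip("radis\u00ADson-blu.es")` yields `radisson-blu.es`, but `AsciiOnly.Strip("müller.de")` yields `mller.de`, which conflicts with the question's requirement to keep accented letters.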

Coaming answered 23/5 at 1:31 Comment(0)
-2

This works for me:

[\x00-\x1f]|[\x81\x8d\x8f\x90\x9d\xa0\u2060\uFEFF]
Bainite answered 10/3, 2017 at 15:31 Comment(0)
