Foreign language characters in Regular expression in C#

Asked 26/1, 2015 at 18:54 Answered 14/6, 2019 at 18:55

Solved c# regex non-english

In C# code, I am trying to validate a string that contains Chinese characters: "中文ABC123".

When I use a general alphanumeric pattern, "^[a-zA-Z0-9\s]+$", the string "中文ABC123" does not match and the regex validation fails.

What do I need to add to the expression in C#?

Joselow answered 26/1, 2015 at 18:54 Comment(0)

To match any letter character from any language use:

\p{L}

If you also want to match numbers:

[\p{L}\p{Nd}]+

\p{L} ... matches a character in the Unicode category "Letter".
          It is shorthand for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]:
            \p{Ll} ... matches lowercase letters. (abc)
            \p{Lu} ... matches uppercase letters. (ABC)
            \p{Lt} ... matches titlecase letters.
            \p{Lm} ... matches modifier letters.
            \p{Lo} ... matches letters without case. (中文)

\p{Nd} ... matches a character in the Unicode category "Decimal Digit".

Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
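
As a quick check of the replacement, here is a minimal sketch using System.Text.RegularExpressions. The class and field names (UnicodeValidation, AsciiOnly, AnyLetter) are made up for illustration; only the two patterns come from the question and the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeValidation
{
    // Pattern from the question: ASCII letters, ASCII digits, whitespace.
    public static readonly Regex AsciiOnly = new Regex(@"^[a-zA-Z0-9\s]+$");

    // Replacement pattern: any Unicode letter, ASCII digits, whitespace.
    public static readonly Regex AnyLetter = new Regex(@"^[\p{L}0-9\s]+$");

    public static void Main()
    {
        const string input = "中文ABC123";

        Console.WriteLine(AsciiOnly.IsMatch(input)); // False
        Console.WriteLine(AnyLetter.IsMatch(input)); // True
    }
}
```

Note that 0-9 only covers ASCII digits. If digits from other scripts should pass as well, \p{Nd} can take its place inside the character class.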

Exurbanite answered 26/1, 2015 at 18:55 Comment(6)

Or, if punctuation (such as the underscore) is OK, the simpler \w (word character) can be used instead of [\p{L}0-9]. – Treasonable 26/1, 2015 at 19:33

By the way Andie2302, this conflicts badly with the HTML5 pattern attribute: I tried this expression in an HTML5 pattern attribute and it failed to validate. Do you have any idea how to make it work with the HTML5 pattern attribute for all languages? – Joselow 26/1, 2015 at 20:57

@Joselow JavaScript (and hence html5 input patterns) doesn't support \p, and treats \w as "latin word character", so it's trickier there: https://mcmap.net/q/24959/-regular-expression-to-match-non-ascii-characters – Treasonable 26/1, 2015 at 21:17

Besides Chinese and Japanese characters, what other languages might \p{Lo} capture? – Akins 18/10, 2017 at 15:6

@Treasonable a bit more info on \w in .NET: https://mcmap.net/q/233968/-net-regex-what-is-the-word-character-w (note that \w does not work for all languages if using ECMAScript-compliant behavior) – Roz 19/5, 2019 at 16:26

String: IŠMIN-AS-AK-AŠ/20 Pattern: "/IŠMIN-AS-AK-\p{L}{2,}/" Result: ^ b"IÅ MIN-AS-AK-AÅ" How solve this? – Slaughter 17/8, 2022 at 5:59

Thanks to @Andie2302 for pointing to the right way to do it.

In addition, many languages have combining characters that need a base character to attach to. For example, take the Thai word 'เก็บ': if you use only \p{L}, it matches only 'เกบ', so one of the symbols goes missing from the word.

That is why \p{L} on its own will not work for every language.

So, to support almost every language, you need to allow both categories, combined in one character class:

[\p{L}\p{M}]

NOTE:

L stands for 'Letter' (all letters from all languages, not including marks)

M stands for 'Mark' (a mark cannot be displayed on its own; it needs a letter to attach to)
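
To see the difference on the Thai word above, here is a small sketch. It assumes the two categories are combined in one character class, [\p{L}\p{M}]; MarkDemo and KeepMatches are hypothetical names used only for this example. The word is written with \u escapes so the code points are unambiguous.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MarkDemo
{
    // The Thai word 'เก็บ': three letters (category Lo) and one
    // combining mark, U+0E47 (category Mn), in third position.
    public const string Word = "\u0E40\u0E01\u0E47\u0E1A";

    // Hypothetical helper: concatenates every match of the pattern in the input.
    public static string KeepMatches(string input, string pattern) =>
        string.Concat(Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        // \p{L} alone skips the mark, so only 3 of the 4 characters survive.
        Console.WriteLine(KeepMatches(Word, @"\p{L}").Length);        // 3
        Console.WriteLine(KeepMatches(Word, @"[\p{L}\p{M}]").Length); // 4

        // The same effect shows up when validating the whole string.
        Console.WriteLine(Regex.IsMatch(Word, @"^\p{L}+$"));          // False
        Console.WriteLine(Regex.IsMatch(Word, @"^[\p{L}\p{M}]+$"));   // True
    }
}
```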

If you also need numbers, add:

\p{N}

NOTE:

N stands for 'Number' (any kind of numeric character in any script)
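
Putting the three categories together, a validation pattern could look like the sketch below. The combined pattern ^[\p{L}\p{M}\p{N}\s]+$ and the names NumberDemo and Text are my own assumptions, not something taken from the answers above; adjust the class to whatever you actually want to allow. The last lines show how \p{N} is broader than \p{Nd} and 0-9.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberDemo
{
    // Letters, combining marks, numbers of any kind, and whitespace.
    public static readonly Regex Text = new Regex(@"^[\p{L}\p{M}\p{N}\s]+$");

    public static void Main()
    {
        Console.WriteLine(Text.IsMatch("中文ABC123"));                   // True
        Console.WriteLine(Text.IsMatch("\u0E40\u0E01\u0E47\u0E1A 123")); // True (Thai word, space, digits)
        Console.WriteLine(Text.IsMatch("ABC-123"));                      // False ('-' is punctuation)

        // U+2163 ROMAN NUMERAL FOUR is a letter number (Nl), not a decimal digit.
        Console.WriteLine(Regex.IsMatch("\u2163", @"^\p{Nd}$")); // False
        Console.WriteLine(Regex.IsMatch("\u2163", @"^\p{N}$"));  // True

        // U+0663 ARABIC-INDIC DIGIT THREE is a decimal digit (Nd), but not in 0-9.
        Console.WriteLine(Regex.IsMatch("\u0663", @"^\p{Nd}$")); // True
        Console.WriteLine(Regex.IsMatch("\u0663", @"^[0-9]$"));  // False
    }
}
```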

Thanks to this website for very useful information

https://www.regular-expressions.info/unicode.html

Solingen answered 14/6, 2019 at 18:55 Comment(0)
