How can I create an alphanumeric Regex for all languages?

Asked 14/7, 2011 at 11:38 Answered 31/8, 2021 at 21:28

Solved regex unicode language-agnostic non-english

I had this problem today:

This regex matches only English (ASCII) letters and digits: [a-zA-Z0-9].

If I need support for any language in this world, what regex should I write?

Reaction answered 14/7, 2011 at 11:38 Comment(0)

If you use character class shorthands and a Unicode-aware regex engine, you can do that. The \w class matches "word characters" (letters, digits, and underscores).

Beware that some regex flavors do not handle this well: JavaScript uses ASCII for \d (digits) and \w, but Unicode for \s (whitespace). XML does it the other way around.
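The JavaScript caveat can be checked directly. This is a minimal sketch, assuming a runtime that supports ES2018 Unicode property escapes (a recent Node.js or browser). The sample strings and variable names are illustrative only.

```javascript
// \w in JavaScript is [A-Za-z0-9_], with or without the u flag.
const asciiWord = /^\w+$/;
const asciiWordWithU = /^\w+$/u;

// A Unicode-aware stand-in for \w: any letter, any number, or underscore.
// Property escapes such as \p{L} require the u flag.
const unicodeWord = /^[\p{L}\p{N}_]+$/u;

const samples = [
  "Gunther",
  "G\u00fcnther", // Günther
  "\u041f\u0440\u0438\u0432\u0435\u0442", // Привет
  "\u65e5\u672c\u8a9e", // 日本語
];

const results = samples.map((text) => ({
  text,
  asciiWord: asciiWord.test(text),
  asciiWordWithU: asciiWordWithU.test(text),
  unicodeWord: unicodeWord.test(text),
}));

console.table(results);
```

Only the plain ASCII sample passes the two \w patterns, while all four pass the \p{L}-based pattern.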

Osteomalacia answered 14/7, 2011 at 11:40 Comment(3)

This depends highly on which language/regex syntax you're using. [[:alpha:]] is probably more standard. – Geochemistry 29/7, 2011 at 9:8

And if I don't want digits? – Throve 17/7, 2019 at 10:27

\w does not support international letters, e.g. Günther – Whensoever 26/7, 2022 at 9:51

Alphabet/Letter: \p{L}

Number: \p{N}

So, to match alphanumeric characters in all languages, you can use: [\p{L}\p{N}]+

I was looking for a way in JS to replace every run of non-alphanumeric characters, in any language, with a space, and ended up with the following:

const regexForNonAlphaNum = /[^\p{L}\p{N}]+/gu;
// replace() returns a new string, so keep the result
const cleanedText = someText.replace(regexForNonAlphaNum, " ");

Because this is JS, the u flag is needed at the end to make the regex Unicode-aware (without it, \p{...} is not treated as a property escape). The g flag stands for global: I wanted to match all instances, not just the first one.
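The same character class can also be anchored for validation, and dropping \p{N} gives a letters-only variant. The helpers below are a rough sketch rather than part of the original answer: the function names are hypothetical, and a runtime with ES2018 property escapes is assumed.

```javascript
// Hypothetical helper names, for illustration only.
const ALPHANUMERIC = /^[\p{L}\p{N}]+$/u; // letters and numbers from any script
const LETTERS_ONLY = /^\p{L}+$/u; // same idea without digits

function isAlphanumeric(text) {
  return ALPHANUMERIC.test(text);
}

function isLettersOnly(text) {
  return LETTERS_ONLY.test(text);
}

function replaceNonAlphanumeric(text, replacement = " ") {
  // g replaces every run, u enables \p{...}
  return text.replace(/[^\p{L}\p{N}]+/gu, replacement);
}

console.log(isAlphanumeric("G\u00fcnther42")); // true
console.log(isAlphanumeric("hello world")); // false (contains a space)
console.log(isLettersOnly("abc123")); // false (contains digits)
console.log(replaceNonAlphanumeric("H\u00e9llo, w\u00f6rld! 123")); // "Héllo wörld 123"
```

One caveat: \p{L} and \p{N} do not include combining marks, so text in decomposed form (for example "e" followed by U+0301) will not fully match unless \p{M} is added to the class or the input is normalized first.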

References:

https://www.linkedin.com/pulse/regex-one-pattern-rule-them-all-find-bring-darkness-bind-carranza/?trackingId=U6tRte%2BzTAG6O4AA3CrFmA%3D%3D

https://www.regular-expressions.info/unicode.html

Livengood answered 28/9, 2020 at 8:47 Comment(0)

Regex supporting most Latin-script languages (it does not cover other scripts such as Cyrillic, Arabic, or CJK)

^[A-zÀ-Ÿ\d-]*$
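A quick check of what this pattern accepts, written as a sketch in JavaScript (an assumption, since the answer does not name a language; the variable names are illustrative). Note that the range A-z also spans the ASCII characters that sit between Z and a, which are [ \ ] ^ _ and the backtick.

```javascript
// The pattern as posted.
const posted = /^[A-zÀ-Ÿ\d-]*$/;

// A tightened variant (not from the answer) that splits A-z into A-Z and a-z.
const tightened = /^[A-Za-zÀ-Ÿ\d-]*$/;

console.log(posted.test("G\u00fcnther")); // true: ü falls inside À-Ÿ
console.log(posted.test("a_b^c")); // true: _ and ^ fall inside A-z
console.log(posted.test("\u041f\u0440\u0438\u0432\u0435\u0442")); // false: Cyrillic is outside both ranges
console.log(tightened.test("a_b^c")); // false
console.log(tightened.test("G\u00fcnther-2")); // true
```

For scripts beyond Latin, a Unicode property class such as [\p{L}\p{N}] is likely the safer choice.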

Sestina answered 7/1, 2021 at 5:43 Comment(0)

The regex below is the only one that worked for me (Java):

"\\p{LD}+" ==> LD means any letter or digit.

If you want to strip all non-alphanumeric characters from your text, you can use the following:

text = text.replaceAll("\\P{LD}+", ""); // Note: P is capital, which negates the class. Strings are immutable, so assign the result.

Premonition answered 31/8, 2021 at 21:28 Comment(0)
