I had this problem today:
This regex matches only English: [a-zA-Z0-9].
If I need support for any language in this world, what regex should I write?
If you use character class shorthands and a Unicode-aware regex engine, you can do that. The \w class matches "word characters" (letters, digits, and underscores).
Beware of some regex flavors that don't do this so well: JavaScript uses ASCII for \d (digits) and \w, but Unicode for \s (whitespace). XML does it the other way around.
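To make the JavaScript caveat concrete, here is a small sketch comparing \w with the Unicode property escapes \p{L} and \p{N}. It assumes an engine that supports property escapes with the u flag (a modern browser or a recent Node.js); the sample words are arbitrary and only for illustration.

```javascript
// Sample words are illustrative only: Latin, Cyrillic, Japanese, Arabic-Indic digits.
const samples = ["hello", "привет", "日本語", "٣٤٥"];

// \w in JavaScript covers only [A-Za-z0-9_].
const asciiWord = /^\w+$/;

// \p{L} = any Unicode letter, \p{N} = any Unicode number; both require the u flag.
const unicodeAlnum = /^[\p{L}\p{N}]+$/u;

const results = samples.map((s) => ({
  text: s,
  matchesW: asciiWord.test(s),
  matchesUnicode: unicodeAlnum.test(s),
}));

console.log(results);
```

Only the ASCII sample should pass the \w test, while all four should pass the property-escape test.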
Alphabet/Letter: \p{L}
Number: \p{N}
So to match alphanumeric characters in all languages, you can use: [\p{L}\p{N}]+
I was looking for a way to replace all non-alphanumeric characters, in any language, with a space in JS, and ended up doing it like this:
const regexForNonAlphaNum = /[^\p{L}\p{N}]+/gu;
const cleaned = someText.replace(regexForNonAlphaNum, " ");
Since this is JS, we need to add the u flag at the end to make the regex Unicode-aware. The g flag stands for global, because I wanted to match all instances and not just the first one.
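One caveat with this pattern: \p{L} and \p{N} do not cover combining marks (\p{M}), which many scripts use for vowel signs and accents. The sketch below is a variation of my own, not part of the answer above; the helper name stripNonAlnum and its keepMarks option are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: replaces runs of non-alphanumeric characters with a space
// and trims the result. keepMarks is an illustrative option that also preserves
// combining marks (\p{M}), such as Devanagari vowel signs.
function stripNonAlnum(text, keepMarks = false) {
  const pattern = keepMarks ? /[^\p{L}\p{N}\p{M}]+/gu : /[^\p{L}\p{N}]+/gu;
  return text.replace(pattern, " ").trim();
}

// Sample text chosen for illustration: Hindi words with vowel signs and punctuation.
const hindi = "हिन्दी, भाषा!";

const withoutMarks = stripNonAlnum(hindi);    // vowel signs are treated as separators
const withMarks = stripNonAlnum(hindi, true); // vowel signs stay attached to their letters

console.log(withoutMarks);
console.log(withMarks);
```

Without \p{M}, the words are broken apart wherever a vowel sign occurs, so adding it to the class is worth considering if the text may contain such scripts.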
[[:alpha:]] is probably more standard. – Geochemistry
The regex below is the only one that worked for me:
"\\p{LD}+" ==> LD means any letter or digit.
If you want to strip all non-alphanumeric characters from your text, you can use the following:
text = text.replaceAll("\\P{LD}+", ""); // Note the capital P, which negates the class.