Matching (e.g.) a Unicode letter with Java regexps

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, Unicode has many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters".

The Java Pattern documentation defines POSIX character classes for things like alphabetic characters (\p{Alpha}), but these are specified to work only with US-ASCII. The predefined character class for word characters, \w, is defined as [a-zA-Z_0-9], which also excludes many letters.
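To make the problem concrete, here is a small sketch of how the ASCII-oriented classes behave with default flags. The class name `AsciiOnlyClasses` and the helper `matches` are made up for illustration; the Greek letter is written as an escape so the source file stays pure ASCII.

```java
import java.util.regex.Pattern;

public class AsciiOnlyClasses {
    // GREEK SMALL LETTER ALPHA, written as an escape so the source stays ASCII.
    static final String ALPHA = "\u03B1";

    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{Alpha}", "a"));    // true
        System.out.println(matches("[a-zA-Z]", ALPHA));    // false
        System.out.println(matches("\\p{Alpha}", ALPHA));  // false
        System.out.println(matches("\\w", ALPHA));         // false
    }
}
```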

So how do you properly match against Unicode strings? Is there some other library that gets this right?

Bea answered 15/3, 2011 at 17:10 Comment(2)

This might help as well - #4305428 – Latticed 15/3, 2011 at 17:15

@spinning_plate, thanks. I did search for existing questions but didn't find that one. – Bea 15/3, 2011 at 17:23

Here you have a very nice explanation:

http://www.regular-expressions.info/unicode.html

Some hints:

Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+.

In Java, the regex token \uFFFF matches only the specified code point, even when canonical equivalence is turned on. However, the same \uFFFF syntax is also used to insert Unicode characters into string literals in Java source code. With canonical equivalence enabled, Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.
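As a rough illustration of the \P{M}\p{M}* substitute mentioned in the first hint, the sketch below counts "graphemes" as one non-mark code point followed by any number of combining marks. The class name `GraphemeSketch` and the method `countGraphemes` are hypothetical. As far as I know, Java 9 and later also accept \X directly, so this is mainly useful on older runtimes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeSketch {
    // One base code point followed by any number of combining marks.
    private static final Pattern GRAPHEME = Pattern.compile("\\P{M}\\p{M}*");

    // Hypothetical helper: counts how many times the pattern above matches.
    static int countGraphemes(String s) {
        Matcher m = GRAPHEME.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0"; // a-grave as a single code point
        String decomposed = "a\u0300"; // 'a' followed by COMBINING GRAVE ACCENT
        System.out.println(countGraphemes(precomposed)); // 1
        System.out.println(countGraphemes(decomposed));  // 1
        System.out.println(decomposed.length());         // 2
    }
}
```

Both encodings count as one unit, even though the decomposed string has two chars.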

Inning answered 15/3, 2011 at 17:16 Comment(3)

Thanks. That page also points out why \p{L} is not sufficient to match any letter (you need \p{L}\p{M}*) – Bea 15/3, 2011 at 17:26

This should be edited to show the double slash in one of the Pattern.compile statements. – Escapee 1/6, 2011 at 17:23

How do I replace a Unicode character? For example, turn Test™ into Test by removing the trademark symbol ™ from the string. I tried "Test™".replace("™", ""), but it works on Windows and fails on a Linux build machine. – Weitman 15/6, 2023 at 20:17

Are you talking about Unicode categories, like letters? These are matched by a regex of the form \p{CAT}, where "CAT" is the category code like L for any letter, or a subcategory like Lu for uppercase or Lt for title-case.
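A short sketch of the \p{CAT} form in use. The class and method names are invented for this example. It also shows the letter-plus-marks variant \p{L}\p{M}*, which allows for letters written with combining marks.

```java
import java.util.regex.Pattern;

public class LetterCategories {
    // A run of letters, each optionally followed by combining marks.
    private static final Pattern WORD = Pattern.compile("(?:\\p{L}\\p{M}*)+");
    // Letters only, with no allowance for combining marks.
    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");
    // A single uppercase letter (subcategory Lu).
    private static final Pattern UPPER = Pattern.compile("\\p{Lu}");

    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    static boolean isLettersOnly(String s) {
        return LETTERS_ONLY.matcher(s).matches();
    }

    static boolean isUpperLetter(String s) {
        return UPPER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String moscow = "\u041C\u043E\u0441\u043A\u0432\u0430"; // "Moskva" in Cyrillic
        String decomposed = "a\u0300"; // 'a' + COMBINING GRAVE ACCENT

        System.out.println(isWord(moscow));            // true
        System.out.println(isWord("abc1"));            // false
        System.out.println(isLettersOnly(decomposed)); // false
        System.out.println(isWord(decomposed));        // true
        System.out.println(isUpperLetter("\u0416"));   // true  (CYRILLIC CAPITAL LETTER ZHE)
        System.out.println(isUpperLetter("\u0436"));   // false (CYRILLIC SMALL LETTER ZHE)
    }
}
```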

Pecuniary answered 15/3, 2011 at 17:13 Comment(5)

@Pointy - You're right, you don't need to specify the sub-category. – Pecuniary 15/3, 2011 at 17:17

Could well. I'd managed to provide the link to the Java spec and not notice that it says Both \p{L} and \p{IsL} denote the category of Unicode letters as an example :( A list of categories is here fileformat.info/info/unicode/category/index.htm – Bea 15/3, 2011 at 17:21

Although see @eLobato's answer for why that '\p{L}' isn't sufficient. – Bea 15/3, 2011 at 17:28

How do I replace a Unicode character? For example, turn Test™ into Test by removing the trademark symbol ™ from the string. I tried "Test™".replace("™", ""), but it works on Windows and fails on a Linux build machine. – Weitman 15/6, 2023 at 20:18

@Victor like the other answers explain, replace ™ in your source code with its Unicode escape, \u2122. If it works on one machine and not another, it's probably due to the default character encoding being different on different machines, and the source code being read differently when it's compiled on the other machine. Restricting your source code to US-ASCII characters by replacing other characters with the \uXXXX escape will avoid that. – Pecuniary 16/6, 2023 at 16:2
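A minimal sketch of the escape-based approach described in the comment above. The class name `StripTrademark` and the method `stripTrademark` are made up for illustration. Because the source contains only ASCII, the result should not depend on the platform's default source encoding.

```java
public class StripTrademark {
    // TRADE MARK SIGN written as a Unicode escape, so the source file is pure ASCII.
    private static final String TRADEMARK = "\u2122";

    // Hypothetical helper: removes every trademark sign from the input.
    static String stripTrademark(String s) {
        return s.replace(TRADEMARK, "");
    }

    public static void main(String[] args) {
        String input = "Test\u2122";
        System.out.println(stripTrademark(input)); // Test
    }
}
```

An alternative, if you prefer to keep the literal symbol in the source, is to tell the compiler the file encoding explicitly (for example with javac's -encoding option).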

Quoting from the JavaDoc of java.util.regex.Pattern.

Unicode support

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.

Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
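The quoted claims can be checked with a small sketch like the one below. The class name `JavadocClaims` and its helpers are hypothetical; the block name Greek and the Is prefix are taken from the Pattern documentation.

```java
import java.util.regex.Pattern;

public class JavadocClaims {
    static final String EM_DASH = "\u2014";     // decoded by the Java compiler
    static final String GREEK_ALPHA = "\u03B1"; // GREEK SMALL LETTER ALPHA

    // One character: the compiler has already replaced the escape.
    static String compilerEscaped() {
        return "\u2014";
    }

    // Six characters: backslash, 'u', '2', '0', '1', '4'. The regex parser decodes them.
    static String regexEscaped() {
        return "\\u2014";
    }

    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(compilerEscaped().equals(regexEscaped())); // false
        System.out.println(matches(compilerEscaped(), EM_DASH));      // true
        System.out.println(matches(regexEscaped(), EM_DASH));         // true

        System.out.println(matches("\\p{InGreek}", GREEK_ALPHA));     // true  (block)
        System.out.println(matches("\\p{IsL}", GREEK_ALPHA));         // true  (category, Is prefix)
        System.out.println(matches("\\P{L}", GREEK_ALPHA));           // false (negated category)
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", GREEK_ALPHA)); // true (inside a class)
    }
}
```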

Delapaz answered 15/3, 2011 at 17:15 Comment(2)

The formatter is eating the double slash. This should say: Thus the strings "\u2014" and "\\u2014", while not equal... – Escapee 1/6, 2011 at 17:21

How do I replace a Unicode character? For example, turn Test™ into Test by removing the trademark symbol ™ from the string. I tried "Test™".replace("™", ""), but it works on Windows and fails on a Linux build machine. – Weitman 15/6, 2023 at 20:19
