Regex - Unicode Properties Reference and Examples

I feel lost with the regex Unicode properties presented by RegexBuddy. I cannot distinguish between any of the Number properties, and the Math symbol property seems to match only + but not -, *, /, or ^, for instance.

[Screenshot: RegexBuddy Unicode Properties]

Is there any documentation / reference with examples on regular expressions Unicode properties?

Middleclass answered 14/1, 2010 at 6:17 Comment(0)

A list of Unicode properties can be found in http://www.unicode.org/Public/UNIDATA/PropList.txt.

The properties for each character can be found in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (1.2 MB).
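If you want to read that file yourself, here is a minimal Python sketch of the parsing step. It assumes only that each line is a semicolon-separated record whose first three fields are the code point in hex, the character name, and the General Category. The sample lines and the helper name `parse_unicode_data` are illustrative, not part of any official tool.

```python
# Sample records in the UnicodeData.txt layout; only the first three
# fields (code point, name, General Category) are used below.
SAMPLE = """\
002B;PLUS SIGN;Sm;0;ES;;;;;N;;;;;
002D;HYPHEN-MINUS;Pd;0;ES;;;;;N;;;;;
005E;CIRCUMFLEX ACCENT;Sk;0;ON;;;;;N;SPACING CIRCUMFLEX;;;;
"""


def parse_unicode_data(text):
    # Hypothetical helper: map each code point to (name, general category).
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = (fields[1], fields[2])
    return table


table = parse_unicode_data(SAMPLE)
for code_point, (name, category) in sorted(table.items()):
    print(f"U+{code_point:04X} {chr(code_point)} {name}: {category}")
```

To process the real file you would pass its downloaded contents to the same function instead of `SAMPLE`.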

In your case,

  • + (PLUS SIGN) is Sm,
  • - (HYPHEN-MINUS) is Pd,
  • * (ASTERISK) is Po,
  • / (SOLIDUS) is also Po, and
  • ^ (CIRCUMFLEX ACCENT) is Sk.

You're better off matching them with [-+*/^].
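As a quick check, the sketch below uses Python's standard `unicodedata` module to print the General Category of each operator, then matches them with the explicit character class. Python's built-in `re` does not support `\p{...}`, so this only illustrates the categories and the explicit class. The variable names are my own.

```python
import re
import unicodedata

OPERATORS = "+-*/^"

# General Category of each operator, as reported by Python's bundled
# copy of the Unicode Character Database.
categories = {ch: unicodedata.category(ch) for ch in OPERATORS}
for ch, category in categories.items():
    print(ch, unicodedata.name(ch), category)

# Explicit character class; the hyphen comes first so it is literal,
# and the caret is not in first position so it is literal too.
OPERATOR_RE = re.compile(r"[-+*/^]")
found = OPERATOR_RE.findall("a + b - c * d / e ^ f")
print(found)
```

Only `+` falls in Sm, which is why a Math Symbol property test misses the other four.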

Soundless answered 14/1, 2010 at 6:36 Comment(0)

Unicode Character Properties

The ones that you’ve listed there in your example are actually all the same Unicode character property, the General Category property. Some regex systems provide access only to this one property alone; others include access to the Block property (not very useful) or to the Script property (much more useful).

A more complete explanation of the \p{Property Name} and \p{Property Name = Property Value} syntax in Perl regexes is given in the following text from page 209 of 🐪 Programming Perl, 4th edition, here reproduced with the kind permission of its author: 😼

All standard Unicode properties are actually composed of two parts, as in \p{NAME=VALUE}. All one-part properties are therefore additions to official Unicode properties. Boolean properties whose values are true can always be abbreviated as one-part properties, which allows you to write \p{Lowercase} for \p{Lowercase=True}. Other types of properties besides Boolean properties take string, numeric, or enumerated values. Perl also provides one-part aliases for all general category, script, and block properties, plus the level-one recommendations from Unicode Technical Standard #18 on Regular Expressions (version 13, from 2008-08), such as \p{Any}.

For example, \p{Armenian}, \p{IsArmenian}, and \p{Script=Armenian} all represent the same property, as do \p{Lu}, \p{GC=Lu}, \p{Uppercase_Letter}, and \p{General_Category=Uppercase_Letter}. Other examples of binary properties (those whose values are implicitly true) include \p{Whitespace}, \p{Alphabetic}, \p{Math}, and \p{Dash}. Examples of properties that aren’t binary properties include \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}. The perluniprops manpage lists all properties and their aliases that Perl supports, both standard Unicode properties and the Perl specials, too.
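To make the NAME=VALUE idea concrete outside Perl, here is a rough Python sketch that tests a few of these properties one character at a time with the standard `unicodedata` module. It is not a regex engine and covers only the properties that module exposes. The helper names are hypothetical, and the Bidi class is given by its short value `R`.

```python
import unicodedata


def has_general_category(ch, value):
    # Rough analogue of \p{GC=value}; a one-letter value such as "L"
    # also covers its subcategories (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith(value)


def has_numeric_value(ch, value):
    # Rough analogue of \p{Numeric_Value=value}; characters without a
    # numeric value yield None and so never compare equal.
    return unicodedata.numeric(ch, None) == value


def has_bidi_class(ch, value):
    # Rough analogue of \p{Bidi_Class=value}, using short values like "R".
    return unicodedata.bidirectional(ch) == value


print(has_general_category("\u03a3", "Lu"))  # GREEK CAPITAL LETTER SIGMA
print(has_general_category("\u03c3", "Lu"))  # GREEK SMALL LETTER SIGMA
print(has_numeric_value("\u2169", 10))       # ROMAN NUMERAL TEN
print(has_bidi_class("\u05d0", "R"))         # HEBREW LETTER ALEF
```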

The complete list of Unicode character properties, and their meanings, is documented in section 5 on Properties from UAX#44, the Unicode Character Database. Those eleven properties that must be supported to meet UTS#18’s RL 1.2 on Properties are these:

RL1.2 Properties

To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following:

  • General_Category
  • Script
  • Alphabetic
  • Uppercase
  • Lowercase
  • White_Space
  • Noncharacter_Code_Point
  • Default_Ignorable_Code_Point
  • ANY, ASCII, ASSIGNED

Note that the single-letter character class abbreviations like \w, \d, \s, \b, and their uppercase complements, as well as the POSIX-sounding names like \p{alpha}, are themselves defined in terms of Unicode character properties in UTS#18’s Annex C on Compatibility Properties.
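Python's built-in `re` module is not one of the engines discussed here, but it gives a small illustration of a shorthand class being tied to a Unicode property: for `str` patterns its `\d` matches characters in General Category Nd, unless the `re.ASCII` flag is set. This sketch compares the two behaviours on three sample characters. The sample choice is mine.

```python
import re
import unicodedata

unicode_digit = re.compile(r"\d")          # Unicode-aware by default for str
ascii_digit = re.compile(r"\d", re.ASCII)  # restricts \d to [0-9]

samples = {
    "DIGIT SEVEN": "7",
    "ARABIC-INDIC DIGIT THREE": "\u0663",
    "ROMAN NUMERAL EIGHT": "\u2167",
}

# For each sample: (matches Unicode \d, matches ASCII-only \d)
results = {
    name: (bool(unicode_digit.fullmatch(ch)), bool(ascii_digit.fullmatch(ch)))
    for name, ch in samples.items()
}

for name, ch in samples.items():
    print(name, unicodedata.category(ch), results[name])
```

The Roman numeral is a number (category Nl) but not a decimal digit (Nd), so neither form of `\d` matches it.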

To the best of my knowledge, the only regex engines currently meeting the Level 1 requirements of UTS#18 for Basic Unicode Support are Perl, ICU’s regex library for C and C++, Java 7’s Pattern class, and Matthew Barnett’s excellent regexp library for Python 2 and Python 3. The regexes used in Android are actually ICU’s, not Java’s as one might otherwise imagine, and so work much better with Unicode.

For Java 7, you must use the UNICODE_CHARACTER_CLASS pattern compilation flag, or an embedded (?U), to get the RL1.2a (\w &c) stuff going. For PCRE, you seem to need to embed (*PCRE_UCP), or use that as a compilation flag. This may depend on how your version of PHP was built, which can be a problem.

Russ Cox’s RE2 library, with bindings available for C and C++, plus as a Perl regex engine plugin, and now the standard regex library used by the Go programming language, supports the two most important properties, both General Category and Script.

PCRE & PHP

I believe that PCRE is still a ways off from meeting RL 1.2’s requirements on properties. It handles both the General Category and the Script properties, which are the two most important and commonly used properties, but does not seem to let you get at the other nine requisite properties. Its POSIX-compatible properties like alpha, upper, lower, and space are specifically documented to be 7-bit ASCII only, in contravention to RL 1.2a. However, PCRE also offers these specials:

  • Xan Alphanumeric: union of properties L and N
  • Xps POSIX space: property Z or tab, NL, VT, FF, CR
  • Xsp Perl space: property Z or tab, NL, FF, CR
  • Xwd Perl word: property Xan or underscore

Note that PCRE’s \p{Xan} is still different from what Unicode says \p{alnum} must mean, because it’s missing combining marks, for example, and certain alphabetic symbols. The Perl \p{alnum} follows the Unicode definition. In the same way, PCRE’s \p{Xwd} differs from Unicode’s (and Perl’s), in that it is missing the extra alphabetics and the rest of the \p{GC=Connector_Punctuation} characters. The next revision to UTS#18 also adds \p{Join_Control} to the set of \p{word} characters.
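The four PCRE specials listed above are simple enough to sketch as predicates. The following Python approximation follows only the one-line definitions given in that list, using `unicodedata.category`. It is not PCRE's actual implementation, and the function names are made up for illustration.

```python
import unicodedata


def is_xan(ch):
    # Xan: union of General Category L (letters) and N (numbers).
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xwd(ch):
    # Xwd: Xan or underscore.
    return is_xan(ch) or ch == "_"


def is_xsp(ch):
    # Xsp: property Z or tab, NL, FF, CR.
    return unicodedata.category(ch)[0] == "Z" or ch in ("\t", "\n", "\f", "\r")


def is_xps(ch):
    # Xps: same as Xsp, plus VT.
    return is_xsp(ch) or ch == "\v"


print(is_xan("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE, category Ll
print(is_xan("\u093e"))  # DEVANAGARI VOWEL SIGN AA, category Mc
print(is_xan("\u24b6"))  # CIRCLED LATIN CAPITAL LETTER A, category So
print(is_xwd("\u203f"))  # UNDERTIE, category Pc
print(is_xsp("\v"), is_xps("\v"))
```

The vowel sign and the circled letter are both Alphabetic in Unicode's sense, yet fall outside L and N, which is the kind of gap described above. The undertie shows the missing Connector_Punctuation characters.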

More Properties

Of those four that meet RL 1.2 and RL 1.2a, all but Java 7 also meet (or come extremely close to meeting, sometimes under an alternate syntax like \N{…} in lieu of the \p{name=…} syntax) the new RL 2.7 on Full Properties from the proposed update to UTS#18 posted earlier this month, which reads in part:

RL2.7 Full Properties

To meet this requirement, an implementation shall support all of the properties listed below that are in the supported version of Unicode, with values that match the Unicode definitions for that version.

To meet requirement RL2.7, the implementation must satisfy the Unicode definition of the properties for the supported version of Unicode, rather than other possible definitions. However, the names used by the implementation for these properties may differ from the formal Unicode names for the properties. For example, if a regex engine already has a property called "Alphabetic", for backwards compatibility it may need to use a distinct name, such as "Unicode_Alphabetic", for the corresponding property listed in RL1.2.

[table omitted for brevity —tchrist]

The Name and Name_Alias properties are used in \p{name=…} and \N{…}. The data in NamedSequences.txt is also used in \N{…}. For more information see Section 2.5, Name Properties. The Script and Script_Extensions properties are used in \p{scx=…}. For more information, see Section 1.2.2, Script_Property. The list excludes contributory, obsolete, and deprecated properties, most provisional properties, and the Unicode_1_Name and Unicode_Radical_Stroke properties. The properties in gray are covered by RL1.2 Properties. For more information on properties, see UAX #44, Unicode Character Database [UAX44].

Unicode Property Exploration Tools

Three standalone tools that you might want to keep handy for exploring Unicode character properties are uniprops, unichars, and uninames. They’re also available as part of the larger Unicode::Tussle suite from CPAN.

Quick demos:

$ uniprops -a 3b1
U+03B1 ‹α› \N{GREEK SMALL LETTER ALPHA}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base
       Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_
       Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
       X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Greek Block=Greek_And_Coptic BLK=Greek
       Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
       Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=A
       East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
       Script=Greek Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
       Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
       Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
       Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1
       IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
       Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Grek Script=Grek
       Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE Word_Break=LE

$ unichars '\pN' '\D' '\p{Latin}'
 Ⅰ      8544  02160  ROMAN NUMERAL ONE
 Ⅱ      8545  02161  ROMAN NUMERAL TWO
 Ⅲ      8546  02162  ROMAN NUMERAL THREE
 Ⅳ      8547  02163  ROMAN NUMERAL FOUR
 Ⅴ      8548  02164  ROMAN NUMERAL FIVE
 Ⅵ      8549  02165  ROMAN NUMERAL SIX
 Ⅶ      8550  02166  ROMAN NUMERAL SEVEN
 Ⅷ      8551  02167  ROMAN NUMERAL EIGHT
 (etc)

$ uninames Old English
 æ  00E6        LATIN SMALL LETTER AE
        = latin small ligature ae (1.0)
        = ash (from Old English æsc)
        * Danish, Norwegian, Icelandic, Faroese, Old English, French, IPA
        x (latin small ligature oe - 0153)
        x (cyrillic small ligature a ie - 04D5)
 ð  00F0        LATIN SMALL LETTER ETH
        * Icelandic, Faroese, Old English, IPA
        x (latin capital letter eth - 00D0)
        x (greek small letter delta - 03B4)
        x (partial differential - 2202)
 þ  00FE        LATIN SMALL LETTER THORN
        * Icelandic, Old English, phonetics
        * Runic letter borrowed into Latin script
        x (runic letter thurisaz thurs thorn - 16A6)
 œ  0153        LATIN SMALL LIGATURE OE
        = ethel (from Old English eðel)
        * French, IPA, Old Icelandic, Old English, ...
        x (latin small letter ae - 00E6)
        x (latin letter small capital oe - 0276)
 ƿ  01BF        LATIN LETTER WYNN
        = wen
        * Runic letter borrowed into Latin script
        * replaced by "w" in modern transcriptions of Old English
        * uppercase is 01F7
        x (runic letter wunjo wynn w - 16B9)
 ǣ  01E3        LATIN SMALL LETTER AE WITH MACRON
        * Old Norse, Old English
        : 00E6 0304
 ⁊  204A        TIRONIAN SIGN ET
        * Irish Gaelic, Old English, ...
        x (ampersand - 0026)
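
If you do not have those tools installed, a small subset of what the uniprops and unichars demos above show can be approximated with Python's standard `unicodedata` module. This is only a sketch: the module exposes a handful of properties, has no Script property (so the `\p{Latin}` filter is not reproduced), and the function name `describe` is my own.

```python
import unicodedata


def describe(ch):
    # Collect the few properties that unicodedata exposes for one character.
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "canonical_combining_class": unicodedata.combining(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "numeric_value": unicodedata.numeric(ch, None),
    }


info = describe("\u03b1")
for key, value in info.items():
    print(f"{key}: {value}")

# Numbers that are not decimal digits, restricted to U+2160..U+2167.
roman = [
    unicodedata.name(chr(cp))
    for cp in range(0x2160, 0x2168)
    if unicodedata.category(chr(cp)).startswith("N")
    and not chr(cp).isdecimal()
]
print(roman)
```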
Coastwise answered 29/3, 2012 at 18:0 Comment(0)