Unicode Character Properties
The ones that you’ve listed there in your example are actually all the same Unicode character property, the General Category property. Some regex systems provide access only to this one property alone; others include access to the Block property (not very useful) or to the Script property (much more useful).
A more complete explanation of the \p{Property Name} and \p{Property Name = Property Value} syntax in Perl regexes is given in the following text from page 209 of 🐪 Programming Perl, 4th edition, here reproduced with the kind permission of its author: 😼
All standard Unicode properties are actually composed of two parts, as in \p{NAME=VALUE}. All one-part properties are therefore additions to official Unicode properties. Boolean properties whose values are true can always be abbreviated as one-part properties, which allows you to write \p{Lowercase} for \p{Lowercase=True}. Other types of properties besides Boolean properties take string, numeric, or enumerated values. Perl also provides one-part aliases for all general category, script, and block properties, plus the level-one recommendations from Unicode Technical Standard #18 on Regular Expressions (version 13, from 2008-08), such as \p{Any}.
For example, \p{Armenian}, \p{IsArmenian}, and \p{Script=Armenian} all represent the same property, as do \p{Lu}, \p{GC=Lu}, \p{Uppercase_Letter}, and \p{General_Category=Uppercase_Letter}. Other examples of binary properties (those whose values are implicitly true) include \p{Whitespace}, \p{Alphabetic}, \p{Math}, and \p{Dash}. Examples of properties that aren’t binary properties include \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}. The perluniprops manpage lists all properties and their aliases that Perl supports, both standard Unicode properties and the Perl specials, too.
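The equivalence between one-part and two-part spellings can be tried out in Java as well. This is only a sketch: the class and method names are made up for illustration, and Java's spellings differ a little from Perl's (it wants an Is prefix or a script= keyword for scripts, and it uses the short General Category codes such as Lu).

```java
import java.util.regex.Pattern;

// Hypothetical helper class: checks that several spellings of one
// property all accept the same character.
class PropertySpellings {
    // True if the string s matches every one of the given patterns.
    static boolean allMatch(String s, String... patterns) {
        for (String p : patterns) {
            if (!Pattern.matches(p, s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String ayb = "\u0531"; // ARMENIAN CAPITAL LETTER AYB

        // Script property: one-part alias and two-part forms.
        System.out.println(allMatch(ayb,
            "\\p{IsArmenian}", "\\p{script=Armenian}", "\\p{sc=Armenian}"));

        // General Category property: short alias and two-part forms.
        System.out.println(allMatch(ayb,
            "\\p{Lu}", "\\p{IsLu}", "\\p{gc=Lu}", "\\p{general_category=Lu}"));
    }
}
```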
The complete list of Unicode character properties, and their meanings, is documented in section 5 on Properties from UAX#44, the Unicode Character Database. Those eleven properties that must be supported to meet UTS#18’s RL 1.2 on Properties are these:
RL1.2 Properties
To meet this requirement, an implementation shall provide at least a minimal list of properties, consisting of the following:
- General_Category
- Script
- Alphabetic
- Uppercase
- Lowercase
- White_Space
- Noncharacter_Code_Point
- Default_Ignorable_Code_Point
- ANY, ASCII, ASSIGNED
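As a rough illustration of what access to these properties looks like in practice, here is a small Java probe. It is a sketch under assumptions: the Rl12Probe class and its has method are hypothetical names, the property spellings are Java's rather than Perl's, and Default_Ignorable_Code_Point is left out because this sketch does not assume any particular Java spelling for it.

```java
import java.util.regex.Pattern;

// Hypothetical probe: asks whether one code point has a given property,
// using Java's \p{...} syntax.
class Rl12Probe {
    static boolean has(String property, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + property + "}", s);
    }

    public static void main(String[] args) {
        int alpha = 0x03B1; // GREEK SMALL LETTER ALPHA
        String[] properties = {
            "gc=Ll",                     // General_Category
            "sc=Greek",                  // Script
            "IsAlphabetic",
            "IsUppercase",
            "IsLowercase",
            "IsWhite_Space",
            "IsNoncharacter_Code_Point",
            "IsAssigned",
            "ASCII"
        };
        for (String property : properties) {
            System.out.printf("%-28s %b%n", property, has(property, alpha));
        }
    }
}
```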
Note that the single-letter character class abbreviations like \w, \d, \s, \b, and their uppercase complements, as well as the POSIX-sounding names like \p{alpha}, are themselves defined in terms of Unicode character properties in UTS#18’s Annex C on Compatibility Properties.
To the best of my knowledge, the only regex engines currently meeting the Level 1 requirements of UTS#18 for Basic Unicode Support are Perl, ICU’s regex library for C and C++, Java 7’s Pattern class, and Matthew Barnett’s excellent regex library for Python 2 and Python 3. The regexes used in Android are actually ICU’s, not Java’s as one might otherwise imagine, and so work much better with Unicode.
For Java 7, you must use the UNICODE_CHARACTER_CLASS pattern compilation flag, or an embedded (?U), to get the RL1.2a (\w &c) stuff going. For PCRE, you seem to need to embed (*UCP) at the start of the pattern, or set PCRE_UCP as a compilation flag. This may depend on how your version of PHP was built, which can be a problem.
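For the Java side of this, a minimal sketch of the difference the flag makes follows. The class and method names are hypothetical; Pattern.UNICODE_CHARACTER_CLASS and the embedded (?U) are the standard ones from java.util.regex. Without the flag, \w and \d only know about ASCII, so the accented word and the Arabic-Indic digits are rejected; with it, both are accepted.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting default and Unicode-aware shorthands.
class UnicodeShorthands {
    // Whole-string match with Java's default (ASCII-only) shorthand classes.
    static boolean matchesDefault(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Whole-string match with UNICODE_CHARACTER_CLASS switched on.
    static boolean matchesWithFlag(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "\u00E9t\u00E9";   // "été"
        String digits = "\u0663\u0664";  // ARABIC-INDIC DIGIT THREE, FOUR

        System.out.println(matchesDefault("\\w+", word));
        System.out.println(matchesWithFlag("\\w+", word));
        // The embedded form needs no flag argument at compile time.
        System.out.println(matchesDefault("(?U)\\w+", word));

        System.out.println(matchesDefault("\\d+", digits));
        System.out.println(matchesWithFlag("\\d+", digits));
    }
}
```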
Russ Cox’s RE2 library, which has bindings for C and C++, is also available as a Perl regex engine plugin, and is now the standard regex library of the Go programming language, supports the two most important properties: General Category and Script.
PCRE & PHP
I believe that PCRE is still a ways off from meeting RL 1.2’s requirements on properties. It handles both the General Category and the Script properties, which are the two most important and commonly used properties, but does not seem to let you get at the other nine requisite properties. Its POSIX-compatible properties like alpha, upper, lower, and space are specifically documented to be 7-bit ASCII only, in contravention to RL 1.2a. However, PCRE also offers these specials:
Xan    Alphanumeric: union of properties L and N
Xps    POSIX space: property Z or tab, NL, VT, FF, CR
Xsp    Perl space: property Z or tab, NL, FF, CR
Xwd    Perl word: property Xan or underscore
Note that PCRE’s \p{Xan} is still different from what Unicode says \p{alnum} must mean, because it’s missing combining marks, for example, and certain alphabetic symbols. The Perl \p{alnum} follows the Unicode definition. In the same way, PCRE’s \p{Xwd} differs from Unicode’s (and Perl’s), in that it is missing the extra alphabetics and the rest of the \p{GC=Connector_Punctuation} characters. The next revision to UTS#18 also adds \p{Join_Control} to the set of \p{word} characters.
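The gap between "L or N" and the Unicode notion of alnum can be seen with a few sample characters. The sketch below does not run PCRE; it only approximates Xan in Java as the union of \p{L} and \p{N}, following the description above, and compares that with Java's Unicode-aware \p{Alnum}. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of "union of L and N" with Unicode alnum.
class AlnumComparison {
    // Approximation of PCRE's Xan as described above (an assumption,
    // not PCRE itself).
    static final Pattern L_OR_N = Pattern.compile("[\\p{L}\\p{N}]");
    // Java's \p{Alnum} with Unicode semantics turned on via (?U).
    static final Pattern UNICODE_ALNUM = Pattern.compile("(?U)\\p{Alnum}");

    static boolean inLOrN(int codePoint) {
        return L_OR_N.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean inUnicodeAlnum(int codePoint) {
        return UNICODE_ALNUM.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0061, // LATIN SMALL LETTER A
            0x0345, // COMBINING GREEK YPOGEGRAMMENI (a combining mark)
            0x24B6, // CIRCLED LATIN CAPITAL LETTER A (an alphabetic symbol)
            0x00B2  // SUPERSCRIPT TWO (a number, but not a decimal digit)
        };
        for (int cp : samples) {
            System.out.printf("U+%04X %-32s L|N=%b alnum=%b%n",
                cp, Character.getName(cp), inLOrN(cp), inUnicodeAlnum(cp));
        }
    }
}
```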
More Properties
Of those four that meet RL 1.2 and RL 1.2a, all but Java 7 also meet (or come extremely close to meeting, sometimes under an alternate syntax like \N{…} in lieu of the \p{name=…} syntax) the new RL 2.7 on Full Properties from the proposed update to UTS#18 posted earlier this month, which reads in part:
RL2.7 Full Properties
To meet this requirement, an implementation shall support all of the properties listed below that are in the supported version of Unicode, with values that match the Unicode definitions for that version.
To meet requirement RL2.7, the implementation must satisfy the Unicode definition of the properties for the supported version of Unicode, rather than other possible definitions. However, the names used by the implementation for these properties may differ from the formal Unicode names for the properties. For example, if a regex engine already has a property called "Alphabetic", for backwards compatibility it may need to use a distinct name, such as "Unicode_Alphabetic", for the corresponding property listed in RL1.2.
[table omitted for brevity —tchrist]
The Name and Name_Alias properties are used in \p{name=…} and \N{…}. The data in NamedSequences.txt is also used in \N{…}. For more information see Section 2.5, Name Properties. The Script and Script_Extensions properties are used in \p{scx=…}. For more information, see Section 1.2.2, Script_Property.
The list excludes contributory, obsolete, and deprecated properties, most provisional properties, and the Unicode_1_Name and Unicode_Radical_Stroke properties. The properties in gray are covered by RL1.2 Properties. For more information on properties, see UAX #44, Unicode Character Database [UAX44].
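To make the Name property a little more concrete, here is a hedged Java sketch. It assumes a JDK recent enough to accept \N{…} in java.util.regex.Pattern (Java 9 or later, so not the Java 7 discussed above); the NameLookup class and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper around the Unicode Name property.
class NameLookup {
    // The Unicode character name of a code point, as the JDK reports it.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    // True if \N{name} matches the given code point.
    // Assumption: the running JDK supports \N{...} in patterns (Java 9+).
    static boolean matchesNamed(String name, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\N{" + name + "}", s);
    }

    public static void main(String[] args) {
        int alpha = 0x03B1;
        System.out.println(nameOf(alpha));
        System.out.println(matchesNamed("GREEK SMALL LETTER ALPHA", alpha));
    }
}
```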
Unicode Property Exploration Tools
Three standalone tools that you might want to keep handy for exploring Unicode character properties are uniprops, unichars, and uninames. They’re also available as part of the larger Unicode::Tussle suite from CPAN.
Quick demos:
$ uniprops -a 3b1
U+03B1 ‹α› \N{GREEK SMALL LETTER ALPHA}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC
Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base
Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_
Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Greek Block=Greek_And_Coptic BLK=Greek
Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=A
East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
Script=Greek Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1
IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Grek Script=Grek
Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE Word_Break=LE
$ unichars '\pN' '\D' '\p{Latin}'
Ⅰ 8544 02160 ROMAN NUMERAL ONE
Ⅱ 8545 02161 ROMAN NUMERAL TWO
Ⅲ 8546 02162 ROMAN NUMERAL THREE
Ⅳ 8547 02163 ROMAN NUMERAL FOUR
Ⅴ 8548 02164 ROMAN NUMERAL FIVE
Ⅵ 8549 02165 ROMAN NUMERAL SIX
Ⅶ 8550 02166 ROMAN NUMERAL SEVEN
Ⅷ 8551 02167 ROMAN NUMERAL EIGHT
(etc)
$ uninames Old English
æ 00E6 LATIN SMALL LETTER AE
= latin small ligature ae (1.0)
= ash (from Old English æsc)
* Danish, Norwegian, Icelandic, Faroese, Old English, French, IPA
x (latin small ligature oe - 0153)
x (cyrillic small ligature a ie - 04D5)
ð 00F0 LATIN SMALL LETTER ETH
* Icelandic, Faroese, Old English, IPA
x (latin capital letter eth - 00D0)
x (greek small letter delta - 03B4)
x (partial differential - 2202)
þ 00FE LATIN SMALL LETTER THORN
* Icelandic, Old English, phonetics
* Runic letter borrowed into Latin script
x (runic letter thurisaz thurs thorn - 16A6)
œ 0153 LATIN SMALL LIGATURE OE
= ethel (from Old English eðel)
* French, IPA, Old Icelandic, Old English, ...
x (latin small letter ae - 00E6)
x (latin letter small capital oe - 0276)
ƿ 01BF LATIN LETTER WYNN
= wen
* Runic letter borrowed into Latin script
* replaced by "w" in modern transcriptions of Old English
* uppercase is 01F7
x (runic letter wunjo wynn w - 16B9)
ǣ 01E3 LATIN SMALL LETTER AE WITH MACRON
* Old Norse, Old English
: 00E6 0304
⁊ 204A TIRONIAN SIGN ET
* Irish Gaelic, Old English, ...
x (ampersand - 0026)
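If you don't have those tools to hand, the idea behind the unichars demo above (list every character that matches all of the given patterns) can be approximated in a few lines of Java. This is a sketch rather than a reimplementation of unichars: the CodePointFilter class is a hypothetical name, the patterns use Java's spellings (\p{IsLatin} rather than \p{Latin}), and the results depend on the Unicode version your JDK ships with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical approximation of: unichars '\pN' '\D' '\p{Latin}'
class CodePointFilter {
    // Every defined, non-surrogate code point that matches all patterns.
    static List<Integer> matchingAll(String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<Integer> hits = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isDefined(cp)
                    || Character.getType(cp) == Character.SURROGATE) {
                continue; // skip unassigned code points and lone surrogates
            }
            String s = new String(Character.toChars(cp));
            boolean all = true;
            for (Pattern p : patterns) {
                if (!p.matcher(s).matches()) {
                    all = false;
                    break;
                }
            }
            if (all) {
                hits.add(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> hits = matchingAll("\\p{N}", "\\D", "\\p{IsLatin}");
        // Print the first few in a layout similar to the unichars demo.
        for (int cp : hits.subList(0, Math.min(8, hits.size()))) {
            System.out.printf(" %s %d %05X %s%n",
                new String(Character.toChars(cp)), cp, cp, Character.getName(cp));
        }
    }
}
```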