Yes and no.
If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w class matches both more and less than that. It specifically excludes any \pN that is neither \p{Nd} nor \p{Nl}, such as the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
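To see the "more and less" concretely, here is a minimal Perl sketch that runs a few code points through both the alphanumeric class and \w. The helper names matches_alnum_class and matches_word are made up for illustration, and the sample characters are my own choice, not taken from the answer.

```perl
use 5.014;
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helpers for illustration only.
# Each returns 1 if the single character matches, otherwise 0.
sub matches_alnum_class {
    my ($char) = @_;
    return $char =~ /\A[\p{Alphabetic}\p{GC=Number}]\z/ ? 1 : 0;
}

sub matches_word {
    my ($char) = @_;
    # /u forces Unicode rules even for code points below U+0100.
    return $char =~ /\A\w\z/u ? 1 : 0;
}

my @samples = (
    [ "A",        'LATIN CAPITAL LETTER A' ],
    [ "\x{00B2}", 'SUPERSCRIPT TWO' ],
    [ "\x{00BD}", 'VULGAR FRACTION ONE HALF' ],
    [ "_",        'LOW LINE' ],
    [ "\x{0301}", 'COMBINING ACUTE ACCENT' ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    printf "U+%04X %-26s alnum class: %d  \\w: %d\n",
        ord($char), $name, matches_alnum_class($char), matches_word($char);
}
```

The superscript and the fraction should land in the alphanumeric class only, while the underscore and the combining accent should land in \w only.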
Unlike most regex systems, Perl complies with Requirement 1.2a, “Compatibility Properties”, from UTS #18 on Unicode Regular Expressions. So, assuming you have Unicode strings, a \w in a regex matches any single code point that has any of the following four properties:
1. \p{Alphabetic}
2. \p{GC=Mark}
3. \p{GC=Connector_Punctuation}
4. \p{GC=Decimal_Number}
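As a rough check of that list, the following Perl sketch reports which of the four properties a character has and compares that with what \w says. The helper word_properties_of and the sample set are assumptions of mine for illustration; Alphabetic is spelled here as the plain binary property, since it is not a General_Category value.

```perl
use 5.014;
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# The four properties from the list above, in order.
my @word_properties = (
    [ 'Alphabetic',            qr/\p{Alphabetic}/ ],
    [ 'Mark',                  qr/\p{GC=Mark}/ ],
    [ 'Connector_Punctuation', qr/\p{GC=Connector_Punctuation}/ ],
    [ 'Decimal_Number',        qr/\p{GC=Decimal_Number}/ ],
);

# Hypothetical helper: names of the listed properties that $char has.
sub word_properties_of {
    my ($char) = @_;
    return map { $_->[0] } grep { $char =~ $_->[1] } @word_properties;
}

my @samples = (
    [ "A",        'LATIN CAPITAL LETTER A' ],
    [ "\x{0301}", 'COMBINING ACUTE ACCENT' ],
    [ "_",        'LOW LINE' ],
    [ "\x{0663}", 'ARABIC-INDIC DIGIT THREE' ],
    [ "\x{00B2}", 'SUPERSCRIPT TWO' ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    my @props = word_properties_of($char);
    printf "U+%04X %-26s \\w=%s via %s\n",
        ord($char), $name,
        ($char =~ /\A\w\z/u ? 'yes' : 'no'),
        (@props ? join(', ', @props) : 'none');
}
```

For these samples, \w should say yes exactly when at least one property is reported.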
Number 4 above can be expressed in any of these ways, which are all considered equivalent:
\p{Digit}
\p{General_Category=Decimal_Number}
\p{GC=Decimal_Number}
\p{Decimal_Number}
\p{Nd}
\p{Numeric_Type=Decimal}
\p{Nt=De}
Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point U+00B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number}, or \p{No}. It does, however, have the \p{Numeric_Value=2} property, as you would imagine.
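The difference is easy to probe from Perl itself. This sketch compares SUPERSCRIPT TWO with the ordinary ASCII digit 2; the helper numeric_profile is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: which of the numeric properties discussed above
# does a single character have? Returns a hash of property => 0 or 1.
sub numeric_profile {
    my ($char) = @_;
    return (
        'Digit'              => ($char =~ /\p{Digit}/              ? 1 : 0),
        'Numeric_Type=Digit' => ($char =~ /\p{Numeric_Type=Digit}/ ? 1 : 0),
        'Other_Number'       => ($char =~ /\p{Other_Number}/       ? 1 : 0),
        'Numeric_Value=2'    => ($char =~ /\p{Numeric_Value=2}/    ? 1 : 0),
    );
}

for my $char ("\x{00B2}", "2") {
    my %profile = numeric_profile($char);
    printf "U+%04X  %s\n", ord($char),
        join('  ', map { "$_: $profile{$_}" } sort keys %profile);
}
```

Both characters carry the numeric value 2, but only the ASCII digit is a \p{Digit}, and only the superscript has \p{Numeric_Type=Digit}.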
It’s really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.
Alphabetics include much more than that, largely because of the \p{Other_Alphabetic} property, which takes in some but not all \p{GC=Mark}. \p{Alphabetic} also includes all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}) and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}).
That’s how it pulls in \p{GC=Letter_Number} like Roman numerals and also all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
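A short Perl sketch can make the gap between \p{Alphabetic} and \pL visible. The helper classify and the five sample code points are my own picks for illustration, not part of the original explanation.

```perl
use strict;
use warnings;

# Hypothetical helper: a few yes/no property checks for one character.
sub classify {
    my ($char) = @_;
    return {
        alphabetic => ($char =~ /\p{Alphabetic}/ ? 1 : 0),
        letter     => ($char =~ /\pL/            ? 1 : 0),
        mark       => ($char =~ /\p{GC=Mark}/    ? 1 : 0),
        uppercase  => ($char =~ /\p{Uppercase}/  ? 1 : 0),
        gc_lu      => ($char =~ /\p{GC=Lu}/      ? 1 : 0),
    };
}

my @samples = (
    [ "A",        'LATIN CAPITAL LETTER A' ],
    [ "\x{2163}", 'ROMAN NUMERAL FOUR' ],
    [ "\x{24B6}", 'CIRCLED LATIN CAPITAL LETTER A' ],
    [ "\x{0345}", 'COMBINING GREEK YPOGEGRAMMENI' ],
    [ "\x{0301}", 'COMBINING ACUTE ACCENT' ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    my $info = classify($char);
    printf "U+%04X %-32s %s\n", ord($char), $name,
        join(' ', map { "$_=$info->{$_}" } sort keys %$info);
}
```

The Roman numeral and the circled letter should come out Alphabetic and Uppercase without being \pL or \p{GC=Lu}, and the two combining marks should show that only some marks are Alphabetic.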
Aren’t you glad we get to use \w? :)
\w will only match ASCII word characters. – Boundary