How to check which language supports which Support Level in Unicode Regular Expressions?

The various levels of Unicode Regular Expression support are described in UTS#18.

Is there a way to have a few tests for every requirement, so that it is possible to port the tests to the language in question, run them, and gather the results?

Do other Unicode documents also have a notion of support levels, e.g. for String implementations/libraries?

Graft answered 19/8, 2011 at 18:0 Comment(1)
Did you see regular-expressions.info/refunicode.html for a comparison? Hard to beat Perl there.Banket

For the record, both ICU4C and Perl support UTS#18 Level 1 along with several important Level 2 features. These include named characters with \N{...}, graphemes with \X, full properties like \p{East_Asian_Width=Full_Width}, and in ICU's case, also fancier default word boundaries via a tweaked \b. All three of those Level-2 regex features significantly ease using regexes on Unicode, and without them, you have to do unpleasant things at best, and at worst cannot do them at all.
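As a rough sketch of the kind of portable test the question asks about, the three features named above can each be probed by trying to compile and match a tiny pattern. This is only an illustration in Python, not an official UTS#18 conformance suite: the `PROBES` table, `probe`, and `run_probes` are hypothetical names, and the property spelling is copied from the answer text, so a given engine may want a different alias.

```python
import re

# Hypothetical probe table: feature -> (pattern, sample text that should match).
PROBES = {
    # Named character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    "named_character": (r"\N{LATIN SMALL LETTER E WITH ACUTE}", "\u00e9"),
    # Grapheme cluster: "e" + COMBINING ACUTE ACCENT should be one \X.
    "grapheme_cluster": (r"^\X$", "e\u0301"),
    # Full property syntax, spelled as in the answer; U+FF21 is a fullwidth "A".
    "property": (r"\p{East_Asian_Width=Full_Width}", "\uff21"),
}


def probe(pattern, text, engine=re):
    """Return True if the engine compiles the pattern and finds a match."""
    try:
        return engine.search(pattern, text) is not None
    except engine.error:
        # The engine rejected the syntax, so the feature counts as unsupported.
        return False


def run_probes(engine=re):
    """Run every probe against the given regex engine module."""
    return {name: probe(pat, text, engine) for name, (pat, text) in PROBES.items()}


if __name__ == "__main__":
    for name, ok in run_probes().items():
        print(("PASS" if ok else "FAIL"), name)
```

The `engine` parameter is there so the same table can be run against another module that exposes `search` and `error`, for instance a third-party regex library, assuming it is installed. Porting to another language would mean translating the table and the try/match/report loop, and a real test set would need many more cases per UTS#18 requirement than one pattern each.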

Perl and ICU4C are somewhat different though, in that Perl supports full string-based casefolding while ICU supports only simple char-based casefolding. Perl also has quite a few non-Unicode regex extensions that ICU doesn't support, such as lookarounds and named groups in your regexes, which are both really quite useful.
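To make the full versus simple casefolding distinction concrete, here is a small Python illustration (Python stands in for the engines discussed; it is not Perl or ICU code). Full casefolding is string-based, so one character may fold to several, while a character-by-character caseless match cannot bridge a length difference. The helper names are made up for this sketch.

```python
import re


def full_casefold_equal(a, b):
    """Compare two strings using full, string-based case folding."""
    return a.casefold() == b.casefold()


def regex_caseless_equal(a, b):
    """Match `a` literally against all of `b` using re.IGNORECASE."""
    return re.fullmatch(re.escape(a), b, re.IGNORECASE) is not None


word = "stra\u00dfe"  # "strasse" written with U+00DF LATIN SMALL LETTER SHARP S

# Full folding maps the sharp s to "ss", so the two spellings compare equal.
print(full_casefold_equal(word, "STRASSE"))

# The stdlib regex engine compares character by character, so a
# six-character pattern cannot match a seven-character subject.
print(regex_caseless_equal(word, "STRASSE"))
```

A regex engine with full casefolding would be expected to treat the two spellings as a caseless match; whether a particular engine does so is exactly the kind of thing a ported test would reveal.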

Perl also allows user-defined/custom properties and named characters, which are useful for lots of things, including private use area (PUA) code points, since you can now define your own names and properties for whatever PUA characters you fancy using. (For example, for scripts scheduled for inclusion in Unicode, such as those in the unofficial ConScript registry.)

Java did not support even UTS#18 Level 1 until the very recently released JDK7, and even then only minimally. With Java 6 or earlier, all kinds of little stuff is wrong or missing. All in all, Java's Unicode support in the JDK is very weak: you should use ICU4J's UCharacter etc. classes, not the OraSun classes, for any serious Unicode work, or you will go nuts. Truly.

But beyond those few, nothing else comes even close. You can sometimes limp along in Python or Ruby if you are careful and don't need to do too much: e.g., no sorting or searching, virtually no Unicode character properties, not even proper word boundaries, etc.

People trying to do really anything at all with Unicode in JavaScript or PHP should just quit before they start. It is too painful, because you cannot manipulate Unicode in any useful or realistic fashion without access to character properties and probably to graphemes.

There are also cross-language Unicode issues of casemapping and casefolding, normalization, linebreaking, and collation, all of which vary between languages. You need access to most if not all of them for Unicode work. Not having full property support is a real problem with almost all languages, because character properties are the foundation on which many algorithms depend.
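As a small illustration of two of the items above, normalization and character properties, here is what access to them looks like through Python's standard `unicodedata` module. This is only a sketch of the idea; it says nothing about linebreaking or collation, which that module does not cover, and `nfc_equal` is a hypothetical helper.

```python
import unicodedata

composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ code point by code point, but are canonically equivalent.
print(composed == decomposed)
print(nfc_equal(composed, decomposed))

# Character properties: the general category of the combining accent,
# and the East Asian Width of a fullwidth letter.
print(unicodedata.category("\u0301"))
print(unicodedata.east_asian_width("\uff21"))
```

Without a normalization step, the two spellings of the same visible character would be treated as different strings, which is why algorithms built on properties and normalization matter for any comparison or search.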

I talk about most of this in my Unicode Support Shootout talk.

The Bottom Line

The bottom line is that as of this writing, if you can't either use ICU regexes or Perl itself (but not PCRE), or perhaps also Matthew Barnett's regex library in Python, then you're basically screwed with Unicode regular expressions. Nobody else currently takes regexes and/or Unicode seriously enough even though Unicode is 20 years old.

This has severe implications for "webbish" languages like JavaScript and PHP, because no usable alternatives are available, and you must therefore offload any of the real work to a different server-side language because the webbish languages can't handle Unicode in any reasonable fashion. There is nothing at all that works client-side, which is a severe burden.

Also, note that getting ICU regexes through Java requires rolling your own JNI (or using those from Android) to get at ICU4C: there are no ICU4J bindings for the ICU regexes.

Alejoa answered 20/8, 2011 at 4:9 Comment(3)
Hey, you are the one from "Unicode: Good, bad, & Ugly"? Thank you for your enlightening presentation (is there a video of the talks somewhere?) and your hard work! I pretty much wrote this question after reading your slides again.Graft
@soc: Yeah, that's me. I think there is video of it; they filmed it at least. Look around on the O'Reilly conference site.Alejoa
OK, thanks. I created a new question about Unicode and language design here: #7131523, if you're interested.Graft

I imagine there are existing tests somewhere that validate support levels.
In Perl, for example, there are fairly extensive docs for Unicode support in regexes
and for Unicode support in the language in general.

Example Perl regex support level docs:
http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

Unicode is so complex, though, that test cases would most likely come from the language implementers.

Intertype answered 19/8, 2011 at 19:37 Comment(0)
