Unicode equivalents for \w and \b in Java regular expressions?

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig.

Unfortunately, Java doesn't. In Java, \w is limited to [A-Za-z0-9_]. This makes matching words like those mentioned above difficult, among other problems.
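A minimal sketch of the limitation described above, assuming no flags are passed to Pattern.compile. The class and method names (AsciiWordDemo, findAll) are illustrative only. The accented letters are written as Unicode escapes so the source file encoding does not affect the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiWordDemo {
    // Collect every match of the given regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The word "eleve" with an acute and a grave accent (precomposed letters).
        String word = "\u00E9l\u00E8ve";
        // Without flags, the accented letters are not word characters,
        // so the word is split around them.
        System.out.println(findAll("\\w+", word)); // prints [l, ve]
    }
}
```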

It also appears that the \b word separator matches in places where it shouldn't.

What would be the correct equivalent of a .NET-like, Unicode-aware \w or \b in Java? Which other shortcuts need "rewriting" to make them Unicode-aware?

Phineas answered 29/11, 2010 at 15:0 Comment(3)
The short story, Tim, is that they all need writing to bring them into line with Unicode. I still see no signs that Java 1.7 will do anything more with Unicode properties than finally adding support for scripts, but that’s it. There are some things you really cannot do without better access to the full complement of Unicode properties. If you don’t yet have my uniprops and unichars scripts (and uninames), they’re stunning eye-openers into all this.Outfield
One might consider adding marks to the word class. Since for example ä can be represented in Unicode either as \u0061\u0308 or \u00E4.Wherein
Hey Tim, check out my UPDATE. They have added a flag to make it all work. Hurray!Outfield
O
246

Source code

The source code for the rewriting functions I discuss below is available here.

Update in Java 7

Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again. It is also available as an embeddable (?U) inside the pattern, so you can use it with the String class’s wrappers, too. It sports corrected definitions for various other properties as well. It now tracks The Unicode Standard, in both RL1.2 and RL1.2a from UTS#18: Unicode Regular Expressions. This is an exciting and dramatic improvement, and the development team is to be commended for this important effort.
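As a rough illustration of the flag (Java 7 or later), the sketch below checks the sample words from the question both ways: once with Pattern.UNICODE_CHARACTER_CLASS passed to Pattern.compile, and once with the embedded (?U) through String.matches. The class name UnicodeFlagDemo and the helper isWord are made up for this example.

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Flag passed explicitly to Pattern.compile.
    static final Pattern WORD =
        Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Sample words from the question, with non-ASCII letters as escapes.
        String[] samples = {
            "hello", "\u00E9l\u00E8ve", "GO\u00C4_432", "gefr\u00E4\u00DFig"
        };
        for (String s : samples) {
            // Second check uses the embedded form of the same flag.
            System.out.println(s + "  flag=" + isWord(s)
                + "  embedded=" + s.matches("(?U)\\w+")
                + "  default=" + s.matches("\\w+"));
        }
    }
}
```

The embedded form is the convenient one when the pattern is only ever handed to String.matches, String.split, or String.replaceAll, since those wrappers take no flags argument.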


Java’s Regex Unicode Problems

The problem with Java regexes is that the Perl 1.0 charclass escapes — meaning \w, \b, \s, \d and their complements — are not in Java extended to work with Unicode. Alone amongst these, \b enjoys certain extended semantics, but these map neither to \w, nor to Unicode identifiers, nor to Unicode line-break properties.

Additionally, the POSIX properties in Java are accessed this way:

POSIX syntax    Java syntax

[[:Lower:]]     \p{Lower}
[[:Upper:]]     \p{Upper}
[[:ASCII:]]     \p{ASCII}
[[:Alpha:]]     \p{Alpha}
[[:Digit:]]     \p{Digit}
[[:Alnum:]]     \p{Alnum}
[[:Punct:]]     \p{Punct}
[[:Graph:]]     \p{Graph}
[[:Print:]]     \p{Print}
[[:Blank:]]     \p{Blank}
[[:Cntrl:]]     \p{Cntrl}
[[:XDigit:]]    \p{XDigit}
[[:Space:]]     \p{Space}

This is a real mess, because it means that things like Alpha, Lower, and Space do not in Java map to the Unicode Alphabetic, Lowercase, or Whitespace properties. This is exceedingly annoying. Java’s Unicode property support is strictly antemillennial, by which I mean it supports no Unicode property that has come out in the last decade.

Not being able to talk about whitespace properly is super-annoying. Consider the following table. For each of the four code points in the header row, there is both a J-results column for Java and a P-results column for Perl or any other PCRE-based regex engine (1 means a match, 0 means no match, and - means the property is not available in that engine):

             Regex    001A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
                \s    1  1    0  1    0  1    0  1
               \pZ    0  0    0  0    1  1    1  1
            \p{Zs}    0  0    0  0    1  1    0  0
         \p{Space}    1  1    0  1    0  1    0  1
         \p{Blank}    0  0    0  0    0  1    0  0
    \p{Whitespace}    -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
 \p{javaSpaceChar}    0  -    0  -    1  -    1  -

See that?

Virtually every one of those Java white space results is   ̲w̲r̲o̲n̲g̲  according to Unicode. It’s a really big problem. Java is just messed up, giving answers that are “wrong” according to existing practice and also according to Unicode. Plus Java doesn’t even give you access to the real Unicode properties! In fact, Java does not support any property that corresponds to Unicode whitespace.
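Some of the Java columns above can be reproduced with a short program. This is only a sketch: it covers three of the four code points, the class name WhitespaceTable and the helper matches are invented for the example, and the (?U) column assumes Java 7 or later, where the embedded flag described in the update section is available.

```java
import java.util.regex.Pattern;

class WhitespaceTable {
    // True if the regex matches the single code point as a whole string.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        // NEXT LINE, NO-BREAK SPACE, PARAGRAPH SEPARATOR
        int[] codePoints = {0x0085, 0x00A0, 0x2029};
        for (int cp : codePoints) {
            System.out.printf(
                "U+%04X  \\s=%b  (?U)\\s=%b  javaWhitespace=%b  javaSpaceChar=%b%n",
                cp,
                matches("\\s", cp),          // default, ASCII-only definition
                matches("(?U)\\s", cp),      // Unicode-aware definition (Java 7+)
                Character.isWhitespace(cp),  // what \p{javaWhitespace} tests
                Character.isSpaceChar(cp));  // what \p{javaSpaceChar} tests
        }
    }
}
```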


The Solution to All Those Problems, and More

To deal with this and many other related problems, yesterday I wrote a Java function that takes a pattern string and rewrites these 14 charclass escapes:

\w \W \s \S \v \V \h \H \d \D \b \B \X \R

by replacing them with things that actually work to match Unicode in a predictable and consistent fashion. It’s only an alpha prototype from a single hack session, but it is completely functional.

The short story is that my code rewrites those 14 as follows:

\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]

\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\b => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\B => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

\d => \p{Nd}
\D => \P{Nd}

\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

\X => (?>\PM\pM*)
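To see one of these replacements in action, here is a small sketch that pastes the \w replacement from the table into a pattern by hand and uses it to pull words out of a string. It is not the rewriting function itself; the class name UnicodeWordClass and the method words are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordClass {
    // The replacement for the word-character escape, copied from the table.
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    static final Pattern WORDS = Pattern.compile(WORD + "+");

    // Return every maximal run of word characters in the text.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The sample words from the question, non-ASCII letters as escapes.
        String text = "hello \u00E9l\u00E8ve GO\u00C4_432 gefr\u00E4\u00DFig";
        for (String w : words(text)) {
            System.out.println(w);
        }
    }
}
```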

Some things to consider...

  • That uses for its \X definition what Unicode now refers to as a legacy grapheme cluster, not an extended grapheme cluster, as the latter is rather more complicated. Perl itself now uses the fancier version, but the old version is still perfectly workable for the most common situations. EDIT: See addendum at bottom.

  • What to do about \d depends on your intent, but the default is the Unicode definition. I can see people not always wanting \p{Nd}, but sometimes either [0-9] or \pN.

  • The two boundary definitions, \b and \B, are specifically written to use the \w definition.

  • That \w definition is overly broad, because it grabs the parenthesized letters, not just the circled ones. The Unicode Other_Alphabetic property isn’t available until JDK7, so that’s the best you can do.
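Two of the points above, the choice of digit class and the legacy grapheme cluster, can be tried out with a short sketch. The class name DigitAndClusterDemo and its helpers are hypothetical; the patterns themselves are the ones from the rewrite table.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitAndClusterDemo {
    // Legacy grapheme cluster from the rewrite table: one non-mark plus any marks.
    static final Pattern LEGACY_CLUSTER = Pattern.compile("(?>\\PM\\pM*)");

    // True if the regex matches the single code point as a whole string.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    static int countClusters(String text) {
        Matcher m = LEGACY_CLUSTER.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // ARABIC-INDIC DIGIT THREE (Nd), ROMAN NUMERAL FOUR (Nl), VULGAR FRACTION ONE HALF (No)
        int[] samples = {0x0663, 0x2163, 0x00BD};
        for (int cp : samples) {
            System.out.printf("U+%04X  [0-9]=%b  Nd=%b  N=%b%n",
                cp, matches("[0-9]", cp), matches("\\p{Nd}", cp), matches("\\pN", cp));
        }
        // Decomposed spelling: base letters followed by combining accents.
        String decomposed = "e\u0301le\u0300ve";
        System.out.println(decomposed.length() + " chars, "
            + countClusters(decomposed) + " clusters");
    }
}
```

The three digit classes nest: [0-9] is the narrowest, \p{Nd} adds decimal digits from other scripts, and \pN additionally takes in letter-like numerals and fractions.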


Exploring Boundaries

Boundaries have been a problem ever since Larry Wall first coined the \b and \B syntax for talking about them for Perl 1.0 back in 1987. The key to understanding how \b and \B both work is to dispel two pervasive myths about them. The facts are:

  1. They are only ever looking for \w word characters, never for non-word characters.
  2. They do not specifically look for the edge of the string.

A \b boundary means:

    IF does follow word
        THEN doesn't precede word
    ELSIF doesn't follow word
        THEN does precede word

And those are all defined perfectly straightforwardly as:

  • follows word is (?<=\w).
  • precedes word is (?=\w).
  • doesn’t follow word is (?<!\w).
  • doesn’t precede word is (?!\w).

Therefore, since IF-THEN is encoded as an and-ed-together AB in regexes, an or is X|Y, and because the and is higher in precedence than the or, that is simply AB|CD. So every \b that means a boundary can be safely replaced with:

    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

with the \w defined in the appropriate way.
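The substitution can be sketched in Java by building the boundary from whatever word class you choose. Below, the word class is the \w replacement from the rewrite table; the class name UnicodeBoundary and the helper boundary are illustrative names, not part of the posted source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeBoundary {
    // Word class taken from the rewrite table earlier in this answer.
    static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // AB|CD form: (follows word AND not precedes word) OR
    //             (not follows word AND precedes word).
    static String boundary(String w) {
        return "(?:(?<=" + w + ")(?!" + w + ")|(?<!" + w + ")(?=" + w + "))";
    }

    // Return the first whole word in the text, or null if there is none.
    static String firstWord(String text) {
        String b = boundary(W);
        Matcher m = Pattern.compile(b + "(" + W + "+)" + b).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        Pattern beforeEquals = Pattern.compile(boundary(W) + "=");
        // Roman numeral four, circled small x, combining acute accent.
        String[] samples = {"\u2163=", "\u24E7=", "\u0301="};
        for (String s : samples) {
            System.out.println(beforeEquals.matcher(s).find());
        }
        System.out.println(firstWord("\u00E9l\u00E8ve"));
    }
}
```

Because the boundary and the word class are built from the same string, they cannot disagree about what counts as a word character.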

(You might think it strange that the A and C components are opposites. In a perfect world, you should be able to write that AB|D, but for a while I was chasing down mutual exclusion contradictions in Unicode properties — which I think I’ve taken care of, but I left the double condition in the boundary just in case. Plus this makes it more extensible if you get extra ideas later.)

For the \B non-boundaries, the logic is:

    IF does follow word
        THEN does precede word
    ELSIF doesn't follow word
        THEN doesn't precede word

Allowing all instances of \B to be replaced with:

    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

This really is how \b and \B behave. Equivalent patterns for them are

  • \b using the ((IF)THEN|ELSE) construct is (?(?<=\w)(?!\w)|(?=\w))
  • \B using the ((IF)THEN|ELSE) construct is (?(?=\w)(?<=\w)|(?<!\w))

But the versions with just AB|CD are fine, especially if you lack conditional patterns in your regex language — like Java. ☹
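A much smaller version of that kind of check can be sketched as follows. It only compares Java's native \b and \B against the AB|CD forms on a piece of plain ASCII text, where the native escapes and \w agree on what a word character is; it says nothing about non-ASCII input. The class name BoundaryEquivalence and the helper positions are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryEquivalence {
    // AB|CD expansions of the boundary and the non-boundary.
    static final String BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";
    static final String NOT_BOUNDARY = "(?:(?<=\\w)(?=\\w)|(?<!\\w)(?!\\w))";

    // Offsets at which the (zero-width) regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "foo bar_baz, 42!";
        System.out.println("native \\b : " + positions("\\b", text));
        System.out.println("expanded  : " + positions(BOUNDARY, text));
        System.out.println("native \\B : " + positions("\\B", text));
        System.out.println("expanded  : " + positions(NOT_BOUNDARY, text));
    }
}
```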

I’ve already verified the behaviour of the boundaries using all three equivalent definitions with a test suite that checks 110,385,408 matches per run, and which I've run on a dozen different data configurations according to:

     0 ..     7F    the ASCII range
    80 ..     FF    the non-ASCII Latin1 range
   100 ..   FFFF    the non-Latin1 BMP (Basic Multilingual Plane) range
 10000 .. 10FFFF    the non-BMP portion of Unicode (the "astral" planes)

However, people often want a different sort of boundary. They want something that is whitespace and edge-of-string aware:

  • left edge as (?:(?<=^)|(?<=\s))
  • right edge as (?=$|\s)
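The difference shows up with tokens that start or end in non-word characters. The sketch below, with the made-up class name TokenEdges, looks for the literal token C++ using the two edge patterns above and, for comparison, using \b on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenEdges {
    // Edge patterns as given in the list above.
    static final String LEFT_EDGE = "(?:(?<=^)|(?<=\\s))";
    static final String RIGHT_EDGE = "(?=$|\\s)";

    // Start offsets of every match of the regex in the text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "I like C++ and C++11";
        String token = Pattern.quote("C++");
        // Whitespace-and-edge version: finds the standalone token only.
        System.out.println(starts(LEFT_EDGE + token + RIGHT_EDGE, text));
        // Word-boundary version: misses the standalone token, because "+"
        // followed by a space is not a word boundary, and instead matches
        // inside "C++11", where "+" is followed by a digit.
        System.out.println(starts("\\b" + token + "\\b", text));
    }
}
```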

Fixing Java with Java

The code I posted in my other answer provides this and quite a few other conveniences. This includes definitions for natural-language words, dashes, hyphens, and apostrophes, plus a bit more.

It also allows you to specify Unicode characters in logical code points, not in idiotic UTF-16 surrogates. It’s hard to overstress how important that is! And that’s just for the string expansion.

For regex charclass substitution that makes the charclasses in your Java regexes finally work on Unicode, and work correctly, grab the full source from here. You may do with it as you please, of course. If you make fixes to it, I’d love to hear of it, but you don’t have to. It’s pretty short. The guts of the main regex rewriting function are simple:

switch (code_point) {

    case 'b':  newstr.append(boundary);
               break; /* switch */
    case 'B':  newstr.append(not_boundary);
               break; /* switch */

    case 'd':  newstr.append(digits_charclass);
               break; /* switch */
    case 'D':  newstr.append(not_digits_charclass);
               break; /* switch */

    case 'h':  newstr.append(horizontal_whitespace_charclass);
               break; /* switch */
    case 'H':  newstr.append(not_horizontal_whitespace_charclass);
               break; /* switch */

    case 'v':  newstr.append(vertical_whitespace_charclass);
               break; /* switch */
    case 'V':  newstr.append(not_vertical_whitespace_charclass);
               break; /* switch */

    case 'R':  newstr.append(linebreak);
               break; /* switch */

    case 's':  newstr.append(whitespace_charclass);
               break; /* switch */
    case 'S':  newstr.append(not_whitespace_charclass);
               break; /* switch */

    case 'w':  newstr.append(identifier_charclass);
               break; /* switch */
    case 'W':  newstr.append(not_identifier_charclass);
               break; /* switch */

    case 'X':  newstr.append(legacy_grapheme_cluster);
               break; /* switch */

    default:   newstr.append('\\');
               newstr.append(Character.toChars(code_point));
               break; /* switch */

}
saw_backslash = false;
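For readers who want to see how a switch like that fits into a complete loop, here is a self-contained sketch. It is not the posted source: the class name CharclassRewriter, the method rewrite, and the lookup table are assumptions made for illustration, and it handles only \w, \d, \D, \b and \B. It also ignores \Q...\E quoting and does not check whether an escape sits inside a character class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharclassRewriter {
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Escape letter -> replacement text (a subset of the full table).
    static final Map<Integer, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put((int) 'w', WORD);
        REPLACEMENTS.put((int) 'd', "\\p{Nd}");
        REPLACEMENTS.put((int) 'D', "\\P{Nd}");
        REPLACEMENTS.put((int) 'b', "(?:(?<=" + WORD + ")(?!" + WORD + ")"
            + "|(?<!" + WORD + ")(?=" + WORD + "))");
        REPLACEMENTS.put((int) 'B', "(?:(?<=" + WORD + ")(?=" + WORD + ")"
            + "|(?<!" + WORD + ")(?!" + WORD + "))");
    }

    static String rewrite(String pattern) {
        StringBuilder out = new StringBuilder();
        boolean sawBackslash = false;
        // Walk by code point so astral characters are never split.
        for (int i = 0; i < pattern.length(); ) {
            int cp = pattern.codePointAt(i);
            i += Character.charCount(cp);
            if (!sawBackslash) {
                if (cp == '\\') {
                    sawBackslash = true;   // decide what to emit on the next code point
                } else {
                    out.appendCodePoint(cp);
                }
                continue;
            }
            String replacement = REPLACEMENTS.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                // Unknown escape (including an escaped backslash): pass through.
                out.append('\\').appendCodePoint(cp);
            }
            sawBackslash = false;
        }
        if (sawBackslash) {
            out.append('\\');              // keep a trailing lone backslash as-is
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(rewrite("\\b(\\w+)\\b"));
        Matcher m = p.matcher("\u00E9l\u00E8ve");
        System.out.println(m.find() ? m.group(1) : "no match");
    }
}
```

Treating an escaped backslash as an ordinary unknown escape is what keeps a pattern such as a literal backslash followed by w from being rewritten by mistake.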

Anyway, that code is just an alpha release, stuff I hacked up over the weekend. It won’t stay that way.

For the beta I intend to:

  • fold together the code duplication

  • provide a clearer interface regarding unescaping string escapes versus augmenting regex escapes

  • provide some flexibility in the \d expansion, and maybe the \b

  • provide convenience methods that handle turning around and calling Pattern.compile or String.matches or whatnot for you

For production release, it should have javadoc and a JUnit test suite. I may include my gigatester, but it’s not written as JUnit tests.


Addendum

I have good news and bad news.

The good news is that I’ve now got a very close approximation to an extended grapheme cluster to use for an improved \X.

The bad news ☺ is that that pattern is:

(?:(?:\u000D\u000A)|(?:[\u0E40\u0E41\u0E42\u0E43\u0E44\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4\uAAB5\uAAB6\uAAB9\uAABB\uAABC]*(?:[\u1100-\u115F\uA960-\uA97C]+|([\u1100-\u115F\uA960-\uA97C]*((?:[[\u1160-\u11A2\uD7B0-\uD7C6][\uAC00\uAC1C\uAC38]][\u1160-\u11A2\uD7B0-\uD7C6]*|[\uAC01\uAC02\uAC03\uAC04])[\u11A8-\u11F9\uD7CB-\uD7FB]*))|[\u11A8-\u11F9\uD7CB-\uD7FB]+|[^[\p{Zl}\p{Zp}\p{Cc}\p{Cf}&&[^\u000D\u000A\u200C\u200D]]\u000D\u000A])[[\p{Mn}\p{Me}\u200C\u200D\u0488\u0489\u20DD\u20DE\u20DF\u20E0\u20E2\u20E3\u20E4\uA670\uA671\uA672\uFF9E\uFF9F][\p{Mc}\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3]]*)|(?s:.))

which in Java you’d write as:

String extended_grapheme_cluster = "(?:(?:\\u000D\\u000A)|(?:[\\u0E40\\u0E41\\u0E42\\u0E43\\u0E44\\u0EC0\\u0EC1\\u0EC2\\u0EC3\\u0EC4\\uAAB5\\uAAB6\\uAAB9\\uAABB\\uAABC]*(?:[\\u1100-\\u115F\\uA960-\\uA97C]+|([\\u1100-\\u115F\\uA960-\\uA97C]*((?:[[\\u1160-\\u11A2\\uD7B0-\\uD7C6][\\uAC00\\uAC1C\\uAC38]][\\u1160-\\u11A2\\uD7B0-\\uD7C6]*|[\\uAC01\\uAC02\\uAC03\\uAC04])[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]*))|[\\u11A8-\\u11F9\\uD7CB-\\uD7FB]+|[^[\\p{Zl}\\p{Zp}\\p{Cc}\\p{Cf}&&[^\\u000D\\u000A\\u200C\\u200D]]\\u000D\\u000A])[[\\p{Mn}\\p{Me}\\u200C\\u200D\\u0488\\u0489\\u20DD\\u20DE\\u20DF\\u20E0\\u20E2\\u20E3\\u20E4\\uA670\\uA671\\uA672\\uFF9E\\uFF9F][\\p{Mc}\\u0E30\\u0E32\\u0E33\\u0E45\\u0EB0\\u0EB2\\u0EB3]]*)|(?s:.))";

¡Tschüß!

Outfield answered 29/11, 2010 at 19:27 Comment(10)
Christ, that's an enlightened answer. I only don't get the Jon Skeet reference. What has he to do with this?Dressing
@BalusC: It’s a ref to Jon earlier saying he’d let me field the question. But please, don’t drop the t in @tchrist. It might go to my head. :)Outfield
Have you thought about adding this to the OpenJDK?Tidbit
@Martijn: I hadn’t, no; I didn’t know it was that “open”. :) But I have thought about releasing it in a more formal sense; others in my department wish to see that done (with some sort of open-source licence, probably BSD or ASL). I'm probably going to change the API from what it is in this alpha prototype, clean up the code, etc. But it helps us out tremendously, and we figure it will help others, too. I really wish Sun would do something about their library, but Oracle inspires no confidence.Outfield
@Outfield - I've actually recently met a number of Oracle developers on the OpenJDK and they're really community focussed and really happy to see improvements like this added. please do post your answer to the open JDK mailing list, especially now that they're working on Unicode 6.0 support. It's the perfect time to improve Java! It's GPL licensed as well so I hope that's acceptable for your contribution. Ping me an email if you want some introductions etc, I think your work/analysis is too good to remain unnoticed!Tidbit
@tchrist: just pun intended :)Dressing
Exactly what is so addictive about regex? We don't need regex for anything, things that can be solved with regex can be solved just as easily (or even easier) without regex. Regex is considered evil.Phthisic
@tchrist: Nice work! Not sure whether you have released it in the meantime (as mentioned above). If yes, could you point me to where? If not, do you give permission (BSD style license) to use even the (alpha) code in commercial products (of course with the usual disclaimer, like you won't take any responsibility, i.e. it's my fault if it kills my cat :-) )?Offstage
@Outfield Your lengthy \w replacement has a few missing chars when compared to Java 8's (?U)\w, namely Java 8 matches these additional characters: [\u200c\u200d\u24b6-\u249b] (only tested char in Character.MIN_VALUE to Character.MAX_VALUE). You might want to upgrade your pattern above for those stuck on pre-Java 7 (khm, Android).Shel
@Shel Oh interesting. The replacement code was developed as a workaround for Java 7, and it was me talking to the actual Oracle dev in charge of the Java regex stuff about this and related matters that brought them to understand what was absent before this and therefore got so much stuff fixed and enhanced for Java 8 regex support for Unicode.Outfield
W
16

It's really unfortunate that \w doesn't work. The proposed solution \p{Alpha} doesn't work for me either.

It seems [\p{L}] catches all Unicode letters. So the Unicode equivalent of \w should be [\p{L}\p{Digit}_].

Widmer answered 29/11, 2010 at 15:18 Comment(8)
But \w also matches digits and more. I think for just letters, \p{L} would work.Phineas
You're right. \p{L} is enough. Also I thought that only letters were the problem. [\p{L}\p{Digit}_] should catch all alphanumeric characters including underscore.Widmer
@MusicKk: See my answer for a complete solution that allows you to write your patterns normally, but then pass it through a function that corrects for Java’s gaping lacunae so that it works properly on Unicode.Outfield
No, \w is defined by Unicode as being much broader than just \pL and the ASCII digits, of all silly things. You must write [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]] if you want a Unicode-aware \w for Java — or you can just use my unicode_charclass function from here. Sorry!Outfield
@Tim, yes, for letters \pL does work (you don’t need to embrace one-letter props). However, you seldom want that, because you have to be rather careful that your match doesn’t get different answers just because your data is in Unicode Normalization Form D (a.k.a. NFD, meaning canonical decomposition) versus being in NFC (NFD followed by canonical composition). An example is that code point U+E9 ("é") is a \pL in NFC form, but its NFD form becomes U+65.301, so matches \pL\pM. You can kinda get around this with \X: (?:(?=\pL)\X), but you’ll need my version of that for Java. :(Outfield
@tchrist: You don't need to advertise your answer, it's hard to miss. I just spent five minutes on my answer, not five hours. ;)Widmer
@MusiKk: I’m really just trying to help. The reason this is a big deal for me is that I work in a biomedical text-mining group, where most of our text is in UTF-8 and most of our regexes are in Java. Before I joined the group, they just couldn’t get anything to work right, even though their Perl prototypes always did the right thing before they were translated into Java. You’re right that I’ve spent some time trying to figure this whole thing out. :(Outfield
@tchrist: Your help is much appreciated! I only skimmed it over but will read your answer completely when I find the time.Widmer
N
7

In Java, \w and \d are not Unicode-aware; they only match the ASCII characters, [A-Za-z0-9_] and [0-9]. The same goes for \p{Alpha} and friends (the POSIX "character classes" they're based on are supposed to be locale-sensitive, but in Java they've only ever matched ASCII characters). If you want to match Unicode "word characters" you have to spell it out, e.g. [\pL\p{Mn}\p{Nd}\p{Pc}], for letters, non-spacing modifiers (accents), decimal digits, and connecting punctuation.

However, Java's \b is Unicode-savvy; it uses Character.isLetterOrDigit(ch) and checks for accented letters as well, but the only "connecting punctuation" character it recognizes is the underscore. EDIT: when I try your sample code, it prints "" and élève" as it should (see it on ideone.com).

Nicholson answered 29/11, 2010 at 16:43 Comment(11)
I’m sorry, Alan, but you really cannot say that Java’s \b is Unicode‐savvy. It makes tons and tons of mistakes. "\u2163=", "\u24e7=", and "\u0301=" all fail to match the pattern "\\b=" in Java, but are supposed to, as perl -le 'print /\b=/ || 0 for "\x{2163}=", "\x{24e7}=", "\x{301}="' reveals. However, if (and only if) you swap in my version of a word boundary instead of the native \b in Java, then those all work in Java, too.Outfield
@tchrist: I wasn't commenting on \b's correctness, just pointing out that it operates on Unicode characters (as implemented in Java), not just ASCII like \w and friends. However, it does work correctly with respect to \u0301 when that character is paired with a base character, as in e\u0301=. And I'm not convinced that Java is wrong in this instance. How can a combining mark be considered a word character unless it's part of a grapheme cluster with a letter?Nicholson
@Alan, this is something that was cleared up when Unicode clarified grapheme clusters by discussing extended vs legacy grapheme clusters. The old definition of a grapheme cluster, wherein \X stands for a non-mark followed by any number of marks, is problematic, because you should be able to describe all files as matching /^(\X*\R)*\R?$/, but you can’t if you have a \pM at the start of the file, or even of a line. So they’ve eXtended it to always match at least one character. It always did, but now it makes the above pattern work. […continued…]Outfield
@Alan, as for the positionality of the boundaries, you cannot say that a mark is an alphabetic if it’s on an alphabetic but not if it isn’t. That isn’t how they defined things in UTS#18.Outfield
@Alan, it does more harm than good that Java’s native \b is partially Unicode-aware. Consider matching the string "élève" against the pattern \b(\w+)\b. See the problem?Outfield
@tchrist: Yes, without the word boundaries, \w+ finds two matches: l and ve, which is bad enough. But with the word boundaries it finds nothing, because \b recognizes é and è as word characters. At the very minimum, \b and \w should agree on what's a word character and what isn't.Nicholson
@Alan: That’s how I feel too. You’re always welcome to use my alpha code; it at least does work, although it needs an internal reorg for beta. I just added a better version of \X, too. I sure do wish the Java people were more serious about this stuff.Outfield
@tchrist: As I understand it, they defined \w and friends in terms of ASCII for performance reasons, figuring we could always use Unicode properties like \pL if we needed to. Short-sighted I know, but now we're stuck with it for backward-compatibility reasons. Maybe we could add a /u modifier that forces everything to be Unicode-correct.Nicholson
@Alan: While there is (?u), that only does — or pretends to do — Unicode casing in conjunction with /i. It actually does not do this correctly. I cannot believe anyone really uses this stuff, it is all so massively broken. :(Outfield
@Alan: As for \w and performance: if it doesn’t have to be correct, I can make it infinitely fast. :) Apparently that was the approach selected here.Outfield
@tchrist: I guess it's obvious that I don't use (?u), since I forgot Java even had it. What surprises me is how much you can still accomplish without proper Unicode support. That can't go on much longer, though.Nicholson
