regular expression containing unicode words

Asked 12/4, 2011 at 21:14 Answered 2/6, 2015 at 8:56

I'd like to match all strings containing a certain word, like:

String regex = (?:\P{L}|\W|^)(ベスパ)(?:\b|$)

however, the Pattern class doesn't compile it:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 39
(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)

I already pass UNICODE_CASE as a compile flag, so I'm not sure what is going wrong here:

final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE| Pattern.CANON_EQ);

Thanks for help! :)

Dichromic answered 12/4, 2011 at 21:14 Comment(3)

The pattern in your error message does contain two extra ) - is the error message or your post wrong? – Darkroom 12/4, 2011 at 21:22

You must not use \W, \w, \s, \d, \b, \p{alpha}, nor any of the other character-class shortcuts in Java regexs, because the Java regex library is non-compliant with the formal requirements of Unicode regular expressions. You can simulate \w with [\pL\pM\p{Nd}\p{Nl}\p{Pc}] and \W with [^\pL\pM\p{Nd}\p{Nl}\p{Pc}] if you don’t care about the Enclosed_Alphanumerics. Or you can use a regex library or language that complies with The Unicode Standard. That means calling the ICU regex library, or calling Perl’s, etc. – Novgorod 12/4, 2011 at 21:31

Did you compile with javac -encoding UTF-8? – Novgorod 12/4, 2011 at 22:6

From the error message given, which looks nothing at all like the String regex shown, I infer that the original pattern was essentially as follows, which I have taken the liberty to reformat, add symbolic constants to, and preface with line numbers that we might inspect and address it more easily.

(All non-trivial patterns should always be written in (?x) mode — even though Java fights against you here, you should still do it.)

  1     (?: \P{L} | \W | ^ )
  2     (
  3         (?: \N{KATAKANA LETTER BE} \N{KATAKANA LETTER SU}
  4           | \N{KATAKANA LETTER BE} \N{KATAKANA LETTER SU}
  5           | \N{KATAKANA LETTER HE} \N{KATAKANA LETTER ZU}
  6         )
  7         (?: \N{KATAKANA LETTER PA} )
  8     |
  9             \N{KATAKANA LETTER PA}
 10     )
 11 |
 12             \N{KATAKANA LETTER HA}
 13     )
 14     \N{COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK}
 15     )
 16     (?: \b | $ )

The first and last lines are wrong, but they are wrong in a semantic way related to Java’s broken regexes. They are not syntactically wrong.

As should now be apparent, the syntactic issue is that the close parentheses at lines 13 and 15 are spurious: they have no corresponding open parentheses.

The first and last lines notwithstanding, I am still trying to understand what it is you are truly trying to do here. Why the duplication of lines 3 and 4? That doesn’t do anything useful. And I can see no reason for the grouping at line 7.

Is the intent to allow the combining mark to apply to any of the preceding things?

As for the errors in the first and last lines, do I understand that a simple word boundary is all that you are looking for? Do you actually mean to include those boundary characters there as part of your match, or are you just trying to establish boundaries? Why are you saying a non-letter or a non-word?

Word characters do include letters, you know; at least, according to the Unicode spec they do, even if Java gets this wrong. Alas, because of the Java regex bug, your \W alternative has just let in a bunch of letters, so we will have to recode this once I understand what you really want.

If only you used something that was actually compliant with UTS#18, it would work ok, but as I presume you haven’t (I heard no mention of ICU), we’ll have to fix it along the lines I have previously outlined.

A lookbehind for either a non-word or the start of string would work for the first one, and a lookahead for either a non-word or the end of string would work for the last one. That is what \b is of course supposed to do when facing word characters as you have here, and it might even work out that way provided you stay clear of your non-word particle.
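The lookaround idea in the previous paragraph could be sketched as follows. This is only a minimal illustration, assuming the goal is to find ベスパ as a whole word; the class name `UnicodeWordMatch` and the helper `containsWord` are hypothetical, and the word-character class is the approximation of \w suggested in the comments on the question. The katakana is written with Unicode escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

final class UnicodeWordMatch {
    // Approximation of a Unicode-aware word character, taken from the comment thread:
    // letters, marks, decimal digits, letter numbers, connector punctuation.
    static final String WORD = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}]";

    // The word must not be preceded or followed by a word character.
    // The lookarounds assert the boundary without consuming any text.
    static final Pattern VESPA = Pattern.compile(
            "(?<!" + WORD + ")" + "\u30D9\u30B9\u30D1" + "(?!" + WORD + ")");

    // Hypothetical helper: true if the text contains the word on its own.
    static boolean containsWord(String text) {
        return VESPA.matcher(text).find();
    }

    public static void main(String[] args) {
        // Surrounded by spaces: counts as a whole word.
        System.out.println(containsWord("a \u30D9\u30B9\u30D1 b"));   // true
        // Followed by another katakana letter: not a whole word.
        System.out.println(containsWord("\u30D9\u30B9\u30D1\u30D1")); // false
    }
}
```

Because the boundaries are lookarounds, the match itself is just the word, and start and end of string need no separate ^ or $ alternative.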

But until I can see more of the original intent, I don’t think I should say more.

Novgorod answered 13/4, 2011 at 0:40 Comment(0)

(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)
(            )((              )(   )   )   )  )(      )

The pattern in your error message has two extra ')'
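The same check can be done mechanically. The following is a rough sketch, not part of java.util.regex: `ParenCheck` and `unmatchedClosers` are hypothetical names, and the scanner is deliberately naive (it skips backslash escapes and simple [...] classes only). The pattern is transcribed from the error message with Unicode escapes, so the indices refer to this transcription.

```java
import java.util.ArrayList;
import java.util.List;

final class ParenCheck {
    // Returns the indices of ')' characters that have no matching '('.
    // Naive on purpose: it skips backslash-escaped characters and the contents
    // of simple [...] classes, but ignores \Q...\E quoting and nested classes.
    static List<Integer> unmatchedClosers(String regex) {
        List<Integer> unmatched = new ArrayList<>();
        int depth = 0;
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (inClass) {
                if (c == ']') inClass = false;
            } else if (c == '[') {
                inClass = true;
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth == 0) unmatched.add(i); else depth--;
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // The pattern from the error message, katakana written as escapes.
        String fromError = "(?:\\P{L}|\\W|^)((?:\u30D9\u30B9|\u30D9\u30B9|\u30D8\u30BA)"
                + "(?:\u30D1)|\u30D1)|\u30CF)\u309A)(?:\\b|$)";
        System.out.println(unmatchedClosers(fromError)); // [37, 39]
    }
}
```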

Darkroom answered 12/4, 2011 at 21:25 Comment(7)

Yes, but why does he get that error message? There are no unmatched parentheses in his original expression. – Papain 12/4, 2011 at 21:30

Um, no. That \W is going to ruin your day. – Novgorod 12/4, 2011 at 21:34

@aioobe: Good question. We cannot know because he has not posted the exact Java code that initializes his String regex variable. – Novgorod 12/4, 2011 at 21:37

Well, I would assume it is String regex = "(?:\\P{L}|\W|^)(ベスパ)(?:\\b|$)";. – Papain 12/4, 2011 at 21:40

@aioobe: Or perhaps he didn't post the correct code but copy-pasted the error. – Darkroom 12/4, 2011 at 22:28

@Erik, I get the same error as he posted, by using the regex string I just gave (plus an extra \ in front of \W which I forgot). – Papain 13/4, 2011 at 5:38

The mismatch error is thrown by Pattern.compile; my original regex is syntactically correct. :( – Dichromic 20/4, 2011 at 7:42

Unicode characters in regular expressions are a tricky business.

Here is a paragraph from the documentation of Pattern:

Unicode support

This class follows Unicode Technical Report #18: Unicode Regular Expression Guidelines, implementing its second level of support though with a slightly different concrete syntax.

Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.

Thus, since we know:

ベ = \u30D9
ス = \u30B9
パ = \u30D1

the proper way to write the pattern you're after is:

String regex = "(?:\\P{L}|\\W|^)(\\u30D9\\u30B9\\u30D1)(?:\\b|$)";
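To see the two kinds of escape side by side, here is a small sketch. The class name `EscapeDemo` and its constants are hypothetical, and the boundary groups from the pattern above are left out so that only the escape handling is shown.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // The Java compiler decodes these escapes, so this is the three-character word.
    static final String WORD = "\u30D9\u30B9\u30D1";

    // Double backslash: the string keeps the escape text and the regex parser decodes it.
    static final Pattern BY_PARSER = Pattern.compile("\\u30D9\\u30B9\\u30D1");

    // Single backslash: the compiler decodes it, so the regex parser sees raw katakana.
    static final Pattern BY_COMPILER = Pattern.compile("\u30D9\u30B9\u30D1");

    public static void main(String[] args) {
        System.out.println(BY_PARSER.matcher(WORD).matches());   // true
        System.out.println(BY_COMPILER.matcher(WORD).matches()); // true
        // The two pattern strings differ in length even though they match the same text.
        System.out.println(BY_PARSER.pattern().length());        // 18
        System.out.println(BY_COMPILER.pattern().length());      // 3
    }
}
```

Writing the escapes keeps the source file pure ASCII, which sidesteps any question about the encoding the compiler assumes.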


Papain answered 12/4, 2011 at 21:42 Comment(3)

No, I am sorry, but that document LIES. Believe me, it does. Java isn’t even Level-1 compliant, let alone Level-2 the way it claims. I’ve been working with the JDK7 people, and they now understand how badly it lies. You must not use those things. Honest. All the RL1.2a things are busted in Java; it supplies only 3 out of the required 11 properties for RL1.2; it can’t even do RL1.1 right. There are lots of serious things wrong with it. It does not even come close to providing Level 1 support. – Novgorod 12/4, 2011 at 22:6

Lol, don't you have anything better to do, than sit around waiting for a regexp question to pop up which you can complain about? You show up at just about every regexp question, explaining how broken Java regular expressions are. Why don't you just keep quiet unless you actually know the answer to the question? – Papain 12/4, 2011 at 22:8

He does know the answer. More to the point, he knows that any answer that doesn't mention how badly broken Java's regex support is, is wrong. And he isn't just complaining, he's explained many times how to correctly match Unicode with Java's regex classes. But it's a lot of information, and he can't be expected to post it all every time. – Anglice 12/4, 2011 at 22:35

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

Try:

(?U)(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)

But fix your parentheses first, as I don't know what you want inside or outside the middle group.
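As a rough illustration of what the flag changes, the sketch below compares \w with and without it on a single katakana letter. It assumes Java 7 or later, where Pattern.UNICODE_CHARACTER_CLASS and (?U) exist; the class name `UnicodeClassDemo` and its constants are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    // KATAKANA LETTER BE, written as an escape to avoid source-encoding issues.
    static final String BE = "\u30D9";

    // Default mode: \w covers only ASCII letters, digits and underscore.
    static final Pattern ASCII_W = Pattern.compile("\\w");

    // Embedded flag: the predefined classes follow the Unicode definitions.
    static final Pattern UNICODE_W = Pattern.compile("(?U)\\w");

    // Same effect, requested through the compile flag instead of (?U).
    static final Pattern UNICODE_W_FLAG =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        System.out.println(ASCII_W.matcher(BE).matches());        // false
        System.out.println(UNICODE_W.matcher(BE).matches());      // true
        System.out.println(UNICODE_W_FLAG.matcher(BE).matches()); // true
    }
}
```

With the flag set, \W no longer matches katakana letters, which is what the first group of the pattern appears to rely on.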

Fa answered 2/6, 2015 at 8:56 Comment(0)
