Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should match any of the Unicode dashes or the ASCII dash when it is surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

To make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)

It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?

Crippen answered 15/6, 2010 at 13:22 Comment(2)
From the docs: "the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014." That is, you can remove the double \\ in your expression. Spitball
@aioobe: What an enormous coincidence that the Java docs have used exactly the one character as an example that this question is about. Or did you modify the quote? Hydrazine

You're mixing decimal (8211) and hexadecimal (0x8211).

\x and \u both expect a hexadecimal number. The dash in your sample is decimal 8211, which is hexadecimal 2013, so you'd need \u2013 to match that en-dash rather than \u8211 (and likewise \u2014 for the em-dash, \x2D for the normal hyphen, etc.).
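As a rough sketch of that fix, here are the five code points from the question rewritten in hexadecimal (45 = 2D, 8211 = 2013, 8212 = 2014, 8213 = 2015, 8214 = 2016). The class and method names are made up for illustration, and a non-capturing group replaces the original capturing one since the group is never read back:

```java
import java.util.regex.Pattern;

public class HexDashSplit {
    // Same separators as in the question, but with hexadecimal escapes:
    // U+002D hyphen-minus, U+2013 en dash, U+2014 em dash,
    // U+2015 horizontal bar, U+2016 double vertical line.
    static final Pattern SEPARATOR =
            Pattern.compile("\\s(?:\\x2D|\\u2013|\\u2014|\\u2015|\\u2016)\\s");

    // Hypothetical helper: split a title on a whitespace-delimited dash.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // Decimal 8211 is hexadecimal 2013.
        System.out.println(Integer.toHexString(8211));

        // The sample input from the question, with an en dash as separator.
        String title = "Study Summary (1 of 10) \u2013 Competition";
        for (String part : splitTitle(title)) {
            System.out.println(part);
        }
    }
}
```

With the sample input, splitTitle should return the two segments "Study Summary (1 of 10)" and "Competition".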

But why not simply use the Unicode property "Dash punctuation"?

As a Java string: "\\s\\p{Pd}\\s"
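A minimal sketch of how that pattern could be used with Pattern.split(), assuming the input from the question. The class and method names are hypothetical:

```java
import java.util.regex.Pattern;

public class DashPunctuationSplit {
    // \p{Pd} matches any character in the Unicode general category
    // "Dash punctuation", which covers the ASCII hyphen-minus as well as
    // the en dash and em dash.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Hypothetical helper: split a title on a whitespace-delimited dash.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        String title = "Study Summary (1 of 10) \u2013 Competition";
        for (String part : splitTitle(title)) {
            System.out.println(part);
        }
    }
}
```

Because the dash must have whitespace on both sides, a hyphen inside a word such as "well-known" is left alone.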

Hydrazine answered 15/6, 2010 at 13:37 Comment(1)
Alas, Java doesn’t support the Unicode Dash property in its regexes, which includes things like the MINUS SIGN, which is of type Symbol. Fourlegged
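To illustrate the point about the MINUS SIGN (U+2212, general category Sm), here is a small sketch that compares \p{Pd} on its own with a character class that adds U+2212 by hand. The names are hypothetical, and the class only covers the one extra character mentioned in the comment, not the full Unicode Dash property:

```java
import java.util.regex.Pattern;

public class DashOrMinusSplit {
    // Dash punctuation only: does not include U+2212 MINUS SIGN,
    // because that character is a math symbol (Sm), not punctuation.
    static final Pattern PD_ONLY = Pattern.compile("\\s\\p{Pd}\\s");

    // Dash punctuation plus an explicitly listed MINUS SIGN.
    static final Pattern PD_OR_MINUS = Pattern.compile("\\s[\\p{Pd}\\u2212]\\s");

    // Hypothetical helper: true if the pattern finds a separator in the text.
    static boolean containsSeparator(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String withMinus = "foo \u2212 bar";
        System.out.println(containsSeparator(PD_ONLY, withMinus));
        System.out.println(containsSeparator(PD_OR_MINUS, withMinus));
    }
}
```

If other dash-like symbols turn up in the input, they can be added to the character class in the same way.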
