Regex: what is InCombiningDiacriticalMarks?

The following code is a well-known way to convert accented characters into plain text:

Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

I replaced my hand-made method with this one, but I need to understand the regex part of the replaceAll call.

1) What is "InCombiningDiacriticalMarks" ?
2) Where is it documented (along with similar properties)?

Thanks.

Catechetical answered 17/4, 2011 at 23:26 Comment(1)

See also https://mcmap.net/q/102521/-detect-any-combining-character-in-java apparently there are more "combining marks" in unicode than just the diacritical ones, just as a note. – Timi 18/3, 2015 at 21:42

\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.

What it means is that the code point falls within a particular range, a block, that has been allocated for the things of that name. This is a bad approach, because there is no guarantee that a code point in that range is any particular kind of thing, nor that code points outside that block are not of essentially the same kind.
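To make the "block is just a range" point concrete, here is a minimal sketch using only JDK classes. The class and method names (BlockDemo, stripBlock, inBlock) are made up for illustration; the one-liner itself is the one from the question.

```java
import java.text.Normalizer;

class BlockDemo {
    // The one-liner from the question: decompose to NFD, then delete every
    // code point that lies in the Combining Diacritical Marks block.
    static String stripBlock(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // True if the code point lies in that block (U+0300 through U+036F).
    static boolean inBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS;
    }

    public static void main(String[] args) {
        // Portuguese word with c-cedilla and a-tilde, written with escapes.
        System.out.println(stripBlock("a\u00E7\u00E3o"));
        System.out.println(inBlock(0x0301)); // COMBINING ACUTE ACCENT
        System.out.println(inBlock(0x20D7)); // COMBINING RIGHT ARROW ABOVE
    }
}
```

U+20D7 is a combining mark too, but it lives in a different block, so the block test says nothing about it.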

For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.

Blocks are nearly never what you want.

In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacritical_Marks block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
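A small sketch of the difference, assuming a mark from outside the block (U+20D7, COMBINING RIGHT ARROW ABOVE, which is a Nonspacing_Mark in another block). MarkDemo and its method names are hypothetical.

```java
import java.text.Normalizer;

class MarkDemo {
    private static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Block-based: only removes marks in U+0300 through U+036F.
    static String stripByBlock(String s) {
        return nfd(s).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Category-based: removes every Nonspacing_Mark, whatever its block.
    static String stripByCategory(String s) {
        return nfd(s).replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        String arrowed = "x\u20D7"; // x followed by a combining arrow
        System.out.println(stripByBlock(arrowed).length());    // mark survives
        System.out.println(stripByCategory(arrowed).length()); // mark removed
    }
}
```

For ordinary Latin accents the two versions behave the same; they only diverge on marks that live outside the block.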

That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
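As a rough illustration of the Mn versus Me distinction: the regex below is the [\p{Mn}\p{Me}] class from the paragraph above. The second method is my own addition and not part of the regex engine: java.lang.Character exposes the bidi class through getDirectionality, so that check can be made outside a pattern. Names are hypothetical.

```java
class EnclosingMarkDemo {
    // Removes nonspacing marks and enclosing marks alike.
    static String stripMarks(String s) {
        return s.replaceAll("[\\p{Mn}\\p{Me}]+", "");
    }

    // Bidi class check done with the Character API rather than a regex.
    static boolean isBidiNsm(int codePoint) {
        return Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_NONSPACING_MARK;
    }

    public static void main(String[] args) {
        String circled = "A\u20DD"; // A + COMBINING ENCLOSING CIRCLE (Me)
        System.out.println(circled.replaceAll("\\p{Mn}+", "").length());
        System.out.println(stripMarks(circled).length());
        System.out.println(isBidiNsm(0x20DD));
    }
}
```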

You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short form \p{IsLatin}, to get at any character from the Latin script. This leads to the very commonly needed [\p{IsLatin}\p{IsCommon}\p{IsInherited}].
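A hedged sketch of the script-based class on Java 7 or later. Note that java.util.regex.Pattern spells the short form with an Is prefix (\p{IsLatin}), alongside the script= and sc= keyword forms. ScriptDemo and isLatinText are illustrative names only.

```java
import java.util.regex.Pattern;

class ScriptDemo {
    // Latin letters, plus Common (digits, spaces, punctuation) and
    // Inherited (combining marks that take the script of their base).
    static final Pattern LATIN_TEXT =
            Pattern.compile("[\\p{IsLatin}\\p{IsCommon}\\p{IsInherited}]+");

    static boolean isLatinText(String s) {
        return LATIN_TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatinText("cafe\u0301, 42!")); // decomposed e-acute
        System.out.println(isLatinText("\u03B1\u03B2"));    // Greek alpha, beta
        // Keyword form of the same property.
        System.out.println(Pattern.matches("\\p{script=Latin}", "e"));
    }
}
```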

Be aware that this will not remove what you might think of as “accent” marks from all characters! There are many characters for which it does nothing. For example, you cannot convert Đ to D or ø to o that way, because those letters have no canonical decomposition into a base letter plus a combining mark. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
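The limitation is easy to see in a short sketch (the deaccent helper is a hypothetical name; it is the normalize-and-strip approach with \p{Mn}):

```java
import java.text.Normalizer;

class NoDecompositionDemo {
    // Normalize to NFD, then drop nonspacing marks.
    static String deaccent(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) decomposes, so the accent can be stripped.
        System.out.println(deaccent("\u00E9").equals("e"));
        // U+0110 (D with stroke) and U+00F8 (o with stroke) do not
        // decompose, so they pass through unchanged.
        System.out.println(deaccent("\u0110\u00F8").equals("\u0110\u00F8"));
    }
}
```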

Another place where the \p{Mn} thing fails is of course enclosing marks like \p{Me}, obviously, but also there are \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.

Oh wait, I see you are Portuguese. You should have no problems at all then if you only are dealing with Portuguese text.

However, you don’t really want to remove accents, I bet, but rather you want to be able to match things “accent-insensitively”, right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
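Here is a minimal sketch of a primary-strength comparison. To keep it dependency-free it uses the JDK's java.text.Collator instead of ICU4J, which is a substitution on my part; ICU4J's com.ibm.icu.text.Collator offers the same getInstance/setStrength pattern. The class name, method name, and choice of Locale.ENGLISH are assumptions for illustration.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDemo {
    // Compares two strings while ignoring case and accent differences,
    // leaving the data itself untouched.
    static boolean sameLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        // Treat precomposed and decomposed forms alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameLetters("caf\u00E9", "cafe")); // accent ignored
        System.out.println(sameLetters("Caf\u00E9", "CAFE")); // case ignored too
        System.out.println(sameLetters("cafe", "cafo"));      // different letter
    }
}
```

Which letters count as "the same" depends on the collator's locale rules, so pick the locale of the text you are matching.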

Phenolphthalein answered 18/4, 2011 at 1:0 Comment(4)

So, I must assume that the method given throughout the web (and even here at SO) is not the recommended one for "de-accenting" a word. I made a straightforward one just for Portuguese, but saw this strange approach (and like you said, it works for my purpose, but so did my previous method!). So, is there a better "well implemented" approach that will cover most scenarios? An example would be very nice. Thanks for your time. – Catechetical 18/4, 2011 at 4:11

@Marcolopes: I’ve been leaving the data intact and using the Unicode Collation Algorithm to do primary-strength comparisons. That way it just compares letters, but ignores both case and accent marks. It also lets things that should be the same letter be the same letter, which removing the accents is just a pale and unsatisfactory approximation to. Plus it’s cleaner not to zap the data if you can work with it in a way that does what you want but doesn’t require that. – Phenolphthalein 19/4, 2011 at 1:6

Pretty good answer. One question though: can I use Normalizer in Java with InCombiningDiacriticalMarks but exclude some characters, such as ü, from being converted to u? – Marrin 24/3, 2014 at 15:18

yeah, I totally understood all of this – Ergener 19/9, 2014 at 20:34

Took me a while, but I fished them all out:

Here's a regex that should include all the zalgo chars, including ones outside the 'normal' range.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62])
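A hedged sketch of how a class like this might be applied in Java. It uses plain hyphens for the ranges, since a character class needs them to denote a range. As an assumption on my part it keeps only the five combining-mark ranges from the start of the class, because some of the wider ranges (U+0E00 to U+0EFF, for instance) also cover ordinary Thai and Lao letters. ZalgoDemo and strip are hypothetical names.

```java
import java.util.regex.Pattern;

class ZalgoDemo {
    // Five combining-mark ranges only; extend with care.
    static final Pattern COMBINING = Pattern.compile(
            "[\\u0300-\\u036F\\u1AB0-\\u1AFF\\u1DC0-\\u1DFF"
            + "\\u20D0-\\u20FF\\uFE20-\\uFE2F]+");

    static String strip(String s) {
        return COMBINING.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // "Zalgo" with several stacked combining marks, written as escapes.
        String noisy = "Z\u0351\u0310\u0346a\u0317l\u1DC0go";
        System.out.println(strip(noisy));
    }
}
```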

Hope this saves you some time.

Departed answered 31/3, 2016 at 10:52 Comment(0)
