What component handles a Combining Diaeresis in a string?

I am working with a list of file names in Java.

I observe that some characters in the file names that look like single characters, such as ä, ö and ü, actually consist of a sequence of two code points, a base letter followed by a mark:

ö is represented by o, ¨

I see this by inspection with codePointAt(). The German name "Rölli" is in fact "Ro¨lli":

...
20: R, 82
21: o, 111
22: ̈, 776
23: l, 108
24: l, 108
25: i, 105
...
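
A listing like the one above can be produced with a loop along these lines. This is only a minimal sketch: the class name CodePointDump and the helper dump() are made up for illustration, and the sample string stands in for the real file name.

```java
class CodePointDump {
    // Builds one "index: character, code point" line per code point.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.append(i).append(": ")
               .appendCodePoint(cp).append(", ").append(cp).append('\n');
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // R, o, COMBINING DIAERESIS (U+0308 = 776), l, l, i
        System.out.print(dump("Ro\u0308lli"));
    }
}
```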

The character ¨ in the log above has the value 776 (U+0308), which is a "Combining Diaeresis". This is a so-called combining mark, or more precisely a combining diacritic, which attaches to the preceding base character to form one grapheme. So it all makes sense, but I do not understand which software component combines the two characters into one umlaut, and where this behavior is specified.

  • It has nothing to do with the fact that encodings for large character sets use several bytes per character internally. Several bytes are not the same as two combining characters.
  • Any simple print() of the string shows me the combined character, so it is not some UI layer on top either.
  • I remember observing this with PHP as well. I guess any modern language can handle this.

What component causes combining characters to be displayed as single combined characters? How reliable is all this?

Does Java have a normalization method that turns such combining sequences into single code points, as in this case? That would help when using regex...

Thanks a lot for any hint.

Ergonomics answered 4/11, 2015 at 10:34 Comment(0)

Answer 1: Specification and responsibility

The equivalence you describe is defined in Unicode Standard Annex #15, Unicode Normalization Forms. It covers the equivalence of combining sequences and single (precomposed) code points, and the decomposition of code points. Many languages other than German rely heavily on composed graphemes.

Java internally represents strings as UTF-16. All its String class does is deliver sequences of UTF-16 code units to other components. It is up to the surrounding software (e.g. any kind of text view component or a terminal) to combine the sequences correctly. You notice this when, for example, a regex breaks your combined ö apart, yet it is shown correctly in some view.
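
The regex effect can be reproduced with a small sketch like the following. The class and constant names are hypothetical; the only assumption about the data is that the name arrives once in decomposed form (o followed by U+0308) and once in precomposed form (U+00F6).

```java
import java.util.regex.Pattern;

class CombiningRegexDemo {
    static final String DECOMPOSED = "Ro\u0308lli";  // o + COMBINING DIAERESIS
    static final String PRECOMPOSED = "R\u00F6lli";  // single code point ö

    // "." matches exactly one code point, so it cannot cover o + U+0308.
    static final Pattern ONE_CHAR = Pattern.compile("R.lli");
    // Allowing trailing marks (\p{M}) after the base character covers both forms.
    static final Pattern WITH_MARKS = Pattern.compile("R.\\p{M}*lli");

    public static void main(String[] args) {
        System.out.println(DECOMPOSED.length());                        // 6
        System.out.println(PRECOMPOSED.length());                       // 5
        System.out.println(ONE_CHAR.matcher(PRECOMPOSED).matches());    // true
        System.out.println(ONE_CHAR.matcher(DECOMPOSED).matches());     // false
        System.out.println(WITH_MARKS.matcher(DECOMPOSED).matches());   // true
    }
}
```

Both strings look identical in a view that renders combining marks, but String.equals() and the first pattern treat them as different.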

By the way, if you experiment with the Combining Diaeresis, be aware that there is also a "non-functional" code 168 (U+00A8), a Latin-1 character (not ASCII, which ends at 127) known as "Spacing Diaeresis". Code 168 does not cause any software to combine two code points into one. For that you need Unicode 776 (U+0308).
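
The difference between the two diaeresis characters can be checked roughly like this. It is a sketch with made-up names (DiaeresisKinds, isCombiningMark, nfc); it relies only on Character.getType() and java.text.Normalizer.

```java
import java.text.Normalizer;

class DiaeresisKinds {
    // True for non-spacing marks such as U+0308 (776).
    static boolean isCombiningMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(isCombiningMark(776)); // true:  COMBINING DIAERESIS
        System.out.println(isCombiningMark(168)); // false: spacing diaeresis
        // Only the combining form composes with the preceding "o".
        System.out.println(nfc("o\u0308").length()); // 1
        System.out.println(nfc("o\u00A8").length()); // 2
    }
}
```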

Answer 2: Java's normalization method

Basically, you should always take combining characters into account, unless you are sure that your data source cannot deliver them. It's a good idea to sanitize your strings first.

Look for Unicode normalization methods in your language: they spare you from fiddling with individual replace() calls, and a lot of experience has gone into them.

Java has a Normalizer class that deals with the different representations of combined characters:

https://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html

and the tutorial for it: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

So after invoking this code line:

String normalized = Normalizer.normalize(someFileName, Normalizer.Form.NFC);

the log print from the question above looks like this:

...
19:  , 32
20: R, 82
21: ö, 246   <<< here were two combined chars before normalize()
22: l, 108
23: l, 108
24: i, 105
...
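
For completeness, here is a slightly fuller sketch of the same call, including the reverse direction (NFD) and a check that skips strings that are already normalized. The class name NormalizeDemo and the helpers toNfc()/toNfd() are illustrative, not part of the Java API; Normalizer.normalize() and Normalizer.isNormalized() are.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Compose: base letter + combining mark becomes one code point where possible.
    static String toNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decompose: precomposed letters become base letter + combining mark.
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = toNfc("Ro\u0308lli");
        System.out.println(composed.length());        // 5
        System.out.println(composed.codePointAt(1));  // 246
        System.out.println(toNfd(composed).length()); // 6
    }
}
```

Normalizing both the file names and any search strings to the same form before comparing or matching avoids mismatches between the two representations.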
Ergonomics answered 4/11, 2015 at 10:34 Comment(1)
I didn't have the same question, but this really helped narrow down the problem with #55929481. (Bello)
