I am working with a list of file names in Java.
I observe that some single characters in the file names, like ä, ö and ü, actually consist of a sequence of two separate characters: ö is represented by o followed by ¨.
I see this by inspecting the string with codePointAt(). The German name "Rölli" is in fact "Ro¨lli":
...
20: R, 82
21: o, 111
22: ̈, 776
23: l, 108
24: l, 108
25: i, 105
...
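A minimal sketch of how such a log can be produced. The class name CodePointDump and the helper dump() are hypothetical names chosen for illustration, and the indices start at 0 here because the sketch uses only the name itself rather than the full file name:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    // Hypothetical helper: returns one "index: character, code point" line
    // per code point of the input string.
    static List<String> dump(String s) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            lines.add(i + ": " + new String(Character.toChars(cp)) + ", " + cp);
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Decomposed spelling: 'o' followed by U+0308 (Combining Diaeresis).
        String name = "Ro\u0308lli";
        dump(name).forEach(System.out::println);
    }
}
```

With this input, the o is reported as 111 and the following mark as 776, matching the log.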
The character ¨ in the log above has the value 776 (U+0308), which is a "Combining Diaeresis". This is a so-called combining mark, more precisely a combining diacritic, which together with the preceding base character forms a single grapheme. So it all makes sense, but I do not understand which software component combines the two characters into one umlaut, and where this behavior is specified.
- It has nothing to do with the fact that multi-byte character encodings use several bytes to represent one character internally. Several bytes for one code point are not the same as two separate code points that combine.
- Any simple print() of the string shows me the combined character, so it is not some UI layer on top either.
- I remember observing this with PHP as well. I guess any modern language can handle this.
What component causes combining characters to be displayed as single combined characters? How reliable is all this?
Does Java have a normalization method that turns such combining sequences into single code points, as in this case? That would help when using regex.
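From what I can tell, java.text.Normalizer may be the candidate here. Below is a minimal sketch of what I would expect it to do, assuming NFC (canonical composition) is the right form for this case. The class name ComposeDemo and the helper compose() are hypothetical names for illustration:

```java
import java.text.Normalizer;

public class ComposeDemo {

    // Hypothetical helper: composes base character + combining mark
    // into a single precomposed code point where one exists (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed spelling: 'o' followed by U+0308 (Combining Diaeresis).
        String decomposed = "Ro\u0308lli";
        String composed = compose(decomposed);

        System.out.println(decomposed.length());      // 6
        System.out.println(composed.length());        // 5
        System.out.println(composed.codePointAt(1));  // 246, i.e. U+00F6

        // A regex written with the precomposed character now matches.
        System.out.println(composed.matches("R\u00F6lli")); // true
    }
}
```

Not every combining sequence has a precomposed equivalent, so this would only collapse sequences for which Unicode defines a single code point.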
Thanks a lot for any hint.