I am working on a C project that needs to generate "case insensitive" normalized forms of pieces of Unicode text. I have chosen to define the normalized form as that achieved by first converting to normalization form NFD, then applying the Unicode case folding algorithm, and finally converting the result to Unicode normalization form NFC.
I am relying on ICU's C API for its Unicode representation and utility functions, and it was fairly straightforward to implement my scheme using ICU's unorm_normalize() and u_strFoldCase() functions. But one of my tests is failing, and I don't understand why. ICU seems to be generating a different NFC form than I expected.
The input sequence consists of these BMP code points:
U+0020, U+1EA5, U+0328, U+1EC4, U+031C
Via a debugger, I determined that ICU and I agree about the intermediate result after case folding:
U+0020 U+0061 U+0328 U+0302 U+0301 U+0065 U+031C U+0302 U+0303
Note in particular that the earlier conversion to form NFD moved character U+031C into the middle of the decomposition of U+1EC4, as appropriate based on the relative canonical combining class (CCC) values of the characters involved: U+031C has CCC 220, whereas U+0302 and U+0303 both have CCC 230. That's part of what I'm trying to test.
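To show the reordering I mean, here is a small self-contained sketch of canonical ordering that does not use ICU. The function names ccc_of and canonical_reorder are hypothetical, and the CCC table is hard-coded for only the combining marks that appear in this question; every other code point is treated as a starter (CCC 0), which is an assumption that holds for this test data but not in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CCC values for the marks in this question only (illustrative table).
 * Anything not listed is assumed to be a starter with CCC 0. */
int ccc_of(uint32_t cp)
{
    switch (cp) {
    case 0x0328: return 202;   /* COMBINING OGONEK */
    case 0x031C: return 220;   /* COMBINING LEFT HALF RING BELOW */
    case 0x0301:               /* COMBINING ACUTE ACCENT */
    case 0x0302:               /* COMBINING CIRCUMFLEX ACCENT */
    case 0x0303: return 230;   /* COMBINING TILDE */
    default:     return 0;
    }
}

/* Canonical ordering: within each run of non-starters, sort by CCC.
 * A stable insertion sort keeps marks with equal CCC in their original
 * relative order, and a mark never moves across a starter. */
void canonical_reorder(uint32_t *s, size_t len)
{
    assert(s != NULL || len == 0);

    for (size_t i = 1; i < len; i++) {
        uint32_t cp = s[i];
        int c = ccc_of(cp);
        size_t j = i;

        if (c == 0) continue;  /* starters stay where they are */

        /* Shift preceding marks with a strictly higher CCC to the right. */
        while (j > 0 && ccc_of(s[j - 1]) > c) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = cp;
    }
}
```

Applied to the full decomposition of U+1EC4 followed by U+031C, that is U+0045 U+0302 U+0303 U+031C, this moves U+031C directly after the base letter, giving U+0045 U+031C U+0302 U+0303.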
Now the good part: according to ICU, the NFC normalization of the folded character sequence is
U+0020 U+0105 U+0302 U+0301 U+1EC5 U+031C
whereas I think it should be
U+0020 U+0105 U+0302 U+0301 U+0065 U+031C U+0302 U+0303
because the three trailing combining characters (U+031C U+0302 U+0303) are already in canonical order, and there is no canonical composition of U+0065 and U+031C.
So, two questions:
- Which is the correct NFC form?
- If ICU is correct then why?