I've been using "Unicode strings" in Windows for as long as I've known about Unicode (i.e. since graduating). However, it has always mystified me that the Win32 API uses the term "Unicode" very loosely. In particular, the "Unicode" variant that MSDN refers to is UTF-16 (the "wide char" terminology dates from when it was UCS-2, which covers only the Basic Multilingual Plane). Yet the API documentation makes almost no mention of Unicode normalization.
MSDN has a few pages about Unicode, Unicode normalization forms, and the functions that convert between normalization forms. The page on normalization even says:
"Win32 and the .NET Framework support all four normalization forms."
However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.
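For context, here is a small sketch of what the four forms do to the same text. It uses Python's standard `unicodedata` module as a stand-in, not the Win32 `NormalizeString` function; the sample string and the `code_points` helper are made up for illustration. The forms NFC, NFD, NFKC and NFKD correspond to `NormalizationC`, `NormalizationD`, `NormalizationKC` and `NormalizationKD` in the Win32 `NORM_FORM` enumeration.

```python
import unicodedata

# Sample text (illustrative): precomposed "é" (U+00E9) followed by
# the "fi" ligature (U+FB01), which has a compatibility decomposition.
sample = "\u00e9\ufb01"


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Apply each of the four Unicode normalization forms to the same input.
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, value in forms.items():
    print(name, code_points(value))
```

The canonical forms (NFC, NFD) only compose or decompose the accent, while the compatibility forms (NFKC, NFKD) also replace the ligature with plain "f" and "i". All four results render much the same on screen but compare unequal as code point sequences, which is why the default matters.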
Question 1: what normalization form is used by default for user input (such as an Edit control) and for conversion through MultiByteToWideChar()?
Question 2: must the strings passed to Win32 API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?
MultiByteToWideChar will just give you the same code point sequence that you fed it, just in a different encoding. I guess that also answers Q2. – Almond
MultiByteToWideChar() says that it just maps whatever input directly. From the remarks section: "Consider calling NormalizeString after converting with MultiByteToWideChar. NormalizeString provides more accurate, standard, and consistent data, and can also be faster." – Overstreet
MultiByteToWideChar allows converting to UTF-16 using CP_ACP, which may contain some non-ASCII characters that have multiple code point sequences and normalization forms (e.g. é). – Overstreet
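The point made in these comments, that an encoding conversion keeps the code point sequence and leaves normalization as a separate step, can be sketched as follows. This is an illustration only: Python's codecs stand in for MultiByteToWideChar, `unicodedata.normalize` stands in for NormalizeString, and Windows-1252 is assumed as the ANSI code page (the real CP_ACP depends on system settings).

```python
import unicodedata

# "é" in a legacy code page (Windows-1252 assumed): the single byte 0xE9.
legacy_bytes = b"\xe9"
# "é" as UTF-8 for "e" + COMBINING ACUTE ACCENT (U+0301), i.e. decomposed.
decomposed_utf8 = "e\u0301".encode("utf-8")

# Decoding only changes the encoding; each input keeps its own code points.
from_legacy = legacy_bytes.decode("cp1252")   # one code point: U+00E9
from_utf8 = decomposed_utf8.decode("utf-8")   # two code points: U+0065 U+0301

# Re-encoding as UTF-16 (little-endian) still carries both code points.
utf16_bytes = from_utf8.encode("utf-16-le")

# The two strings look identical but differ until they are normalized.
same_before = from_legacy == from_utf8
same_after = (
    unicodedata.normalize("NFC", from_legacy)
    == unicodedata.normalize("NFC", from_utf8)
)

print(same_before, same_after)
```

In this sketch the code page input happens to arrive precomposed while the UTF-8 input stays decomposed, so a consistent form is only obtained by normalizing explicitly after the conversion.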