Unicode Normalization in Windows
I've been using "Unicode strings" in Windows for about as long as I've known about Unicode (i.e. since graduating). However, it has always mystified me that the Win32 API uses the term "Unicode" very loosely. In particular, the "Unicode" variant referred to by MSDN is UTF-16 (although the "wide char" terminology dates from when it was UCS-2, which cannot represent all of Unicode). However, the documentation makes almost no mention of Unicode normalization.

MSDN has a few pages about Unicode and Unicode normalization forms, as well as functions to change the normalization form. The page on normalization even says:

Win32 and the .NET Framework support all four normalization forms.

However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.

Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?

Question 2: must strings passed to Win32 API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?

Drabeck answered 12/8, 2011 at 13:49 Comment(3)
I think your Q1 is conflating two unrelated ideas: the conversion functions only convert between different binary representations of the same logical string of Unicode code points (e.g. UTF-8 and UTF-16). Normalization, however, is a higher-level concept involving only the logical sequence of code points. The two have nothing to do with one another. In particular, MultiByteToWideChar will just give you the same code point sequence that you fed it, just in a different encoding. I guess that also answers Q2.Almond
Indeed, the documentation for MultiByteToWideChar() says that it just maps whatever input directly. From the remarks section: "Consider calling NormalizeString after converting with MultiByteToWideChar. NormalizeString provides more accurate, standard, and consistent data, and can also be faster."Overstreet
@KerrekSB: Sorry to revive this really old thread, but I stumbled onto it again today and re-read your comment. The thing is, you assume a UTF-8 to UTF-16 conversion, but MultiByteToWideChar also allows converting to UTF-16 from CP_ACP, which may contain non-ASCII characters that have multiple code point representations and normalization forms (e.g. é).Overstreet
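The point made in these comments can be demonstrated portably. Below is a sketch in Python (standing in for MultiByteToWideChar and NormalizeString, which it is not): decoding from a legacy code page maps bytes to code points one-for-one without normalizing, and normalization is a separate, explicit step.

```python
import unicodedata

# In Windows code page 1258 (Vietnamese), 0xE2 is "â" and 0xCC is the
# COMBINING GRAVE ACCENT. Decoding maps each byte to its code point,
# exactly as MultiByteToWideChar would; no normalization happens.
decoded = b"\xe2\xcc".decode("cp1258")
assert decoded == "\u00e2\u0300"   # â followed by combining grave: not NFC

# Normalization is a separate step (NormalizeString in Win32).
nfc = unicodedata.normalize("NFC", decoded)
assert nfc == "\u1ea7"             # U+1EA7: ầ as a single precomposed code point
```

This is why the remarks section quoted above suggests calling NormalizeString *after* converting: the conversion itself never changes the code point sequence.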
From the MSDN article Using Unicode Normalization to Represent Strings.

Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
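Because data imported from elsewhere can arrive in other forms, it is worth checking rather than assuming form C. A minimal sketch in Python (the Win32 equivalents would be IsNormalizedString and NormalizeString):

```python
import unicodedata

s = "caf\u00e9"            # "café" with a precomposed é, as Windows input typically produces
assert unicodedata.is_normalized("NFC", s)

web_input = "cafe\u0301"   # "café" with a combining acute, e.g. pasted from the web
assert not unicodedata.is_normalized("NFC", web_input)

# Normalizing makes the two representations compare equal.
assert unicodedata.normalize("NFC", web_input) == s
```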

Update: I've included some specific details relating to Question #2.

Regarding the file system, normalization is not required, per the article Naming Files, Paths, and Namespaces:

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.

Regarding SQL Server, no normalization is required, nor is data normalized when saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside of indexes; I cannot find specific details on what that is. A SQL Server 2005 article states the same:

One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

Dreiser answered 13/8, 2011 at 5:21 Comment(2)
I'm accepting this answer because it refers to official documentation. However, the conclusion from all the answers is that strings returned by system functions are usually in form C, but there's no real guarantee that this is the case. If a specific normalization form is required, all strings should be normalized manually.Overstreet
The quote about normalization in path and file name strings seems to have been removed from the quoted web page.Wilie
what normalization form is used by default for user input

Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.

Keyboard layouts tend towards NFC because, in the pre-Unicode days, they would usually have output a single-byte character in the local code page for each keypress. However, there are exceptions.

For example, using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (e.g. circumflex: â) and some are typed as a combining diacritical (e.g. the combining grave accent, U+0300). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which is 0xE2,0xCC in Vietnamese code page 1258 and comes out as U+00E2,U+0300 in Unicode.

This isn't in normal form C (which would be the single code point U+1EA7, Latin small letter A with circumflex and grave) nor in form D (which would be ầ fully decomposed as U+0061,U+0302,U+0300).
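The three representations above can be checked mechanically. A quick illustration in Python:

```python
import unicodedata

typed = "\u00e2\u0300"   # what the Vietnamese layout produces: â + combining grave
nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

assert nfc == "\u1ea7"          # form C: the single precomposed code point ầ
assert nfd == "a\u0302\u0300"   # form D: base letter plus two combining marks
assert typed not in (nfc, nfd)  # the typed form is neither NFC nor NFD
```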

There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.

are the kernel and file system normalization-agnostic?

Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have files with the names ầ.txt, ầ.txt and ầ.txt in the same folder.
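That behaviour is easy to verify at the string level. A sketch in Python (file creation omitted; the point is that the three names are distinct code point sequences):

```python
import unicodedata

names = [
    "\u1ea7.txt",         # NFC: single precomposed code point
    "\u00e2\u0300.txt",   # mixed: â + combining grave
    "a\u0302\u0300.txt",  # NFD: fully decomposed
]

# All three render identically as "ầ.txt", yet are distinct strings,
# so a normalization-ignorant filesystem treats them as three files.
assert len(set(names)) == 3

# Normalizing collapses them to a single name.
assert len({unicodedata.normalize("NFC", n) for n in names}) == 1
```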

Chalybite answered 13/8, 2011 at 13:13 Comment(5)
Regarding your last point: I was wondering whether the kernel and file system still distinguish between the same string in two normalization forms, i.e. whether they prevent you from having two files with the "same" name "à" in NFC and NFD. That's what I meant by "normalization-agnostic": handling all Unicode forms equally well.Overstreet
Perhaps “normalisation-ignorant” would be a clearer way of putting it: to Windows they're just a bunch of code points. The only ‘clever’ thing it tries to do is match them case-insensitively. This is tricky enough as it is given that case folding rules have changed in different Unicode revisions!Chalybite
It's been a while since I asked this question, and I was re-reading your post. Perhaps what I should have asked as the second question is: "is the kernel Unicode-smart?" For example, if you request a file with a name in NFD, will it match if the file was created with a path in NFC (or a mixture, or whatever)?Overstreet
@André: no, indeed, ‘smart’ it is not. An NFC and NFD string are different at the string handling level in general, and specifically so in the NTFS filesystem. So, yeah, having a user manually type a filepath to match can be a pain. But at least when you read the filename back from the filesystem, you get it in the same form you put in... that isn't the case on OS X (HFS+/UFS), which forces everything to NFD, causing nasty interop problems.Chalybite
Indeed, SVN has (had?) a nasty problem with NFC vs. NFD storage. I think reading about that issue is what triggered this question in the first place. Reading the file name back in the original encoding is a nice property, but it's orthogonal to proper comparison between strings in different normalization forms. I'm sure ICU has a string comparison function that does not require both strings to be in the same normalization form.Overstreet
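Such a normalization-insensitive comparison is straightforward to sketch: normalize both operands to the same form before comparing. A minimal version in Python:

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence, regardless of
    which normalization form (if any) each happens to be in."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

assert canonical_equal("\u1ea7", "a\u0302\u0300")   # ầ: NFC vs. NFD
assert not canonical_equal("\u1ea7", "a")
```

Note that `canonical_equal` is an illustrative helper, not a standard API; a production implementation would use a library routine that also handles caseless or locale-aware matching where needed.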
First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:

But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...

Magritte answered 12/8, 2011 at 15:36 Comment(2)
Nice find. Although Michael is a developer at Microsoft, this passage is rather unofficial, to say the least. Any idea if this is documented somewhere official?Overstreet
@André Caron While Michael Kaplan's blog isn't necessarily official, it includes some of the best information regarding Unicode / Internationalization on Windows. Every one of my Unicode questions over the past few years has invariably led to his blog.Dreiser
