When to use Unicode Normalization Forms NFC and NFD?
The Unicode Normalization FAQ includes the following paragraph:

Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.

and continues...

The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.

My questions are:

What makes NFC best for "general text"? What defines "internal processing", and why is it best left to NFD? And finally, never mind what is "best": are the two forms interchangeable as long as two strings are compared using the same normalization form?

Coverlet answered 13/4, 2013 at 8:37 Comment(1)
«NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.» are somewhat bogus statements. While legacy strings may arrive in a form that, when converted to Unicode, is already NFC, for future maintenance (code always ends up being used in unforeseen conditions) you'll be better off doing the conversion to NF[CD] explicitly.Ensue

The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.

In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.

For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.

There are also risks that you take by not normalizing. For example, the letter "ä" can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters, U+0061 LATIN SMALL LETTER A followed by U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter, then a test for data containing "ä" that uses only the precomposed form will not detect it. In many cases, though, you don't do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
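The "ä" mismatch described above is easy to reproduce. A minimal sketch using Python's standard-library `unicodedata` module (the escape sequences spell out the two representations from the paragraph):

```python
import unicodedata

precomposed = "\u00e4"  # "ä" as the single code point U+00E4
decomposed = "a\u0308"  # "a" followed by U+0308 COMBINING DIAERESIS

# A naive substring test misses the decomposed spelling:
print(precomposed in decomposed)  # False

# Normalizing both sides to the same form fixes the comparison:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

The same fix works in the other direction by normalizing both strings to NFD instead; what matters is that both sides use the same form.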

It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.

Cool answered 13/4, 2013 at 11:40 Comment(5)
One place where U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS would be the way to express "ä" would be Mac OS X filenames, which require a specific version of NFD.Tulley
@Tulley is that documented somewhere?Dora
@Keith4G: There should be questions about it on SO. Let me have a look for you. I'm not a Mac guy but years ago did some stuff to read Mac partitions for fun and ran into this.Tulley
Technical Note TN1150 / HFS Plus Volume Format /Unicode SubtletiesTulley
I was having trouble looking for specific information about OS X normalization. ThanksDora
  1. NFC is the common-sense form that you should use in general: "ä" is a single code point there, which matches what most people expect.

  2. NFD is good for certain internal processing: if you want to make accent-insensitive searches or sorting, having your string in NFD makes it much easier and faster. Another use is making more robust slugs. These are just the most obvious ones; I am sure there are plenty more uses.

  3. If two strings x and y are canonical equivalents, then
    toNFC(x) = toNFC(y)
    toNFD(x) = toNFD(y)

    Is that what you meant?
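Point 3 can be checked quickly in Python with the stdlib `unicodedata` module. The example characters below are chosen for illustration: three canonically equivalent spellings of "Å", including the angstrom sign, which has a singleton canonical decomposition:

```python
import unicodedata

x = "\u212b"        # ANGSTROM SIGN
y = "\u0041\u030a"  # "A" + COMBINING RING ABOVE
z = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE

# All three are canonically equivalent, so each normal form maps
# them to a single representative:
assert unicodedata.normalize("NFC", x) \
    == unicodedata.normalize("NFC", y) \
    == unicodedata.normalize("NFC", z)  # all "\u00c5"
assert unicodedata.normalize("NFD", x) \
    == unicodedata.normalize("NFD", y) \
    == unicodedata.normalize("NFD", z)  # all "A\u030a"
```

Note that U+212B never survives normalization: singleton decompositions are never recomposed, so even NFC maps it to U+00C5.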

Cowell answered 13/4, 2013 at 10:44 Comment(3)
Re 3, I don't think that's always the case. E.g. (from Wikipedia) string 1 contains "U+212B" (the angstrom sign "Å"), string 2 contains "U+0041 U+030A" (Latin letter "A" and combining ring above "°"). Under NFD, they are equivalent, but under NFC string 2 is converted to "U+00C5" (the Swedish letter "Å"), so the two are not equivalent. It seems to me that NFD is the safest choice. en.wikipedia.org/wiki/Unicode_equivalence#Normal_formsBacciferous
@Bacciferous it's from unicode website unicode.org/reports/tr15/tr15-18.htmlCowell
You're absolutely right, I was about to change my comment after reading more about this issue. The key here is that to go to NFC you first convert to NFD.Bacciferous

NFC

  • requires less space for storage (and thus less RAM in your process).

  • is quite often faster to process, e.g. when comparing strings or converting to a different encoding (fewer bytes mean less data to process).

  • is what most conversion code produces naturally (if 1-to-1 mapping is possible, why would it prefer to convert one char into multiple Unicode code points? With 1-to-1 a very simple mapping table will do the trick); so if you know that your string conversion function produces NFC, you can use the string as it comes out of it.

  • is what most text/source editors produce anyway and thus also what most strings in files (including source code) are encoded as, so again, if you load text from a file you created and that is known to be NFC, no extra processing is required.

NFD

  • usually requires less CPU time to produce, since NFC is typically created out of NFD (any → NFD → NFC; so with NFD, you can stop halfway).

  • makes some very specific operations way easier and faster, e.g. comparing characters without modifiers (when á and ä and a should all be treated as equal characters, e.g. during a search, just skip the modifiers) or when stripping the modifiers (just leave the modifiers out while copying to a new string).

  • is the base for many Unicode transformation operations defined by the standard, and if your strings are NFC, you will first have to convert them (possibly back) to NFD.
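The "comparing characters without modifiers" use case above can be sketched with Python's stdlib `unicodedata`: decompose to NFD, then drop the combining marks (general category `Mn`). The function name `strip_marks` is just an illustrative choice:

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Remove combining marks, e.g. for accent-insensitive matching."""
    # Decompose so every mark becomes a separate code point...
    nfd = unicodedata.normalize("NFD", s)
    # ...then keep everything that is not a combining mark (Mn).
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")

print(strip_marks("á"), strip_marks("ä"), strip_marks("a"))  # a a a
```

With strings already in NFD, the first step is a no-op, which is exactly why NFD makes this kind of processing cheaper.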

Does it matter what you choose for your code?

Actually not. As long as you convert and store all strings in the same form, equivalent strings will always consist of the same code points (in the same order, since the order is specified too) and hence the same byte sequence in memory, and thus can be compared directly with a simple memory compare. Also, every standards-conformant UI will display exactly the same text on screen for either representation.

Note, however, that a Unicode string that mixes NFC and NFD, as well as one that uses code point combinations that qualify as neither NFC nor NFD, is still a perfectly valid Unicode string; it is also considered equivalent to its NFC/NFD form and will display the same in a standards-conformant UI. Such a string is allowed by the standard; it's just not normalized and cannot be compared directly to either NFC or NFD strings. That means that no matter what you pick for your code, unless you control the string creation and thus know for sure which form it has, you must treat all strings coming in from external sources as non-normalized, and first normalize them to whatever form you picked before comparing them directly or passing them to functions that expect a particular normalized form.
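That "normalize at the boundary, then compare" policy can be sketched in a few lines of Python with the stdlib `unicodedata` module (the helper name `eq` is an illustrative choice):

```python
import unicodedata

def eq(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings, treating both as potentially non-normalized."""
    # Bring both sides to the same normal form before a plain comparison.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

mixed = "\u00e4bc"      # "äbc" in NFC, as your code might store it
external = "a\u0308bc"  # decomposed spelling from an external source

print(mixed == external, eq(mixed, external))  # False True
```

Which form you pass as `form` does not matter for the result, as long as both sides go through the same one.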

Disc answered 6/11, 2023 at 14:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.