What is normalized UTF-8 all about?

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

However, I'm trying to figure out what this means for applications. For example, in which cases do I want "Canonical Equivalence" instead of "Compatibility Equivalence", or vice versa?

Loera answered 28/10, 2011 at 15:14 Comment(5)
w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞Cabob
@Cabob I really want to know whether those extra symbols can have states or notLeaden
@Eonil - I'm not sure what state means in the context of unicode.Cabob
@Cabob For example, some code point like this: (begin curved line) (char1) (char2) … (charN) (end curved line) rather than this: (curved line marker prefix) (char1) (curved line marker prefix) (char2) … (curved line marker prefix) (charN). In other words, minimal unit which can be rendered?Leaden
That sounds like a good question on its own.Cabob

Everything You Never Wanted to Know about Unicode Normalization

Canonical Normalization

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form. The resulting code points should appear identical to the original ones barring any bugs in the fonts or rendering engine.

When To Use

Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit for bit identical to the input.

Canonical normalization comes in 2 forms: NFD and NFC. The two are equivalent in the sense that one can convert between these two forms without loss. Comparing two strings under NFC will always give the same result as comparing them under NFD.

NFD

NFD has the characters fully expanded out. This is the faster normalization form to calculate, but it results in more code points (i.e. uses more space).

If you just want to compare two strings that are not already normalized, this is the preferred normalization form unless you know you need compatibility normalization.

NFC

NFC recombines code points when possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.
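
As a rough illustration, here is a minimal PHP sketch using the intl extension's Normalizer class (the PHP binding of ICU mentioned in the question); the composed/decomposed "é" is just an assumed example value:

    <?php
    // "é" written two canonically equivalent ways:
    $composed   = "\u{00E9}";     // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    $decomposed = "e\u{0301}";    // U+0065 + U+0301 COMBINING ACUTE ACCENT

    var_dump($composed === $decomposed);   // false: the byte sequences differ

    // NFC recombines, NFD expands.
    var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $composed);    // true
    var_dump(Normalizer::normalize($composed, Normalizer::FORM_D) === $decomposed);    // true

    // NFC is shorter in bytes (2 vs 3 for this character).
    var_dump(strlen($composed), strlen($decomposed));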

Compatibility Normalization

Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these to allow text in those character sets to be processed as Unicode, and then be converted back without loss.

Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

Characters that include formatting information are replaced with ones that do not. For example the superscript character ⁹ gets converted to the plain digit 9. Others don't involve formatting differences. For example the Roman numeral character Ⅸ is converted to the regular letters IX.

Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.

When to use

The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

An excellent use case would be a search engine, since you would probably want a search for 9 to match ⁹.
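
For instance, a search index might fold both the query and the indexed text to NFKC before comparing. A minimal PHP sketch, assuming the intl Normalizer is available (the superscript nine is just an example value):

    <?php
    $query   = "9";
    $indexed = "\u{2079}";   // U+2079 SUPERSCRIPT NINE

    // Fold both sides to NFKC before indexing/searching.
    $q = Normalizer::normalize($query, Normalizer::FORM_KC);
    $i = Normalizer::normalize($indexed, Normalizer::FORM_KC);

    var_dump($q === $i);   // true: both are the plain digit "9"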

One thing you should probably not do is display the result of applying compatibility normalization to the user.

NFKC/NFKD

Compatibility normalization comes in two forms, NFKD and NFKC. They have the same relationship to each other as NFD and NFC do.

Any string in NFKC is inherently also in NFC, and the same for the NFKD and NFD. Thus NFKD(x)=NFD(NFKC(x)), and NFKC(x)=NFC(NFKD(x)), etc.
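
If you want to convince yourself of these identities, a quick PHP check (the sample string is arbitrary; any string without unassigned code points should behave the same way):

    <?php
    $x = "e\u{0301} \u{2168}";   // decomposed "é" plus U+2168 ROMAN NUMERAL NINE

    $nfkd = Normalizer::normalize($x, Normalizer::FORM_KD);
    $nfkc = Normalizer::normalize($x, Normalizer::FORM_KC);

    // NFKD(x) == NFD(NFKC(x))
    var_dump($nfkd === Normalizer::normalize($nfkc, Normalizer::FORM_D));   // true
    // NFKC(x) == NFC(NFKD(x))
    var_dump($nfkc === Normalizer::normalize($nfkd, Normalizer::FORM_C));   // true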

Conclusion

If in doubt, go with canonical normalization. Choose NFC or NFD based on the space/speed trade-off applicable, or based on what is required by something you are inter-operating with.

Belford answered 28/10, 2011 at 20:13 Comment(6)
A quick reference to remember what the abbreviations stand for: NF = normalized form D = decompose (decompress), C = compose (compress) K = compatibility (since "C" was taken).Hallerson
You always want to NFD all strings on input as the very first thing, and NFC all strings output as the very last thing. This is well known.Rodas
@tchrist: That is generally good advice, except in the rare cases where you desire the output to be byte for byte identical to the input when no changes are made. There are some other cases where you want NFC in memory or NFD on disk, but they are the exception rather than the rule.Belford
@Kevin: Yes, NFD in and NFC out will destroy the singletons. I'm not sure that anyone cares about those, but possibly.Rodas
"Comparing two strings under NFC will always give the same result as comparing them under NFD.", but according to normalization stability section "[...] if a string that does not have any unassigned characters is normalized under one version of Unicode, it must remain normalized under all future versions of Unicode." So if Q-caron is introduced in a later version and you try to compare Q + caron containing string to Q-caron string, the NFC form would not be equivalent, but NFD form should. Is that right?Gangling
You might think that, but from the annex: "To transform a Unicode string into a given Unicode Normalization Form, the first step is to fully decompose the string". Thus even when running NFC, Q-Caron would first become Q+Caron, and could not recompose, since the stability rules prohibit adding the new composition mapping. NFC is effectively defined as NFC(x)=Recompose(NFD(x)).Belford

Some characters, for example a letter with an accent (say, é) can be represented in two ways - a single code point U+00E9 or the plain letter followed by a combining accent mark U+0065 U+0301. Ordinary normalization will choose one of these to always represent it (the single code point for NFC, the combining form for NFD).

For characters that could be represented by multiple sequences of base characters and combining marks (say, "s, dot below, dot above" vs putting dot above then dot below or using a base character that already has one of the dots), NFD will also pick one of these (below goes first, as it happens).
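
A small PHP sketch of that reordering, assuming the intl extension's Normalizer class (the s-with-two-dots sequences mirror the example above):

    <?php
    // The same two combining marks on "s", attached in different orders:
    $a = "s\u{0323}\u{0307}";   // s + COMBINING DOT BELOW + COMBINING DOT ABOVE
    $b = "s\u{0307}\u{0323}";   // s + COMBINING DOT ABOVE + COMBINING DOT BELOW

    var_dump($a === $b);   // false: the code point order differs

    // NFD sorts the marks into canonical order (dot below first),
    // so both sequences normalize to the same string.
    var_dump(Normalizer::normalize($a, Normalizer::FORM_D)
         === Normalizer::normalize($b, Normalizer::FORM_D));   // true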

The compatibility decompositions include a number of characters that "shouldn't really" be characters but are because they were used in legacy encodings. Ordinary normalization won't unify these (to preserve round-trip integrity - this isn't an issue for the combining forms because no legacy encoding [except a handful of Vietnamese encodings] used both), but compatibility normalization will. Think of the "kg" kilogram sign that appears in some East Asian encodings (or the halfwidth/fullwidth katakana and alphabet), or the "fi" ligature in MacRoman.

See http://unicode.org/reports/tr15/ for more details.

Haematoblast answered 28/10, 2011 at 15:39 Comment(1)
This is indeed the correct answer. If you use just canonical normalization on text that originated in some legacy character set, the result can be converted back into that character set without loss. If you use compatibility decomposition, you end up without any compatibility characters, but it is no longer possible to convert back to the original character set without loss.Belford

Normal forms (of Unicode, not databases) deal primarily (exclusively?) with characters that have diacritical marks. Unicode provides some characters with "built in" diacritical marks, such as U+00C0, "Latin Capital A with Grave". The same character can be created from a "Latin Capital A" (U+0041) with a "Combining Grave Accent" (U+0300). That means even though the two sequences produce the same resulting character, a byte-by-byte comparison will show them as being completely different.

Normalization is an attempt at dealing with that. Normalizing assures (or at least tries to) that all the characters are encoded the same way -- either all using a separate combining diacritical mark where needed, or all using a single code point wherever possible. From a viewpoint of comparison, it doesn't really matter a whole lot which you choose -- pretty much any normalized string will compare properly with another normalized string.
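
In practice that comparison boils down to normalizing both sides to the same form first. A minimal PHP sketch, assuming the intl Normalizer; unicodeEquals is just an illustrative helper name:

    <?php
    // Canonical equality: normalize both strings to the same form
    // (NFC here, but NFD works equally well) before the byte comparison.
    function unicodeEquals(string $a, string $b): bool {
        return Normalizer::normalize($a, Normalizer::FORM_C)
           === Normalizer::normalize($b, Normalizer::FORM_C);
    }

    var_dump(unicodeEquals("\u{00C0}", "A\u{0300}"));   // true: À vs A + combining grave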

In this case, "compatibility" means compatibility with code that assumes that one code point equals one character. If you have code like that, you probably want to use the compatibility normal form. Although I've never seen it stated directly, the names of the normal forms imply that the Unicode consortium considers it preferable to use separate combining diacritical marks. This requires more intelligence to count the actual characters in a string (as well as things like breaking a string intelligently), but is more versatile.

If you're making full use of ICU, chances are that you want to use the canonical normal form. If you're trying to write code on your own that (for example) assumes a code point equals a character, then you probably want the compatibility normal form that makes that true as often as possible.

Ramah answered 28/10, 2011 at 15:36 Comment(4)
So this is the part where the Grapheme Functions come in then. Not only is the character more bytes than ASCII - but multiple sequences can be a single character right? (As opposed to the MB string functions.)Loera
No, the 'one code point is one character' corresponds roughly to NFC (the one with the combining marks is NFD, and neither of them is "compatibility") - The compatibility normalizations NFKC/NFKD are a different issue; compatibility (or lack thereof) for legacy encodings that e.g. had separate characters for greek mu and 'micro' (that's a fun one to bring up because the "compatibility" version is the one that's in the Latin 1 block)Haematoblast
@Random832: Oops, quite right. I should know better than to go from memory when I haven't worked with it for the last year or two.Ramah
@Haematoblast That is not true. Your “roughly” is too out there. Consider the two graphemes, ō̲̃ and ȭ̲. There are many many ways to write each of those, of which exactly one each is NFC and one NFD, but others also exist. In no case is that only one code point. NFD for the first is "o\x{332}\x{303}\x{304}", and NFC is "\x{22D}\x{332}". For the second NFD is "o\x{332}\x{304}\x{303}" and NFC is "\x{14D}\x{332}\x{303}". However, many non-canonical possibilities exist which are canonically equivalent to these. Normalization allows binary comparison of canonically equivalent graphemes.Rodas

If two unicode strings are canonically equivalent the strings are really the same, only using different unicode sequences. For example Ä can be represented either using the character Ä or a combination of A and ◌̈.

If the strings are only compatibility equivalent the strings aren't necessarily the same, but they may be the same in some contexts. E.g. the ligature ﬀ could be considered the same as ff.
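
A quick way to see the difference in PHP, assuming the intl Normalizer is available (U+FB00 is the ff ligature):

    <?php
    $ligature = "\u{FB00}";   // U+FB00 LATIN SMALL LIGATURE FF
    $plain    = "ff";

    // Canonical normalization leaves the ligature untouched...
    var_dump(Normalizer::normalize($ligature, Normalizer::FORM_C) === $plain);    // false

    // ...but compatibility normalization folds it to the two letters.
    var_dump(Normalizer::normalize($ligature, Normalizer::FORM_KC) === $plain);   // true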

So, if you are comparing strings you should use canonical equivalence, because compatibility equivalence isn't real equivalence.

But if you want to sort a set of strings it might make sense to use compatibility equivalence, as they are nearly identical.

Doings answered 28/10, 2011 at 15:38 Comment(0)

This is actually fairly simple. UTF-8 has several different representations of the same "character". (I use character in quotes since byte-wise they are different, but practically they are the same). An example is given in the linked document.

The character "Ç" can be represented as the byte sequence 0xc387. But it can also be represented by a C (0x43) followed by the byte sequence 0xcca7. So you can say that 0xc387 and 0x43cca7 are the same character. The reason that works, is that 0xcca7 is a combining mark; that is to say it takes the character before it (a C here), and modifies it.

Now, as far as the difference between canonical equivalence vs compatibility equivalence, we need to look at characters in general.

There are two types of characters: those that convey meaning through their value, and those that take another character and alter it. 9 is a meaningful character. A super-script ⁹ takes that meaning and alters it by presentation. So canonically they are different characters, but they still represent the same base character.

Canonical equivalence is where the byte sequence is rendering the same character with the same meaning. Compatibility equivalence is when the byte sequence is rendering a different character with the same base meaning (even though it may be altered). The 9 and ⁹ are compatibility equivalent since they both mean "9", but are not canonically equivalent since they don't have the same representation.

Chas answered 28/10, 2011 at 15:42 Comment(1)
@tchrist: Read the answer again. I never even made mention of the different ways to represent the same code point. I said there are multiple ways of representing the same printed character (via combinators and multiple characters). Which applies to both UTF-8 and Unicode. So your downvote and comment don't really apply at all to what I said. In fact, I basically was making the same point that the top poster here made (albeit not as well)...Chas

Whether canonical equivalence or compatibility equivalence is more relevant to you depends on your application. The ASCII way of thinking about string comparisons roughly maps to canonical equivalence, but Unicode represents a lot of languages. I don't think it is safe to assume that Unicode encodes all languages in a way that allows you to treat them just like Western European ASCII.

Figures 1 and 2 provide good examples of the two types of equivalence. Under compatibility equivalence, it looks like the same number in sub- and super-script form would compare equal. But I'm not sure that solves the same problem as the cursive Arabic form or the rotated characters.

The hard truth of Unicode text processing is that you have to think deeply about your application's text processing requirements, and then address them as well as you can with the available tools. That doesn't directly address your question, but a more detailed answer would require linguistic experts for each of the languages you expect to support.

Cabob answered 28/10, 2011 at 15:38 Comment(0)

The problem with comparing strings: two strings whose content is equivalent for the purposes of most applications may contain differing character sequences.

See Unicode's canonical equivalence: if the comparison algorithm is simple (or must be fast), Unicode equivalence is not performed. This problem occurs, for instance, in XML canonical comparison; see http://www.w3.org/TR/xml-c14n

To avoid this problem, what standard should you use: "expanded UTF-8" or "compact UTF-8"?
Should you use "ç" or "c+◌̧"?

The W3C and others (e.g. for file names) suggest using the composed form as canonical (think of the C as "compact", i.e. shorter strings)... So,

The standard is C! When in doubt, use NFC.

For interoperability, and for "convention over configuration" choices, the recommendation is to use NFC to "canonicalize" external strings. To store canonical XML, for example, store it in "FORM_C". The W3C's CSV on the Web Working Group also recommends NFC (section 7.2).

PS: de "FORM_C" is the default form in most of libraries. Ex. in PHP's normalizer.isnormalized().


Ther term "compostion form" (FORM_C) is used to both, to say that "a string is in the C-canonical form" (the result of a NFC transformation) and to say that a transforming algorithm is used... See http://www.macchiato.com/unicode/nfc-faq

(...) each of the following sequences (the first two being single-character sequences) represent the same character:

  1. U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
  2. U+212B ( Å ) ANGSTROM SIGN
  3. U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. (...) A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
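
Those three sequences can be checked with PHP's Normalizer; a minimal sketch (the code points are taken from the FAQ excerpt above, everything else is illustrative):

    <?php
    $ring     = "\u{00C5}";    // 1. LATIN CAPITAL LETTER A WITH RING ABOVE
    $angstrom = "\u{212B}";    // 2. ANGSTROM SIGN
    $combined = "A\u{030A}";   // 3. A + COMBINING RING ABOVE

    // toNFC(S) ~ Normalizer::normalize(S, Normalizer::FORM_C)
    // isNFC(S) ~ Normalizer::isNormalized(S, Normalizer::FORM_C)
    foreach ([$ring, $angstrom, $combined] as $s) {
        var_dump(Normalizer::normalize($s, Normalizer::FORM_C) === $ring);   // true for all three
    }
    var_dump(Normalizer::isNormalized($angstrom, Normalizer::FORM_C));       // false: the singleton decomposes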


Note: to test normalization of small strings (pure UTF-8 or XML-entity references), you can use this test/normalize online converter.

Levorotatory answered 28/10, 2011 at 15:14 Comment(2)
I'm confused. I went to this online tester page and I enter there: "TÖST MÉ pleasé." and try all 4 of given normalizations - none changes my text in any way, well, except that it changes the codes used to present those chars. Am I wrongly thinking that "normalization" means "remove all the diacritics and similar", and it actually means - just change the utf coding beneath?Ellersick
Hi @Ellersick, perhaps you need to decide on the application: is it to compare or to standardize your text? My post here is only about "standardizing" applications. PS: when the whole world uses the standard, the comparison problem vanishes.Levorotatory
