Normalizing Unicode according to the W3C in PHP

While validating my website's HTML code in the W3C validator I got the following warning:

Line 157, Column 220: Text run is not in Unicode Normalization Form C.

…i͈̭̋ͥ̂̿̄̋̆ͣv̜̺̋̽͛̉͐̀͌̚e͖̼̱ͣ̓ͫ͆̍̄̍͘-̩̬̰̮̯͇̯͆̌ͨ́͌ṁ̸͖̹͎̱̙̱͟͡i̷̡͌͂͏̘̭̥̯̟n̏͐͌̑̄̃͘͞…

I'm developing it in PHP 5.3.x, so I can use the Normalizer class.

So, in order to fix this, should I use Normalizer::normalize($output) when displaying any input made by a user (e.g. a comment) or should I use Normalizer::normalize($input) for any user input before storing it in the database?

tl;dr: should I use Unicode normalization before storing user input in the database or just when it's displayed?
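
For concreteness, a minimal sketch of the two options being asked about (variable names are invented for illustration; assumes the intl extension, which provides the Normalizer class in PHP 5.3+, is enabled):

    <?php
    // Option A: normalize user input once, before it is stored.
    $comment = Normalizer::normalize($_POST['comment'], Normalizer::FORM_C);
    if ($comment === false) {
        // normalize() returns false on failure (e.g. invalid UTF-8)
        die('Invalid input encoding');
    }
    // ... insert $comment into the database ...

    // Option B: normalize on output, just before the stored text is displayed.
    echo htmlspecialchars(
        Normalizer::normalize($storedComment, Normalizer::FORM_C),
        ENT_QUOTES,
        'UTF-8'
    );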

Sampson asked 7/1, 2012 at 1:52 Comment(5)
What kind of data are you displaying on your page? This seems more like a problem of the validator and not of your data.Err
Something like this that a user could legitimately post. It's just a bunch of text with a lot of superscripts and subscripts that looks awful.Sampson
Interesting :) I'm sure that the validator breaks with some kind of combinations of that kind of chars... But I also found this thread comments.gmane.org/gmane.org.w3c.validator/13243Err
Thanks for the link man, I didn't know it was such a complex subject. I guess I'll normalize everything just in case then... As long as the validator is happy, the browsers should be too.Sampson
Yeah, this thread is endless... :-)Err

It is up to you to decide, on the basis of the purpose and nature of your application, whether you apply normalization when reading user input, when storing it to a database, when writing it out, or not at all. To summarize the long thread mentioned in the comments to the question, also available in the official list archive at http://validator.w3.org/feedback.html:

  • The warning message comes from the experimental “HTML5 validation” (which is really a linter, applying subjective rules in addition to some formal tests).
  • The message is not based on any requirement in HTML5 drafts but on opinions on what might cause problems in some software.
  • The opinions originally made “HTML5 validation” issue an error message, now a warning.

It is certainly possible, though uncommon, to get unnormalized data as user input. This does not depend on normalization carried out by browsers (they don’t do such things, though they conceivably might in the future) but on input methods and habits. For example, methods of typing the letter ü (u umlaut, or u with diaeresis) tend to produce the character in precomposed form, as normalized. People can produce it as unnormalized, in decomposed form, as letter u followed by combining diaeresis, but they usually have no reason to do so, and most people wouldn’t even know how to do that.

If you do string comparisons in your software, they may or may not (depending on comparison routines used) treat e.g. a precomposed ü as equal to the decomposed presentation. Simple implementations treat them as different, as they are definitely distinct at the simple character level (Unicode code points).
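
To make that concrete, a small sketch in PHP (assumes the intl extension; the strings are written as raw UTF-8 byte escapes so it also runs on PHP 5.3, which lacks \u{} escapes):

    <?php
    $precomposed = "\xC3\xBC";   // "ü" as a single code point, U+00FC
    $decomposed  = "u\xCC\x88";  // "u" followed by U+0308 COMBINING DIAERESIS

    var_dump($precomposed === $decomposed);           // bool(false): distinct code point sequences
    var_dump(Normalizer::isNormalized($decomposed));  // bool(false): not in NFC
    var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C)
             === $precomposed);                       // bool(true): equal once both are in NFC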

One reason to normalize at some point, in the writing phase at the latest, is that precomposed characters generally get displayed more reliably. To present a normalized ü, a program just has to pick up a glyph from a font. To present a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü or write the letter u with a diaeresis symbol properly placed above it, with due attention to the graphic properties of the glyph for u, and many programs fail in this.

On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have a reason to have produced it. He may have the idea that normalized ü and unnormalized ü are distinct and need to be treated as such.

Kesha answered 7/1, 2012 at 9:15 Comment(4)
Great answer, really detailed and thought through. However, I disagree with the last paragraph... If both of the methods for typing the letter ü (whether it is u umlaut or u with diaeresis) result in ü (the exact same character, with no humanly visible difference), why treat them as different things? I am probably wrong here, but wouldn't this be a perfect example in which normalization should be used?Sampson
As text they should be considered equivalent. If there are operations that also treat them as octets, then they can't. An example would be if they had a digital signature - normalising would change it so that it was no longer what was signed. This is the reason that XML Signatures have a normalisation step as part of the actual signing, so it'll only ever be NFC that is signed. When outputting as HTML it'll be output as text and this is irrelevant so it should still be NFC, but you may have a reason for retaining the form sent as well.Disarrange
@John Doe, they do not result in the same character but in a character and a two-character sequence, which are canonically equivalent. Canonical equivalence is not identity, and programs may treat canonically equivalent characters as distinct, though we should not expect programs to do so. Canonical equivalence does not even imply visual identity, due to the rendering mechanisms I referred to (e.g., showing a precomposed ü by using a glyph directly but showing a decomposed ü by using the “u” glyph and placing “¨” over it, sometimes even taking the diacritic from another font!).Kesha
I believe the many disparate bits of knowledge about encoding, user input and utf8 make this answer the single most informative post I've ever read about utf8.Colloid

Strictly speaking, the rules of the web character model are not just that one should normalise to NFC, but that both the form before and the form after any technology that includes text from another mechanism is run should be in NFC. Examples would be XML includes, character references and entity references. For example, markup such as a&#x308; (the letter a followed by a character reference for U+0308 COMBINING DIAERESIS) would not fit the character model: while the source is in NFC, expanding the character reference turns it into a followed by a combining diaeresis, which is not NFC. Mostly, avoiding this is pretty easy in practice, but it's worth noting.
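
A sketch of that pitfall in PHP (assuming the intl extension is available): the markup is in NFC before the reference is expanded, but the expanded text is not:

    <?php
    $source  = 'a&#x308;';                                        // plain ASCII, trivially NFC
    $decoded = html_entity_decode($source, ENT_QUOTES, 'UTF-8');  // "a" + U+0308

    var_dump(Normalizer::isNormalized($decoded));              // bool(false)
    var_dump(Normalizer::normalize($decoded) === "\xC3\xA4");  // bool(true): NFC gives precomposed "ä"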

There is an interesting case with U+0338 (COMBINING LONG SOLIDUS OVERLAY): > followed by U+0338 normalises to ≯ (U+226F), and < followed by U+0338 produces ≮ (U+226E). The reasons why it should not be allowed at the start of an element name or as the first character within an element should be clear.

As a rule, it makes no sense to have a piece of text start with a combining character in any case, but this particular example allows for the entire document to be mangled (even if you don't normalise, since something else may).
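
You can watch that happen with the same Normalizer class (a sketch using raw UTF-8 byte escapes):

    <?php
    $markup = '<' . "\xCC\xB8";  // "<" immediately followed by U+0338
    $nfc    = Normalizer::normalize($markup, Normalizer::FORM_C);

    var_dump($nfc === "\xE2\x89\xAE");  // bool(true): the pair composed into U+226E (≮)
    var_dump($nfc === $markup);         // bool(false): the literal "<" is gone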

If you are concerned only with the text qua text (digital signatures are of no interest, for example), then normalising on input simplifies the rest of what you do, including your internal use of the text (e.g. searching), so is probably the way to go.

See http://www.w3.org/TR/charmod-norm/ for more.

Disarrange answered 12/1, 2012 at 9:10 Comment(0)
