It's probably time to summarize again, this time also with a view to XML 1.1.
What control character code points are there in Unicode?
- U+0000 to U+001F, inherited from ASCII.
- U+007F, inherited from ASCII.
- U+0080 to U+009F, inherited from Latin-1.
- Various special-purpose ranges, standardized explicitly for Unicode and mostly useful in non-markup contexts. They are discussed here block by block, including reasons why and how to use them (or not use them) in XML, and what to do if you run into them anyway.
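As a quick check of the inherited ranges listed above, here is a minimal Python sketch that scans the Unicode database for code points whose General_Category is Cc ("control"). The helper name `cc_ranges` is made up for this illustration. Note the assumption: Cc covers only the ASCII and Latin-1 ranges. The Unicode-specific control-like characters generally carry other categories, such as Cf (format), so they will not show up here.

```python
import sys
import unicodedata


def cc_ranges():
    """Return contiguous (first, last) ranges of code points in category Cc."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        is_cc = unicodedata.category(chr(cp)) == "Cc"
        if is_cc and start is None:
            start = cp                      # a new Cc run begins here
        elif not is_cc and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


for first, last in cc_ranges():
    print(f"U+{first:04X} to U+{last:04X}")
```

Because U+007F sits directly before U+0080, the scan reports DEL and the Latin-1 controls as a single run.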
How does XML look at those control characters?
This is a different classification.
- Tab and newline (regardless of the platform dependency of what's a newline) are good. Everybody uses them. Everybody knows what they are supposed to stand for. Allowed in almost all known forms, often even for pretty printing of the markup itself.
- U+0000 is evil. Null character? String terminator? Binary noise? Antithesis to both interoperability and markup. Forbidden in all forms.
- Anything else? Scarcely used, problematic interoperability, but there are ways to tolerate them even without knowing much about what they are supposed to "control".
Let's now switch our attention to this last category only, control codes proper. That is, the following summary does NOT apply to tabs and newlines: U+0009, U+000A, U+000D, U+0085, U+2028.
XML 1.0 allows all the above ranges of control characters, except U+0000 to U+001F, both as text (directly included characters) and as numeric character references. Allowing U+007F to U+009F was apparently an oversight. XML 1.1 corrected this inconsistency, but the other way round: rather than forbidding that range as well, it permits both ranges (U+0000 still excluded), though only as character references. They even gave a detailed rationale inside the standard:
Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
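The rules quoted above can be condensed into a small classifier. This is a sketch, not a conformance tool: the function name `xml_status` and its three return labels are invented for this example, and the ranges are transcribed from the `Char` and `RestrictedChar` productions of the two specifications as I read them.

```python
def xml_status(cp: int, version: str = "1.0") -> str:
    """Classify a code point as 'literal', 'reference-only' or 'forbidden'.

    'literal' means the character may appear directly and as a character
    reference; 'reference-only' means it must be written as &#...;.
    """
    if version not in ("1.0", "1.1"):
        raise ValueError("version must be '1.0' or '1.1'")

    # Outside Char in both versions: U+0000, surrogates, U+FFFE, U+FFFF,
    # and anything that is not a Unicode code point at all.
    if cp <= 0 or 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF) or cp > 0x10FFFF:
        return "forbidden"

    if version == "1.0":
        # Only tab, LF and CR survive from the C0 range; U+007F to U+009F are allowed.
        if cp < 0x20 and cp not in (0x09, 0x0A, 0x0D):
            return "forbidden"
        return "literal"

    # XML 1.1: RestrictedChar may only appear as character references.
    restricted = (
        0x01 <= cp <= 0x08
        or cp in (0x0B, 0x0C)
        or 0x0E <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
    return "reference-only" if restricted else "literal"


for cp in (0x00, 0x01, 0x09, 0x7F, 0x85, 0x9F):
    print(f"U+{cp:04X}: 1.0={xml_status(cp, '1.0')}, 1.1={xml_status(cp, '1.1')}")
```

U+0085 is the one hole in the U+007F to U+009F range because XML 1.1 treats it as a line end, which is the "whitespace characters are exempt" remark in the quote.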
Why do Unicode and XML allow free use of markup-like control characters, apart from the few "inherited" ranges? People should be using markup for those.
Unicode is also used in non-markup contexts, and it is a still-evolving character set. It would be too difficult to implement a conforming XML processor if the set of non-control characters were a moving target.
OK, what's wrong with the inherited ranges then, compared to the Unicode-specific control characters?
Lack of standardization. The Unicode consortium didn't really get to choose which numbers are assigned to those "characters", or what their typical visual presentation or meaning is. Full backward compatibility with ASCII (at the encoded UTF-8 level) and with Latin-1 (at the code point assignment level) forced raw inclusion of these code points, regardless of the various specialized and overloaded meanings often attached to them in various text processing contexts.
Wait, are you saying that XML isn't meant to be fully backward compatible with ASCII, unlike UTF-8?
Yeah. That's correct. You need a document element. You can't even put in a raw < or &. So why would you ever need to put in raw control characters?
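To see this in practice, here is a small sketch using Python's standard `xml.etree.ElementTree`, which is backed by expat, an XML 1.0 parser. The helper `parses` is made up for the illustration. Other parsers may report the errors differently, but a conforming XML 1.0 processor should reject the same inputs.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if doc is accepted as a well-formed XML 1.0 document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


samples = {
    "plain ASCII, no document element": "just text",
    "raw < in content": "<a>a < b</a>",
    "escaped < and &": "<a>a &lt; b &amp; c</a>",
    "raw U+0001": "<a>\x01</a>",
    "reference to U+0001": "<a>&#x1;</a>",
    "reference to U+0085": "<a>&#x85;</a>",
}

for label, doc in samples.items():
    print(f"{label}: {'accepted' if parses(doc) else 'rejected'}")
```

Only the escaped markup characters and the reference to U+0085 get through. Everything else is rejected, including the reference to U+0001, since this parser follows XML 1.0 and not XML 1.1.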