Problem:
When whitespace is insignificant, representation may be very significant.
Explanation:
In XML Schema Part 2: Datatypes Second Edition, the constraining facet whiteSpace is defined for types derived from string (http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace). If this facet is set to replace or collapse, the value may be changed during normalization.
There is a note at the end of Section 4.3.6:
The notation #xA used here (and elsewhere in this specification) represents the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by U+000A. This notation is to be distinguished from &#xA;, which is the XML character reference to that same UCS code point.
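To make the facet values concrete, here is a minimal sketch of how I read the replace and collapse rules in Section 4.3.6. The function name apply_whitespace_facet is my own and not part of any library.

```python
import re


def apply_whitespace_facet(value: str, facet: str) -> str:
    """Apply an XML Schema whiteSpace facet value to a string (hypothetical helper)."""
    if facet == "preserve":
        # preserve: the value is left untouched
        return value
    # replace: every tab, line feed and carriage return becomes a space
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # collapse: after replace, runs of spaces shrink to one space,
        # and leading and trailing spaces are removed
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet value: {facet!r}")


print(repr(apply_whitespace_facet(" a\t\tb \n", "replace")))
print(repr(apply_whitespace_facet(" a\t\tb \n", "collapse")))
```

The function only sees the already parsed string, so it cannot tell whether a given space came from a literal character or from a character reference.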
Example:
If the datatype for an element elem has the whiteSpace constraint collapse, "<elem> text </elem>" should become "text" (leading and trailing whitespace removed), but "<elem>&#x20;text&#x20;</elem>" should become " text " (whitespace encoded by character reference not removed).
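The difficulty shows up as soon as the document is parsed. The following sketch assumes the second variant encodes its spaces as &#x20; and uses only the standard xml.etree.ElementTree API. After parsing, both variants expose the same text, so the information about how the whitespace was written is already gone.

```python
import xml.etree.ElementTree as ET

# Whitespace written as literal characters
literal = ET.fromstring("<elem> text </elem>")

# Whitespace written as character references (assumed encoding: &#x20;)
referenced = ET.fromstring("<elem>&#x20;text&#x20;</elem>")

# The parser resolves character references before the tree is built,
# so both elements carry an identical text value.
print(repr(literal.text))
print(repr(referenced.text))
print(literal.text == referenced.text)
```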
Questions:
Either the parser/tree builder handles this normalization, or it is done afterwards.
- Informed parsing:
  - Where do I provide the parser or tree builder with the information on how to normalize a given XML element?
  - Is there something like set_whitespace_normalization('./country/neighbor', 'collapse')?
  - Is there a hook like normalize(content) in the parser or tree builder?
- Post-processing:
  - How do I access the original content of an element?
  - Is there an elem.original_text that may return " text "?
  - Is there an elem.unnormalized_text that may return "text"?
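For illustration, the post-processing variant could be sketched roughly as follows with ElementTree. The helpers collapse and normalize_elements are hypothetical names of my own, and the sample document and the &#x20; encoding are assumptions. The sketch also shows why this is not enough: it collapses whitespace that was written as a character reference as well.

```python
import re
import xml.etree.ElementTree as ET


def collapse(value):
    """Hypothetical helper: whiteSpace=collapse applied to an already parsed string."""
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")


def normalize_elements(root, path, normalizer):
    """Hypothetical helper: apply normalizer to the text of every element matching path."""
    for elem in root.iterfind(path):
        if elem.text is not None:
            elem.text = normalizer(elem.text)


doc = ET.fromstring(
    "<data><country>"
    "<neighbor> Austria </neighbor>"
    "<neighbor>&#x20;Italy&#x20;</neighbor>"
    "</country></data>"
)
normalize_elements(doc, "./country/neighbor", collapse)

# Both values come out collapsed, because the character references
# were resolved by the parser before this pass could see them.
texts = [n.text for n in doc.iterfind("./country/neighbor")]
print(texts)
```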
I would like to use Python's xml.etree.ElementTree but I will consider any other XML library that does the job.
Disclaimer:
Of course, it is bad style to declare whitespace insignificant (replace or collapse) and then to cheat by using character references. In most cases either the data or the schema should be changed to prevent that, but sometimes you have to work with foreign XML schemata and foreign XML documents. And the mere existence of the note cited above indicates that the editors of the specification were aware of this dilemma and deliberately did not prevent it.