XmlTextWriter incorrectly writing control characters
Asked Answered
U

3

15

.NET's XmlTextWriter creates invalid xml files.

In XML, some control characters are allowed, like 'horizontal tab' (	), but others are not, like 'vertical tab' (). (See spec.)

I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is ofcourse still invalid.

How can I make sure that XmlTextWriter never produces an illegal XML file?

Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?

Example code:

using (XmlTextWriter writer =
  new XmlTextWriter("test.xml", Encoding.UTF8))
{
  writer.WriteStartDocument();
  writer.WriteStartElement("Test");
  writer.WriteValue("hello \xb world");
  writer.WriteEndElement();
  writer.WriteEndDocument();
}

Output:

<?xml version="1.0" encoding="utf-8"?><Test>hello &#xB; world</Test>
Uzbek answered 24/11, 2011 at 11:3 Comment(2)
You can't have an escaped vertical tab in XML? Could you reference the standard?Vic
@Vic That's right, you can't. XML is for text, not for control characters or binary data. w3.org/TR/REC-xml/#charsetsMaddis
M
14

This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.

The default behavior of an XmlWriter created using Create is to throw an ArgumentException when attempting to write character values in the range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD). These invalid XML characters can be written by creating the XmlWriter with the CheckCharacters property set to false. Doing so will result in the characters being replaced with numeric character entities (&#0; through &#0x1F). Additionally, an XmlTextWriter created with the new operator will replace the invalid characters with numeric character entities by default.

So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.

Maddis answered 24/11, 2011 at 11:55 Comment(1)
It's a bit weird, but apparently even though the XmlTextWriter constructor exists, you're not supposed to use it: msdn.microsoft.com/en-us/library/kkz7cs0d.aspxUzbek
B
6

Just found this question when I was struggling with the same issue and I ended up solving it with an regex:

return Regex.Replace(s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");

Hope it helps someone as an alternative solution.

Balalaika answered 26/6, 2013 at 21:31 Comment(0)
D
1

Built in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.

  • You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though.

See:

XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
  • If setting CheckCharacters to true(which it is by default) is a bit too strict since it will simply throw an exception an alternative approach that's more lenient to invalid XML characters would be to just strip them:

Googling a bit yielded the whitelist XmlTextEncoder however it'll also remove DEL and others in the range U+007F–U+0084, U+0086–U+009F that according to Valid XML Characters on wikipedia are only valid in certain contexts and which the RFC mentions as discouraged but still valid characters.

public static class XmlTextExtentions
{
    private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
        { '&', "&amp;"}, { '<', "&lt;" }, { '>', "&gt;" }, 
        { '"', "&quot;" }, { '\'', "&apos;" }
    };
    public static string ToValidXmlString(this string str)
    {
        var stripped = str
            .Select((c,i) => new 
            { 
                c1 = c, 
                c2 = i + 1 < str.Length ? str[i+1]: default(char),
                v = XmlConvert.IsXmlChar(c),
                p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
                pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
            })
            .Aggregate("", (s, c) => {                  
                if (c.pp)
                    return s;
                if (textEntities.ContainsKey(c.c1))
                    s += textEntities[c.c1];
                else if (c.v)
                    s += c.c1.ToString();
                else if (c.p)
                    s += c.c1.ToString() + c.c2.ToString();
                return s;
            });
        return stripped;
    }
}

This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL which XmlConvert.IsXmlChar, Wikipedia, and the spec marks as a valid (although discouraged) character.

Dolichocephalic answered 24/11, 2011 at 21:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.