Dealing with invalid XML hexadecimal characters

8

24

I'm trying to send an XML document over the wire but receiving the following exception:

"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
   at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
   at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
   at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
   at System.Xml.XmlRawWriter.WriteValue(String value)
   at System.Xml.XmlWellFormedWriter.WriteValue(String value)
   at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
   --- End of inner exception stack trace ---

I don't have any control over what I attempt to send because the string is gathered from an email. How can I encode my string so it's valid XML while keeping the illegal characters?

I'd like to keep the original characters one way or another.

Deserve answered 17/11, 2011 at 16:31 Comment(1)
Depends whether the illegal characters are things like x0 that XML can't handle at all, or things like < that merely need to be escaped. - Dona
S
16
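byte[] toEncodeAsBytes
            = System.Text.Encoding.UTF8.GetBytes(toEncode);
      string returnValue
            = System.Convert.ToBase64String(toEncodeAsBytes);

Base64-encoding the string is one way of doing this; encode with UTF-8 so that non-ASCII characters survive the round trip (an ASCII encoding would turn them into '?').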
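The receiving side has to reverse the encoding to get the original text back. A minimal round-trip sketch follows; the class name `Base64Body` and its two methods are hypothetical helpers, and it assumes the string contains no unpaired surrogates (UTF-8 encoding would replace those).

```csharp
using System;
using System.Text;

public static class Base64Body
{
    // Encode arbitrary text (control characters included) as Base64,
    // whose alphabet consists only of characters that are legal in XML.
    public static string Encode(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

    // Reverse of Encode: recover the original text on the receiving side.
    public static string Decode(string base64) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(base64));
}
```

For example, `Base64Body.Decode(Base64Body.Encode("before\u0002after"))` should give back the original string, 0x02 included. The cost is that the element content is no longer human-readable and the receiver must know to decode it.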

Stolid answered 17/11, 2011 at 16:35 Comment(0)
W
25

The following code removes XML invalid characters from a string and returns a new string without them:

public static string CleanInvalidXmlChars(string text) 
{ 
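     // From the XML 1.0 spec, valid chars are:
     // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
     // In a .NET regex \x takes exactly two hex digits, so \u escapes are used here.
     // Strings are UTF-16, so #x10000-#x10FFFF appear as surrogate pairs: keep
     // well-formed pairs and remove only lone surrogates.
     string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
               + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
               + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
     return Regex.Replace(text, re, "");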
}
Whyte answered 14/1, 2013 at 16:54 Comment(1)
This doesn't work correctly because the final x10FFFF is not escaped. See this answer for a better regex: https://mcmap.net/q/342650/-unicode-regex-invalid-xml-characters - Terranceterrane
A
11

Another way to remove invalid XML characters in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):

public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

.Net Fiddle - https://dotnetfiddle.net/v1TNus

For example, the vertical tab character (\v) is valid UTF-8 but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.
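One caveat with a per-char filter: as far as I can tell, IsXmlChar rejects each half of a surrogate pair when tested on its own, so characters above U+FFFF (emoji, for instance) would be stripped too. The sketch below keeps well-formed pairs; `XmlCharFilter` and `RemoveInvalidXmlCharsKeepPairs` are hypothetical names, and only `XmlConvert.IsXmlChar` and `char.IsSurrogatePair` are framework calls.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Removes characters that XML 1.0 forbids, but keeps well-formed
    // surrogate pairs, which encode the code points #x10000-#x10FFFF.
    public static string RemoveInvalidXmlCharsKeepPairs(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;

        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char ch = content[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
            else if (i + 1 < content.Length &&
                     char.IsSurrogatePair(ch, content[i + 1]))
            {
                // High surrogate followed by a low surrogate: keep both halves.
                sb.Append(ch).Append(content[i + 1]);
                i++;
            }
            // Everything else (control characters, lone surrogates,
            // U+FFFE and U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

This trades the one-line LINQ expression for an explicit loop, which also avoids the intermediate array.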

Aldenalder answered 20/2, 2018 at 20:0 Comment(0)
C
9

This works for me:

XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };
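To show what this setting does in practice, here is a minimal sketch under some assumptions: the `LenientXml` class and its method names are made up for illustration, and the element name `Body` is borrowed from the question. With the check disabled, the writer is expected to emit a numeric character reference for a character such as 0x02 instead of throwing, and a reader needs the same relaxed setting to accept that reference.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class LenientXml
{
    // Writes <Body>text</Body> with character checking turned off, so
    // characters like 0x02 are escaped rather than causing an exception.
    public static string WriteBody(string text)
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            CheckCharacters = false
        };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("Body", text);
        }
        return sb.ToString();
    }

    // Reads the element text back. The reader is relaxed in the same way;
    // a strict reader would reject the character reference.
    public static string ReadBody(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

Keep in mind that the output is not strictly well-formed XML 1.0, so this only helps when the receiving parser is equally lenient; a service you do not control may still reject the document.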
Chiliasm answered 1/10, 2015 at 13:34 Comment(2)
CheckCharacters = true on the settings did the trick for me. Thanks! - Offstage
where can i bout it ?? [#57287928Signora
W
6

The following solution removes any invalid XML characters, and I think it does so about as efficiently as it can be done. In particular, it does not allocate a new StringBuilder or a new string unless it has already determined that the string contains an invalid character. The hot spot is a single for loop over the characters, and the check is often no more than two greater-than/less-than comparisons per char. If no invalid character is found, it simply returns the original string. This helps when the vast majority of strings are fine to begin with: they pass through as quickly as possible, with no wasted allocations.

-- update --

See below for how to write an XElement that contains these invalid characters directly to a string; that approach builds on this code.

Some of this code was influenced by Mr. Tom Bogle's solution here. See also on that same thread the helpful information in the post by superlogical. Both of those, however, still always instantiate a new StringBuilder and a new string.

USAGE:

    string xmlStrBack = XML.ToValidXmlCharactersString("any string");

TEST:

    public static void TestXmlCleanser()
    {
        string badString = "My name is Inigo Mont\u001Eoya"; // invisible bad char (U+001E) sits between 'Mont' and 'oya'
        string goodString = "My name is Inigo Montoya!";

        string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
        string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string

        XElement x1 = new XElement("test", back1);
        XElement x2 = new XElement("test", back2);
        XElement x3WithBadString = new XElement("test", badString);

        string xml1 = x1.ToString();
        string xml2 = x2.ToString();

        string xmlShouldFail = x3WithBadString.ToString();
    }

// --- CODE --- (I have these methods in a static utility class called XML)

    /// <summary>
    /// Determines if any invalid XML 1.0 characters exist within the string,
    /// and if so it returns a new string with the invalid chars removed, else 
    /// the same string is returned (with no wasted StringBuilder allocated, etc).
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">The index to begin checking at.</param>
    public static string ToValidXmlCharactersString(string s, int startIndex = 0)
    {
        int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
        if (firstInvalidChar < 0)
            return s;

        startIndex = firstInvalidChar;

        int len = s.Length;
        var sb = new StringBuilder(len);

        if (startIndex > 0)
            sb.Append(s, 0, startIndex);

        for (int i = startIndex; i < len; i++)
            if (IsLegalXmlChar(s[i]))
                sb.Append(s[i]);

        return sb.ToString();
    }

    /// <summary>
    /// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">Start index.</param>
    public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
    {
        if (s != null && s.Length > 0 && startIndex < s.Length) {

            if (startIndex < 0) startIndex = 0;
            int len = s.Length;

            for (int i = startIndex; i < len; i++)
                if (!IsLegalXmlChar(s[i]))
                    return i;
        }
        return -1;
    }

    /// <summary>
    /// Indicates whether a given character is valid according to the XML 1.0 spec.
    /// This code represents an optimized version of Tom Bogle's on SO: 
    /// https://mcmap.net/q/581812/-c-registry-to-xml-invalid-character-issue.
    /// </summary>
    public static bool IsLegalXmlChar(char c)
    {
        if (c > 31 && c <= 55295)
            return true;
        if (c < 32)
            return c == 9 || c == 10 || c == 13;
        return (c >= 57344 && c <= 65533) || c > 65535;
        // The final "c > 65535" comparison can never be true for a 16-bit char. It only matters if c is
        // widened to an int code point (e.g. from Char.ConvertToUtf32), which can never exceed 1114111.
    }

======== ======== ========

Write XElement.ToString directly

======== ======== ========

First, the usage of this extension method:

string result = xelem.ToStringIgnoreInvalidChars();

-- Fuller test --

    public static void TestXmlCleanser()
    {
        string badString = "My name is Inigo Mont\u001Eoya"; // invisible bad char (U+001E) sits between 'Mont' and 'oya'

        XElement x = new XElement("test", badString);

        string xml1 = x.ToStringIgnoreInvalidChars();                               
        //result: <test>My name is Inigo Montoya</test>

        string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
        //result: <test>My name is Inigo Mont&#x1E;oya</test>
    }

--- code ---

    /// <summary>
    /// Writes this XML to string while allowing invalid XML chars to either be
    /// simply removed during the write process, or else encoded into entities, 
    /// instead of having an exception occur, as the standard XmlWriter.Create 
    /// XmlWriter does (which is the default writer used by XElement).
    /// </summary>
    /// <param name="xml">XElement.</param>
    /// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
    /// <param name="indent">Indent setting.</param>
    /// <param name="indentChar">Indent char (leave null to use default)</param>
    public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
    {
        if (xml == null) return null;

        StringWriter swriter = new StringWriter();
        using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {

            // -- settings --
            // unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
            writer.Formatting = indent ? Formatting.Indented : Formatting.None;

            if (indentChar != null)
                writer.IndentChar = (char)indentChar;

            // -- write --
            xml.WriteTo(writer); 
        }

        return swriter.ToString();
    }

-- this uses the following XmlTextWriter --

public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
    public bool DeleteInvalidChars { get; set; }

    public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
    {
        DeleteInvalidChars = deleteInvalidChars;
    }

    public override void WriteString(string text)
    {
        if (text != null && DeleteInvalidChars)
            text = XML.ToValidXmlCharactersString(text);
        base.WriteString(text);
    }
}
Welcher answered 2/7, 2015 at 22:40 Comment(0)
N
3

I'm on the receiving end of @parapurarajkumar's solution, where the illegal characters are being properly loaded into XmlDocument, but breaking XmlWriter when I'm trying to save the output.

My Context

I'm looking at exception/error logs from the website using Elmah. Elmah returns the state of the server at the time of the exception, in the form of a large XML document. For our reporting engine I pretty-print the XML with XmlWriter.

During a website attack, I noticed that some XML documents weren't parsing and I was receiving this exception: '.', hexadecimal value 0x00, is an invalid character.

NON-RESOLUTION: I converted the document to a byte[] and sanitized it of 0x00, but it found none.

When I scanned the xml document, I found the following:

...
<form>
...
<item name="SomeField">
   <value
     string="C:\boot.ini&#x0;.htm" />
 </item>
...

The nul byte was there, encoded as the character reference &#x0;!

RESOLUTION: To fix the encoding, I replaced the &#x0; value before loading the text into my XmlDocument, because loading it creates the actual nul character, which is then difficult to sanitize out of the object. Here's my entire process:

XmlDocument xml = new XmlDocument();
details.Xml = details.Xml.Replace("&#x0;", "[0x00]");  // in my case I wanted to see it, otherwise just replace with ""
xml.LoadXml(details.Xml);

string formattedXml = null;

// I stuff this all in a helper function, but put it in-line for this example
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings {
    OmitXmlDeclaration = true,
    Indent = true,
    IndentChars = "\t",
    NewLineHandling = NewLineHandling.None,
};
using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
    xml.Save(writer);
    formattedXml = sb.ToString();
}

LESSON LEARNED: if your incoming data is entity-encoded on entry, sanitize for illegal characters by looking for the corresponding character reference, not just the raw byte.
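The same idea can be generalized beyond &#x0;. The following is only a sketch: `XmlEntitySanitizer` and `ReplaceInvalidCharRefs` are hypothetical names, and it treats the XML as plain text, so it would also rewrite matching sequences inside CDATA sections or comments, where they are literal text rather than references.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class XmlEntitySanitizer
{
    // Matches numeric character references such as &#x0; or &#31;.
    private static readonly Regex NumericRef =
        new Regex(@"&#(?:x(?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+));");

    // Replaces references to code points that XML 1.0 forbids with the given
    // text; references to legal characters (e.g. &#x41;) are left untouched.
    public static string ReplaceInvalidCharRefs(string xml, string replacement = "")
    {
        return NumericRef.Replace(xml, m =>
        {
            int cp;
            bool parsed;
            if (m.Groups["hex"].Success)
                parsed = int.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                                      CultureInfo.InvariantCulture, out cp);
            else
                parsed = int.TryParse(m.Groups["dec"].Value, NumberStyles.None,
                                      CultureInfo.InvariantCulture, out cp);

            // Values that do not parse (overflow) are treated as invalid.
            return parsed && IsLegalXmlCodePoint(cp) ? m.Value : replacement;
        });
    }

    // The Char production from the XML 1.0 spec, expressed on code points.
    private static bool IsLegalXmlCodePoint(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Calling `XmlEntitySanitizer.ReplaceInvalidCharRefs(details.Xml, "[0x00]")` before `LoadXml` would play the role of the single `Replace` call above, while also covering references such as &#x1; or &#31;.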

Naima answered 24/10, 2013 at 17:36 Comment(0)
H
1

There is a generic solution that works nicely:

public class XmlTextTransformWriter : System.Xml.XmlTextWriter
{
    public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
    public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
    public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }

    public Func<string, string> TextTransform = s => s;

    public override void WriteString(string text)
    {
        base.WriteString(TextTransform(text));
    }

    public override void WriteCData(string text)
    {
        base.WriteCData(TextTransform(text));
    }

    public override void WriteComment(string text)
    {
        base.WriteComment(TextTransform(text));
    }

    public override void WriteRaw(string data)
    {
        base.WriteRaw(TextTransform(data));
    }

    public override void WriteValue(string value)
    {
        base.WriteValue(TextTransform(value));
    }
}

Once this is in place, you can derive a writer from it as follows:

public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
    public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }

    void SetTransform()
    {
        TextTransform = XmlUtil.RemoveInvalidXmlChars;
    }
}

where XmlUtil.RemoveInvalidXmlChars is defined as follows:

    public static string RemoveInvalidXmlChars(string content)
    {
        if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
            return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
        else
            return content;
    }
Huoh answered 26/11, 2019 at 14:20 Comment(0)
H
0

Can't the string be cleaned with:

System.Net.WebUtility.HtmlDecode()

?

Hyperemia answered 16/4, 2015 at 7:3 Comment(0)
