The following solution removes any invalid XML characters, and I think it does so about as efficiently as it can be done. In particular, it does not allocate a new StringBuilder or a new string unless the input has already been found to contain an invalid character. The hot spot is therefore a single for loop over the characters, where the check is often no more than two greater-than / less-than numeric comparisons per char. If no invalid character is found, the original string is simply returned. This is particularly helpful when the vast majority of strings are fine to begin with: they get in and out as quickly as possible, with no wasted allocations.
-- update --
See below for how to write an XElement that contains these invalid characters directly to a string; that approach relies on this code as well.
Some of this code was influenced by Mr. Tom Bogle's solution here. See also on that same thread the helpful information in the post by superlogical. All of those, however, still always instantiate a new StringBuilder and a new string.
USAGE:
string xmlStrBack = XML.ToValidXmlCharactersString("any string");
TEST:
public static void TestXmlCleanser()
{
    string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
    string goodString = "My name is Inigo Montoya!";
    string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
    string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string
    XElement x1 = new XElement("test", back1);
    XElement x2 = new XElement("test", back2);
    XElement x3WithBadString = new XElement("test", badString);
    string xml1 = x1.ToString();
    string xml2 = x2.ToString().Print();
    string xmlShouldFail = x3WithBadString.ToString();
}
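Because the bad character in the test above is invisible (and may not survive copy and paste), here is a minimal sketch that makes it explicit with a \u escape. I picked U+0008 (backspace) purely as an example of an invalid XML 1.0 character, and the names InvalidCharDemo and ThrowsOnToString are made up for this illustration. It only shows the failure the test calls xmlShouldFail: building the XElement succeeds, and the problem surfaces when the element is written out.

```csharp
using System;
using System.Xml.Linq;

public static class InvalidCharDemo // hypothetical name, illustration only
{
    // Returns true if serializing an element with this text content throws.
    public static bool ThrowsOnToString(string text)
    {
        // Constructing the element does not validate the characters.
        var x = new XElement("test", text);
        try
        {
            x.ToString(); // the default writer checks characters here
            return false;
        }
        catch (Exception)
        {
            return true;
        }
    }

    public static void Main()
    {
        // Assumption: U+0008 stands in for the invisible bad char.
        Console.WriteLine(ThrowsOnToString("My name is Inigo Mont\u0008oya"));
        Console.WriteLine(ThrowsOnToString("My name is Inigo Montoya!"));
    }
}
```

Running the cleanser on the first string before creating the XElement is what avoids the exception.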
// --- CODE --- (I have these methods in a static utility class called XML)
/// <summary>
/// Determines if any invalid XML 1.0 characters exist within the string,
/// and if so it returns a new string with the invalid chars removed, else
/// the same string is returned (with no wasted StringBuilder allocated, etc).
/// </summary>
/// <param name="s">Xml string.</param>
/// <param name="startIndex">The index to begin checking at.</param>
public static string ToValidXmlCharactersString(string s, int startIndex = 0)
{
    int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
    // Fast path: nothing to remove, so return the original instance.
    if (firstInvalidChar < 0)
        return s;

    startIndex = firstInvalidChar;
    int len = s.Length;
    var sb = new StringBuilder(len);
    // Copy the already-verified prefix in one call, then filter the rest.
    if (startIndex > 0)
        sb.Append(s, 0, startIndex);
    for (int i = startIndex; i < len; i++)
        if (IsLegalXmlChar(s[i]))
            sb.Append(s[i]);
    return sb.ToString();
}
/// <summary>
/// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
/// </summary>
/// <param name="s">Xml string.</param>
/// <param name="startIndex">Start index.</param>
public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
{
    if (s != null && s.Length > 0 && startIndex < s.Length) {
        if (startIndex < 0) startIndex = 0;
        int len = s.Length;
        for (int i = startIndex; i < len; i++)
            if (!IsLegalXmlChar(s[i]))
                return i;
    }
    return -1;
}
/// <summary>
/// Indicates whether a given character is valid according to the XML 1.0 spec.
/// This code represents an optimized version of Tom Bogle's on SO:
/// https://mcmap.net/q/581812/-c-registry-to-xml-invalid-character-issue.
/// </summary>
public static bool IsLegalXmlChar(char c)
{
    if (c > 31 && c <= 55295)
        return true;
    if (c < 32)
        return c == 9 || c == 10 || c == 13;
    return (c >= 57344 && c <= 65533) || c > 65535;
    // The final comparison (c > 65535) can never be true for a 16-bit char; it is useful
    // only if c is changed to an int code point (char c -> int c), e.g. for UTF-32.
    // An upper bound (c <= 1114111) is not checked, because Char.ConvertToUtf32 would
    // already have thrown an exception for a code point bigger than 1114111.
}
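One consequence of checking a single char at a time: the surrogate range (55296 to 57343) is rejected, so characters outside the Basic Multilingual Plane (emoji, for example), which .NET stores as two surrogate chars, get removed even though XML 1.0 allows code points 0x10000 to 0x10FFFF. If that matters for your data, a pair-aware variant could look roughly like the sketch below. The names XmlPairAware, Clean and IsLegalBmpChar are made up for this illustration and are not part of the code above; it keeps the same idea of allocating a StringBuilder only once an invalid character is actually found.

```csharp
using System;
using System.Text;

public static class XmlPairAware // hypothetical name, illustration only
{
    // Same range test as IsLegalXmlChar above, for non-surrogate chars.
    static bool IsLegalBmpChar(char c)
    {
        if (c > 31 && c <= 55295) return true;
        if (c < 32) return c == 9 || c == 10 || c == 13;
        return c >= 57344 && c <= 65533;
    }

    // Removes invalid XML 1.0 chars but keeps well-formed surrogate pairs.
    // Returns the original instance when nothing has to be removed.
    public static string Clean(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;
        StringBuilder sb = null; // allocated only at the first invalid char
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            bool pair = i + 1 < s.Length && char.IsSurrogatePair(c, s[i + 1]);
            if (pair || IsLegalBmpChar(c))
            {
                if (sb != null) sb.Append(c);
                if (pair)
                {
                    i++; // consume the low surrogate as well
                    if (sb != null) sb.Append(s[i]);
                }
            }
            else if (sb == null)
            {
                // First invalid char: copy the clean prefix, skip this char.
                sb = new StringBuilder(s.Length);
                sb.Append(s, 0, i);
            }
        }
        return sb == null ? s : sb.ToString();
    }

    public static void Main()
    {
        string clean = "ok \U0001F600"; // contains a valid surrogate pair
        Console.WriteLine(object.ReferenceEquals(clean, Clean(clean)));
        Console.WriteLine(Clean("a\u0008b") == "ab"); // control char removed
        Console.WriteLine(Clean("a\uD83Db") == "ab"); // lone surrogate removed
    }
}
```

The trade-off is one extra look-ahead per surrogate char; for plain BMP text the loop stays close to the original single range check.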
======== ======== ========
Write XElement.ToString directly
======== ======== ========
First, the usage of this extension method:
string result = xelem.ToStringIgnoreInvalidChars();
-- Fuller test --
public static void TestXmlCleanser()
{
    string badString = "My name is Inigo Montoya"; // you may not see it, but bad char is in 'MontXoya'
    XElement x = new XElement("test", badString);
    string xml1 = x.ToStringIgnoreInvalidChars();
    //result: <test>My name is Inigo Montoya</test>
    string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
    //result: <test>My name is Inigo Montoya</test>
}
--- code ---
/// <summary>
/// Writes this XML to string while allowing invalid XML chars to either be
/// simply removed during the write process, or else encoded into entities,
/// instead of having an exception occur, as the standard XmlWriter.Create
/// XmlWriter does (which is the default writer used by XElement).
/// </summary>
/// <param name="xml">XElement.</param>
/// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
/// <param name="indent">Indent setting.</param>
/// <param name="indentChar">Indent char (leave null to use default)</param>
public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
{
    if (xml == null) return null;
    StringWriter swriter = new StringWriter();
    using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {
        // -- settings --
        // Unfortunately writer.Settings is null and cannot be set, so we can't specify
        // options such as newLineOnAttributes or omitXmlDeclaration.
        writer.Formatting = indent ? Formatting.Indented : Formatting.None;
        if (indentChar != null)
            writer.IndentChar = (char)indentChar;
        // -- write --
        xml.WriteTo(writer);
    }
    return swriter.ToString();
}
-- this uses the following XmlTextWriter --
public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
    public bool DeleteInvalidChars { get; set; }

    public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
    {
        DeleteInvalidChars = deleteInvalidChars;
    }

    public override void WriteString(string text)
    {
        if (text != null && DeleteInvalidChars)
            text = XML.ToValidXmlCharactersString(text);
        base.WriteString(text);
    }
}