OpenXML escaping illegal characters
Asked Answered
G

3

8

I am doing some string replacement within a Word Docx file using OpenXML Power Tools and it is working as expected. However things break when I have invalid characters in the substitution such as ampersand, so for instance "Harry & Sally" will break and produce an invalid document. According to this post illegal characters need to be converted to xHHHH.

I am having trouble finding the contents to the OOXML clause mentioned in the post and hence escaping characters appropriately.

I am hoping someone either has some code or insights into exactly what characters need to be escaped. I was also hopeful OpenXML Power Tools could do this for me in some way, but I cannot seem to find anything in there either.

Gwynethgwynne answered 29/3, 2015 at 1:16 Comment(1)
If something does not work as expected, you probably have some code to show. Please add this code piece to your question.Emphasize
N
10

The specification is just talking about the standard set of characters that have to be escaped in XML. The XML specification mentioned in the linked post is the one from the W3C, found here.

There are five characters that need to be escaped anywhere they appear in XML data (names, values, etc) unless they are part of a CDATA section. According to Section 2.4:

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".

In other words, escape the following characters:

' -> &apos;
" -> &quot;
> -> &gt;
< -> &lt;
& -> &amp;

Typically, you wouldn't encode these as xHHHH, you'd use the XML entities listed above, but either is allowed. You also don't need to encode quotes or the right-angle bracket in every case, only when they would otherwise represent XML syntax, but it's usually safer to do it all the time.

The XML specification also includes the list of every Unicode character that can appear in an XML document, in section 2.2:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That list includes basically every Unicode character in the Basic plane (every one you're likely to run into), except for the control characters. Only the tab, CR, and LF characters are allowed -- any other character below ASCII 32 (space) needs to be escaped.

The big gap in the list (0xD800-0xDFF) is for surrogate encoding values, which shouldn't appear by themselves anyway, as they're not valid characters. The last two, 0xFFFE and 0xFFFF, are also not valid characters.

Nath answered 29/3, 2015 at 1:25 Comment(2)
You may want to add the missing semi colons for apostrophe and quoteGwynethgwynne
@Gwynethgwynne I've added themBroca
I
0

I created an extension method with help from Michael Edenfield's answer. Pretty self explanatory... just make sure you replace the ampersands first! Otherwise you will end up replacing your other escaped symbols by mistake.

public static string EscapeXmlCharacters(this string input)
{
    switch (input)
    {
        case null: return null;
        case "": return "";
        default:
        {
            input = input.Replace("&", "&amp;")
                .Replace("'", "&apos;")
                .Replace("\"", "&quot;")
                .Replace(">", "&gt;")
                .Replace("<", "&lt;");

            return input;
        }
    }
}

.NET Fiddle: https://dotnetfiddle.net/PCqffy

Iorgos answered 11/2, 2022 at 0:33 Comment(0)
L
0

I had the same issue today but replacing the & with &amp; or &#038; did not work. Id showed in my Excel file like Harry &amp; Sally I have checked with the open SDK productivity tool what Excel 2016 is really using when you save Harry & Sally into a cell. And it is stored in SharedStringTable as "Harry & Sally" So I end up with two differend funtions - one used for Numeric, Data and Time columns and one used for all Char / String columns, where I do not replace the &. And the result is as it should be so far. I have to export many data from an CRM system into several Excel files. The funtion in general is take from the documentation OpenSDKXML SDK. I only modified the regex string in the second function.

private static string ReplaceHexadecimalSymbols(string txt)
{
    string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
    return Regex.Replace(txt, r, "", RegexOptions.Compiled);
}

private static string ReplaceHexadecimalSymbolsinString(string txt)
{
    string r = "[\x00-\x08\x0B\x0C\x0E-\x1F]";
    return Regex.Replace(txt, r, "", RegexOptions.Compiled);
}
Lona answered 17/4, 2023 at 10:26 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.