What are invalid characters in XML
D

16

291

I am working with some XML that holds strings like:

<node>This is a string</node>

Some of the strings that I am passing to the nodes will have characters like &, #, $, etc.:

<node>This is a string & so is this</node>

This is not valid due to &.

I cannot wrap these strings in CDATA as they need to be as they are. I tried looking for a list of characters that cannot be put in XML nodes without being in a CDATA.

Can someone point me in the direction of one or provide me with a list of illegal characters?

Dextrous answered 8/4, 2009 at 13:55 Comment(3)
Any valid reason for not using CDATA?Conversazione
Yes, I am passing the string to a CMS called Fatwire and the node with the data cannot be in a CDATA. I'm not sure why; it's just the way Fatwire works :(Dextrous
@Peter: How can I use CDATA in my case? stackoverflow.com/questions/6906705/…Codycoe
C
166

The only illegal characters are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed').

They're escaped using XML entities, in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it.
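As a rough illustration of the "let a library write it" advice, here is a minimal Java sketch using the JDK's StAX writer. The class name `XmlWriterDemo` and the helper `element` are made up for this example. The writer takes care of escaping markup characters such as &; it should not be assumed to reject invalid control characters.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class XmlWriterDemo {
    // Hypothetical helper: serializes <name>text</name> and lets the
    // StAX writer escape markup characters in the text.
    static String element(String name, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCharacters(text); // escaping of & and < happens here
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(element("node", "This is a string & so is this"));
    }
}
```

The literal & from the question ends up as &amp; in the output without any hand-written escaping code.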

Calcicole answered 8/4, 2009 at 13:59 Comment(13)
[And ‘>’ doesn't always have to be escaped, either, although it's probably easiest to do so. It's only the string ‘]]>’ that's invalid (in element content). Bit of a strange wart really.]Straightlaced
Some control characters are also not allowed. See my answer below.Buttaro
Actually that's not quite true. A number of lower ascii characters are invalid also. If you try to write 0x03 to an Xml document you get an error typically and if you do manage to properly escape it into an XML document, most viewers will complain about the invalid character. Edge case but it does happen.Dispense
0x1f is also an invalid character in XML 1.0. It's valid though in XML 1.1.Solenne
also 0x0B, or "\v", a vertical tab.Rehm
This answer is absolutely wrong. Here is my XML exception with 0x12 illegal character 'System.Xml.XmlException: '', hexadecimal value 0x12, is an invalid character'Tammietammuz
It's also wrong in the other direction; as well as missing every single illegal character, the characters it does claim are illegal are perfectly legal, albeit with special meaning in the context.Stonyhearted
In XML 1.0 there are many illegal characters. In fact even using a character entity for most control characters will cause an error when parsing.Tadpole
° this character is not serializable correctly in GWT thus not valid xml character.Filemon
Those are not the only illegal characters. For example a vertical tab character will break an xml parser.Unpaged
Also, > is not illegal as long as it does not follow ]].Vevina
"Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it." Not really, xml is simple enough that you can write this on your own and be more in control of your data.Evesham
can this answer no longer be the accepted answer since the highest voted answer is betterSelfsupport
F
302

OK, let's separate the question of the characters that:

  1. aren't valid at all in any XML document.
  2. need to be escaped.

The answer provided by @dolmen in "https://mcmap.net/q/18226/-what-are-invalid-characters-in-xml/5110103#5110103" is still valid but needs to be updated with the XML 1.1 specification.

1. Invalid characters

The productions below list every character that is allowed in an XML document; any character outside them is invalid.

1.1. In XML 1.0

The global list of allowed characters is:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Basically, control characters other than tab (#x9), line feed (#xA) and carriage return (#xD) are not allowed, and neither are the surrogate blocks, #xFFFE and #xFFFF. This also means that using a character reference such as &#x3; is forbidden.
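The XML 1.0 Char production can be written down almost literally as a predicate. The following Java sketch does that; the class and method names (`Xml10Chars`, `isXml10Char`, `isValidXml10`) are invented for the example. It works on code points rather than on 16-bit `char` values so that the #x10000-#x10FFFF range is handled and unpaired surrogates are rejected.

```java
final class Xml10Chars {
    // Direct transcription of XML 1.0 production [2] Char.
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // True if every code point of the string matches Char.
    // An unpaired surrogate shows up as a code point in D800-DFFF and fails.
    static boolean isValidXml10(String text) {
        return text.codePoints().allMatch(Xml10Chars::isXml10Char);
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10("This is a string & so is this")); // true
        System.out.println(isValidXml10("bad" + (char) 0x03 + "char"));     // false
    }
}
```

Note that & and < pass this check: they are legal characters that merely need escaping, which is the second half of this answer.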

1.2. In XML 1.1

The global list of allowed characters is:

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

This revision of the XML recommendation extends the allowed characters so that control characters are allowed, and takes into account a newer revision of the Unicode standard. The characters matched by RestrictedChar may only appear as character references (for example &#x1;), and these ones are still not allowed at all: NUL (x00), xFFFE, xFFFF...

However, the use of control characters and undefined Unicode characters is discouraged.

Note also that not all parsers take this into account, so XML documents containing control characters may be rejected.
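For XML 1.1 the two productions above can be sketched in the same way. This Java example is only an illustration: `Xml11Chars` and its methods are hypothetical names, and `charRefIfRestricted` shows one possible way to emit a RestrictedChar as a numeric character reference. Escaping of & and < is a separate concern and is not handled here.

```java
final class Xml11Chars {
    // XML 1.1 production [2] Char.
    static boolean isXml11Char(int codePoint) {
        return (codePoint >= 0x1 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // XML 1.1 production [2a] RestrictedChar.
    static boolean isRestrictedChar(int codePoint) {
        return (codePoint >= 0x1 && codePoint <= 0x8)
            || (codePoint >= 0xB && codePoint <= 0xC)
            || (codePoint >= 0xE && codePoint <= 0x1F)
            || (codePoint >= 0x7F && codePoint <= 0x84)
            || (codePoint >= 0x86 && codePoint <= 0x9F);
    }

    // Hypothetical helper: restricted characters become a hexadecimal
    // character reference, other legal characters are returned unchanged,
    // and characters outside Char (NUL, FFFE, FFFF, surrogates) are refused.
    static String charRefIfRestricted(int codePoint) {
        if (!isXml11Char(codePoint)) {
            throw new IllegalArgumentException(
                "not an XML 1.1 character: 0x" + Integer.toHexString(codePoint));
        }
        if (isRestrictedChar(codePoint)) {
            return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
        }
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(charRefIfRestricted(0x1F)); // &#x1F;
        System.out.println(charRefIfRestricted('A'));  // A
    }
}
```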

2. Characters that need to be escaped (to obtain a well-formed document):

The < must be escaped with a &lt; entity, since it is assumed to be the beginning of a tag.

The & must be escaped with a &amp; entity, since it is assumed to be the beginning of an entity reference.

The > should be escaped with &gt; entity. It is not mandatory -- it depends on the context -- but it is strongly advised to escape it.

The ' should be escaped with a &apos; entity -- mandatory in attributes defined within single quotes but it is strongly advised to always escape it.

The " should be escaped with a &quot; entity -- mandatory in attributes defined within double quotes but it is strongly advised to always escape it.
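The five escaping rules above fit into a few lines of code. The following Java sketch is a minimal hand-rolled version, assuming the input contains only legal XML characters; `XmlEscaper`, `escapeText` and `escapeAttribute` are names chosen for this example, not part of any library.

```java
final class XmlEscaper {
    // Escapes text for element content: & and < are mandatory,
    // > is escaped too so the sequence ]]> can never appear.
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escapes text for an attribute value; both quote characters are
    // escaped so the result is safe with either delimiter.
    static String escapeAttribute(String text) {
        return escapeText(text).replace("\"", "&quot;").replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println("<node>" + escapeText("This is a string & so is this") + "</node>");
        System.out.println("<node attr=\"" + escapeAttribute("say \"hi\"") + "\"/>");
    }
}
```

The & is replaced first (inside `escapeText`), so the ampersands introduced by the quote entities are not escaped a second time.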

Fabozzi answered 26/1, 2015 at 14:59 Comment(4)
" but it is strongly advised to always escape it" - Could you clarify that bit? Who advises that, and why? (The way I see it, there's nothing wrong with using literal quotes wherever they are syntactically allowed.)Plastid
Shouldn't ' be escaped as &apos; instead ? w3.org/TR/REC-xml/#syntaxHumes
@Humes hey, I didn't notice the answer has been modified because I originally wrote to escape with &apos;. However both will work since numeric character reference are equally recognized w3.org/TR/REC-xml/#dt-charrefFabozzi
For 2.: see stackoverflow.com/questions/1091945/… for details. These 5 characters needn't always be escaped, just in some circumstances.Plural
B
179

The list of valid characters is in the XML specification:

Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Buttaro answered 24/2, 2011 at 20:34 Comment(7)
You should note that although they are legal characters, & < > " ' have to be escaped in certain contexts.Guanabana
"Legal" in this context means that their final decoded values are legal, not that they are legal in the stream. As above, some legal values have to be escaped in-stream.Schalles
I have an issue where 0x1c is an illegal character... Looking for a possibility in java how to avoid these....Clinquant
A nice overview which characters are valid and which are not can be found here validchar.com/d/xml10/xml10_namestartAlric
@xamde That list is nice, but it only shows the characters that may be used to start an XML element. The issue at hand is which characters are valid in an XML file in general. There are certain characters that are not allowed anywhere.Kailey
My answer is more complete, more visual, and still gives credit to others. - I wonder why you downvoted it ?Crinoline
I just ran a test where #x1 got written to an XML element perfectly legally, escaped as &#x1;. So what's illegal about it?Retrogression
S
64

This is C# code that removes the characters that are invalid in XML from a string and returns a new valid string.

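using System.Text.RegularExpressions;

public static string CleanInvalidXmlChars(string text)
{
    // From the XML spec, valid chars:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    // .NET regexes match UTF-16 code units and \u takes exactly four hex digits,
    // so [#x10000-#x10FFFF] is handled by keeping well-formed surrogate pairs
    // and removing only unpaired surrogates.
    string re = @"[^\x09\x0A\x0D\x20-\uFFFD]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}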
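For Java, a comparable filter can be sketched with `java.util.regex`. Java's regex engine matches whole code points and accepts the `\x{...}` notation (Java 7 and later), so the supplementary range can be written directly. The class name `XmlCharStripper` and the method `strip` are made up for this example.

```java
import java.util.regex.Pattern;

final class XmlCharStripper {
    // Negated character class built from the XML 1.0 Char production.
    // Anything that is not a legal character matches and gets removed,
    // including unpaired surrogates.
    private static final Pattern INVALID_XML10 = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    static String strip(String text) {
        return INVALID_XML10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "SUSITARIMO D" + (char) 0x05 + "L DARBO SUTARTIES";
        System.out.println(strip(dirty)); // the 0x05 control character is dropped
    }
}
```

This only removes raw characters. A character reference that is already written out in the text, such as the six characters of &#x5;, is ordinary text to the regex and stays in place.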
Spanish answered 14/1, 2013 at 17:31 Comment(7)
For Java, the regex pattern would be the same. And then you can use the method called replaceAll in the class String that expects a regex pattern as parameter. Check this: docs.oracle.com/javase/6/docs/api/java/lang/…Spanish
I have such invalid characters in my string: SUSITARIMO D&#x5;L DARBO SUTARTIES This code doesn't remove &#x5; So the xml document fails to init.Broddy
I believe you cannot just put this pattern into a .NET regex constructor. I don't think it recognizes \u10000 and \u10FFFF as single characters as they require two utf-16 char instances each, and according to the docs there might not be more that 4 digits. [\u10000-\u10FFFF] is most likely parsed as [\u1000, 0-\u10FF, F, F] which is weird looking but legal.Shutin
A better implementation that takes care of the utf-16 characters can be found here: https://mcmap.net/q/18406/-escape-invalid-xml-characters-in-cForeshadow
be careful to use this method, your valid UTF character will also be replaced with empty string, causing unexpected result on applicationWheelbarrow
Use XmlConvert.VerifyXmlChars check instead. var strInput = "बिजय Bijay"; var strOutput = ""; try { strOutput = XmlConvert.VerifyXmlChars(strInput); } catch { strOutput = new string(strInput.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray()); }Wheelbarrow
Java only supports 16bits so your regex pattern won't work for Java. Here is the same solution for Java https://mcmap.net/q/18226/-what-are-invalid-characters-in-xmlJollity
L
18

The predeclared characters are:

& < > " '

See "What are the special characters in XML?" for more information.

Linares answered 8/4, 2009 at 13:59 Comment(1)
Wrong. These are not all invalid. Only & and < are always invalid in the text.Vevina
C
12

In addition to potame's answer, here is what applies if you do want to use a CDATA block instead of escaping.

If you put your text in a CDATA block then you don't need to use escaping. In that case you can use all characters in the following range:

(Image: graphical representation of possible characters)

Note: On top of that, you're not allowed to use the ]]> character sequence, because it would match the end of the CDATA block.
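A common workaround for the ]]> restriction is to split the text into two adjacent CDATA sections at that point. The Java sketch below shows the idea; `CdataWrapper` and `toCdata` are hypothetical names, and the input is assumed to contain only legal XML characters.

```java
final class CdataWrapper {
    // Wraps text in a CDATA section. Every "]]>" in the text is split so
    // that "]]" ends one section and ">" starts the next one.
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(toCdata("a & b < c"));
        System.out.println(toCdata("x]]>y"));
    }
}
```

A parser concatenates the adjacent sections again, so the reader of the document sees the original text.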

If there are still invalid characters (e.g. control characters), then probably it's better to use some kind of encoding (e.g. base64).
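As an illustration of the base64 suggestion, here is a small Java sketch using `java.util.Base64` from the JDK. The class name `Base64Payload` is made up; the point is only that the encoded form consists of letters, digits, +, / and =, all of which are legal in XML and need no escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Payload {
    // Encodes arbitrary text (including control characters) into
    // characters that are always legal inside an XML element.
    static String encode(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode() on the receiving side.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "a" + (char) 0x01 + "b";
        String encoded = encode(raw);
        System.out.println("<node>" + encoded + "</node>");
        System.out.println(decode(encoded).equals(raw));
    }
}
```

The consumer of the XML has to know that the node content is base64 and decode it, so this only helps when both sides can agree on the convention.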

Crinoline answered 30/1, 2017 at 14:7 Comment(2)
Whether in a CDATA block or not, some characters are forbidden in XML.Buttaro
exactly, isn't that what I wrote ? quote: "all characters in the following range". By which I mean, only the characters in this specific range. Other characters are not allowed. - fully agree ; but I don't understand the downvote. - no hard feelings though.Crinoline
B
10

Another way to remove incorrect XML chars in C# is using XmlConvert.IsXmlChar (Available since .NET Framework 4.0)

public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

or you may check that all characters are XML-valid:

public static bool CheckValidXmlChars(string content)
{
   return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}

.Net Fiddle

For example, the vertical tab character (\v) is valid UTF-8 but is not a valid XML 1.0 character, and even many libraries (including libxml2) miss it and silently output invalid XML.
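One way to see this from the consuming side is to run the output through a parser. The following Java sketch uses the JDK's built-in DOM parser to test well-formedness; `WellFormedCheck` and `isWellFormed` are names invented for the example, and the behaviour described is that of the parser bundled with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    // Returns true if the parser accepts the document, false if it
    // reports a fatal error such as an invalid character.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<node>ok</node>"));
        System.out.println(isWellFormed("<node>a" + (char) 0x0B + "b</node>"));
    }
}
```

A check like this is handy in tests for code that produces XML, since it catches both unescaped markup characters and stray control characters.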

Berkshire answered 20/2, 2018 at 19:33 Comment(0)
A
6

Another easy way to escape potentially unwanted XML / XHTML chars in C# is:

WebUtility.HtmlEncode(stringWithStrangeChars)
Aires answered 19/2, 2014 at 10:1 Comment(3)
Invalid charactersButtaro
He wrote Xml not Html.Rawdin
This works with < > & " ' but not control charactersBearded
V
3

For Java folks, Apache has a utility class (StringEscapeUtils) that has a helper method escapeXml which can be used for escaping characters in a string using XML entities.

Vermont answered 18/9, 2014 at 12:43 Comment(0)
T
3

"XmlWriter and lower ASCII characters" worked for me

// No commas inside the character class: a comma there would also strip literal commas.
string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Theran answered 4/7, 2018 at 4:43 Comment(0)
V
2

In summary, valid characters in the text are:

  • tab, line-feed and carriage-return.
  • all non-control characters are valid except & and <.
  • > is not valid if following ]].

Sections 2.2 and 2.4 of the XML specification provide the answer in detail:

Characters

Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646

Character data

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

Vevina answered 24/10, 2018 at 14:41 Comment(0)
E
1

In the Woodstox XML processor, invalid characters are classified by this code:

if (c == 0) {
    throw new IOException("Invalid null character in text to output");
}
if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
    String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
    if (mXml11) {
        msg += " (can only be output using character entity)";
    }
    throw new IOException(msg);
}
if (c > 0x10FFFF) {
    throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
}
/*
 * Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
 * Ascii)?
 */
if (c >= SURR1_FIRST && c <= SURR2_LAST) {
    throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
}
throw new IOException("Invalid XML character (0x"+Integer.toHexString(c)+") in text to output");

Source from here

Exclude answered 3/12, 2014 at 10:27 Comment(0)
T
1
ampersand (&) is escaped to &amp;

double quotes (") are escaped to &quot;

single quotes (') are escaped to &apos; 

less than (<) is escaped to &lt; 

greater than (>) is escaped to &gt;

In C#, use System.Security.SecurityElement.Escape or System.Net.WebUtility.HtmlEncode to escape these illegal characters.

string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A  0x09 0x0A <node>";
string encodedXml1 = System.Security.SecurityElement.Escape(xml);
string encodedXml2= System.Net.WebUtility.HtmlEncode(xml);


encodedXml1
"&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"

encodedXml2
"&lt;node&gt;it&#39;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"
Tm answered 17/11, 2016 at 17:55 Comment(0)
J
0

Remove invalid characters (restricted + discouraged) in an XML document with Java

I had some trouble making a pattern from the XML 1.1 specification because it has characters that exceed 16 bits.

The issue with Java is that a char is always 16 bits, so code points beyond the first 64K of Unicode (i.e. the range 0x0000 to 0xFFFF) are non-BMP ("rare") Unicode characters and can't be expressed with a single \u literal. Luckily, starting with Java 7, we can use \x{foo}, where foo is the hexadecimal representation of the code point.

The code below removes all restricted and discouraged characters from a text.

static Pattern XMLCharInvalidPattern =
      Pattern.compile(
          "[\\x{1}-\\x{8}]|[\\x{B}-\\x{C}]|[\\x{E}-\\x{1F}]|[\\x{7F}-\\x{84}]|[\\x{86}-\\x{9F}]|[\\x{FDD0}-\\x{FDDF}]|[\\x{1FFFE}-\\x{1FFFF}]|[\\x{2FFFE}-\\x{2FFFF}]|[\\x{3FFFE}-\\x{3FFFF}]|[\\x{4FFFE}-\\x{4FFFF}]|[\\x{5FFFE}-\\x{5FFFF}]|[\\x{6FFFE}-\\x{6FFFF}]|[\\x{7FFFE}-\\x{7FFFF}]|[\\x{8FFFE}-\\x{8FFFF}]|[\\x{9FFFE}-\\x{9FFFF}]|[\\x{AFFFE}-\\x{AFFFF}]|[\\x{BFFFE}-\\x{BFFFF}]|[\\x{CFFFE}-\\x{CFFFF}]|[\\x{DFFFE}-\\x{DFFFF}]|[\\x{EFFFE}-\\x{EFFFF}]|[\\x{FFFFE}-\\x{FFFFF}]|[\\x{10FFFE}-\\x{10FFFF}]");

String invalidXmlText = "he\u0001ll\u0003o wo\uFDD0rl\u0084d";

String cleanXmlText = XMLCharInvalidPattern.matcher(invalidXmlText).replaceAll("");

// cleanXmlText = hello world
Jollity answered 28/10, 2023 at 1:28 Comment(0)
A
-2

Has anyone tried System.Security.SecurityElement.Escape(yourstring)? It replaces the XML special characters (<, >, &, " and ') in a string with their entity equivalents.

Angel answered 23/3, 2018 at 10:40 Comment(0)
S
-5

For XSL (on really lazy days) I use:

capture="&amp;(?!amp;)" capturereplace="&amp;amp;"

to translate all &-signs that aren't followed by amp; to proper ones.

We have cases where the input is in CDATA but the system which uses the XML doesn't take it into account. It's a sloppy fix, beware...

Septuor answered 17/6, 2013 at 15:36 Comment(1)
If it's sloppy, is it really necessary to post it here?Buttaro

© 2022 - 2024 — McMap. All rights reserved.