What characters do I need to escape in XML documents?
B

10

1132

What characters must be escaped in XML documents, or where could I find such a list?

Balsa answered 7/7, 2009 at 12:7 Comment(6)
Example: <company>AT&amp;T</company>Complacency
See Simplified XML Escaping below for a concise and easily remembered guide that I've distilled from primary sources (W3C Extensible Markup Language (XML) 1.0 (Fifth Edition)).Gauzy
Literally none of the answers here are correct. You also must escape many various control characters in XML 1.1.Mllly
@JasonC: Understanding the question as intended rather than literally is ideal. If you feel future readers would benefit from an elaboration of how to specify control characters in XML, please elaborate in an answer. Thanks.Gauzy
@Gauzy With the question being interpreted as intended, literally none of the answers here are correct. You also must escape many various control characters in XML 1.1, as outlined here. See also XML 1.1 §4.1, §4.4, §4.6, and Appx. C for specific details and restrictions.Mllly
@JasonC: I've updated Simplified XML Escaping below to address your point. Let me know if you have further recommendations. Thanks.Gauzy
T
1641

If you use an appropriate class or library, it will do the escaping for you. Many XML issues are caused by building documents through string concatenation.
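
As a rough illustration of letting a library do the escaping, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The element name, attribute name, and sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET

# Build the document as a tree instead of concatenating strings.
company = ET.Element("company", attrib={"motto": 'Say "hi" & <wave>'})
company.text = "AT&T <Labs>"

# The serializer escapes the markup-significant characters on output.
xml_text = ET.tostring(company, encoding="unicode")
print(xml_text)

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(xml_text)
print(parsed.text)
print(parsed.get("motto"))
```

The calling code never sees an entity: it works with plain strings, and escaping happens only at the serialization boundary.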

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>
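
For text content, one possible approach in Python is `xml.sax.saxutils.escape`, which by default escapes only `&`, `<` and `>` and accepts an optional mapping for anything extra. The sample string below is invented for illustration.

```python
from xml.sax.saxutils import escape, unescape

raw = 'Tom & "Jerry" <tom@example.com> say it\'s > fine'

# Default behaviour: quotes are left alone, which is fine in text content.
minimal = escape(raw)
print(minimal)

# Optional: escape all five characters, the "safe way" described above.
all_five = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(all_five)

# unescape() reverses the process when given the matching extra entities.
restored = unescape(all_five, {"&quot;": '"', "&apos;": "'"})
```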

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
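
The quote-selection rules above are what `xml.sax.saxutils.quoteattr` in Python's standard library implements. The following is a small sketch; `render` is a hypothetical helper introduced only for this example.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

def render(value):
    """Return a <valid/> element whose attribute holds `value`."""
    # quoteattr() picks the delimiter and escapes &, <, > and, if needed, ".
    return "<valid attribute=%s/>" % quoteattr(value)

for value in ["it's", 'say "hi"', 'both " and \'', "a < b & c > d"]:
    element = render(value)
    print(element)
    # Round-trip through a parser to confirm the value survives intact.
    assert ET.fromstring(element).get("attribute") == value
```

`quoteattr` only falls back to `&quot;` when the value contains both kinds of quote; otherwise it switches the delimiter instead of escaping.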

Comments

Do not escape any of the five special characters in comments, where entity references are not expanded:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

Do not escape any of the five special characters in CDATA sections, where entity references are not expanded:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
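
One thing a CDATA section cannot contain is its own terminator, `]]>`. A common workaround is to split that sequence across two adjacent CDATA sections. The sketch below assumes a hypothetical helper named `to_cdata` and an invented payload.

```python
import xml.etree.ElementTree as ET

def to_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' across two sections."""
    # "]]" closes out the first section, ">" opens the second one.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

payload = 'if (a[b[0]]>c && d<e) { print("\'"); }'
document = "<valid>" + to_cdata(payload) + "</valid>"
print(document)

# A parser joins the adjacent sections back into the original text.
recovered = ET.fromstring(document).text
```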

Processing instructions

Do not escape any of the five special characters in XML processing instructions:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

Turtle answered 7/7, 2009 at 12:9 Comment(12)
But as for HTML, we would only have to escape the five above too right?Aphonia
@Pacerier, I beg you not to write your own XML/HTML escaping code. Use a library function or you're bound to miss a special case.Responsible
Also for line breaks you need to use &#xA; &#xD; and &#x9; for tab, if you need these characters in an attribute.Shakespeare
Carriage Return &#xD; is only included for backward compatibility, as noted in the section that precedes the one linked to by MicSim. Avoid using it, as it is either removed or replaced by &#xA;.Infinitude
If you're going to do a Find/Replace on these, just remember to do the &amp; replacement before the others.Huneycutt
@Huneycutt I was just about to mention the exact same thing - or else all other replaced characters will be corrupted, and things like &quot; will be changed to &amp;quot;Beaujolais
Notice, that in HTML you actually just have to escape < and &. While the other three are also defined, there is actually no need to escape them within valid XMLNiemeyer
From Wikipedia: "All permitted Unicode characters may be represented with a numeric character reference." So there are a lot more than 5.Eichman
@Niemeyer I found the same to be true in my testing. I escaped all 5 originally to be safe, even though the ampersand alone was the original target for the bug. Upon further testing I was finding that the apostrophe for example was making it through to the application with no problems at all.Midsummer
You can escape any characters you want -- even every character. Only less-than, ampersand, and the sequence "]]>" actually matter if you're trying to turn an arbitrary string into XML content (that is, you don't want any tags or other XML constructs to be detected within it). "]]>" is uncommon, so some people ignore it; or you can change the ">" in it to &gt; or &#62; or &#x3e;Putman
Non-printable characters seem to be not legal in an XML file, too.Mounts
@Responsible I looked up source code of "xml-escape" lib for js. It has 22 lines of code and covers exactly 5 chars. Seems trivial enough. But that's just the actual XML. HTML is different animal altogether.Horning
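
The find/replace ordering issue raised in the comments above can be sketched in a few lines of Python. `escape_in_order` and the two replacement lists are hypothetical names for this illustration; a library function is still the safer choice in practice.

```python
AMP_FIRST = [
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
    ("'", "&apos;"),
]
# Same replacements, but with the ampersand handled last.
AMP_LAST = AMP_FIRST[1:] + AMP_FIRST[:1]

def escape_in_order(text, replacements):
    """Apply each (character, entity) replacement in the given order."""
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text

sample = 'a < "b"'
correct = escape_in_order(sample, AMP_FIRST)
corrupted = escape_in_order(sample, AMP_LAST)
print(correct)    # a &lt; &quot;b&quot;
print(corrupted)  # a &amp;lt; &amp;quot;b&amp;quot;
```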
A
99

Perhaps this will help:

List of XML and HTML character entity references:

In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.

That article lists the following five predefined XML entities:

quot  "
amp   &
apos  '
lt    <
gt    >
Autochthonous answered 7/7, 2009 at 12:9 Comment(0)
G
99

New, simplified answer to an old, commonly asked question...

Simplified XML Escaping (prioritized, 100% complete)

  1. Always (90% important to remember)

    • Escape < as &lt; unless < is starting a <tag/> or other markup.
    • Escape & as &amp; unless & is starting an &entity;.
  2. Attribute Values (9% important to remember)

    • attr=" 'Single quotes' are ok within double quotes."
    • attr=' "Double quotes" are ok within single quotes.'
    • Escape " as &quot; and ' as &apos; otherwise.
  3. Comments, CDATA, and Processing Instructions (0.9% important to remember)

    • <!-- Within comments --> nothing has to be escaped but no -- strings are allowed.
    • <![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
    • <?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.
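  4. Esoterica (0.1% important to remember)

    • Escape control characters in XML 1.1 via numeric character references.
    • Escape ]]> as ]]&gt; unless ]]> is ending a CDATA section.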
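
As a minimal sketch of rules 1 and 2 in Python (plus the `]]>` sequence in content), assuming hypothetical helpers `escape_text` and `escape_attr`. It deliberately ignores control characters and attribute whitespace normalization, so treat it as an illustration rather than a replacement for a library.

```python
import xml.etree.ElementTree as ET

def escape_text(text):
    """Escape element content: &, <, and the sequence ]]>."""
    # & must be replaced first so later entities are not double-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace("]]>", "]]&gt;")

def escape_attr(value):
    """Escape an attribute value and wrap it in double quotes."""
    escaped = value.replace("&", "&amp;").replace("<", "&lt;")
    # Single quotes are fine inside double-quote delimiters.
    return '"' + escaped.replace('"', "&quot;") + '"'

text = 'x ]]> y & "z" < \'w\' >'
value = 'a & b < "c" \'d\' >'
document = "<e a=%s>%s</e>" % (escape_attr(value), escape_text(text))
print(document)

parsed = ET.fromstring(document)
```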

Gauzy answered 9/10, 2017 at 1:54 Comment(9)
One other rule worth noting: ]]> must be escaped as ]]&gt;, even when not in a CDATA section. The easiest way of achieving that may be to always escape > as &gt;.Theo
Thanks, @MichaelKay. I've incorporated your helpful note about ]]> but chose to relegate it to esoterica rather than suggesting that > always be escaped (which it needn't be, as you know). My goal here is to make the XML escaping rules easily remembered and 100% accurate.Gauzy
The answers above, including the accepted one, say that all five characters should be escaped inside attributes. Do you have a reference to the XML standard to back up what you are saying? Your answer logically seems to be the correct one.Overshadow
@RomanSusi: Yes, many other answers contain errors or overgeneralizations ("The safe way...") based on hearsay, misinterpretation, or misunderstanding of the official XML BNF. My answer is (a) 100% justified by W3C XML Recommendation; see the many linked references to the official BNF, and (b) organized in a concise, logical, and easily remembered progression of those requirements.Gauzy
@RomanSusi: The specific statement that "all five characters should be escaped inside attributes" is sloppy guidance unsupported by the official BNF rule for AttValue cited in my answer via a link on 2. Attribute Values.Gauzy
Ah ok... I was actually looking whether & needs to be escaped, so missed Always part, thanks!Overshadow
I think I should change my future first child name from Felipe to ";'Felipe]]><PLAINTEXT> <!-- and see what happens to most websitesGlazer
and in the event of second child, maybe her name can be just </script>\0Glazer
@FelipeValdes: Conformant XML parsers will reject documents as not well-formed when they contain ]]> anywhere other than ending a CDATA section or the null char anywhere in a document. What browsers will do over time and the impact on your children's development are less clear.Gauzy
K
84

According to the specifications of the World Wide Web Consortium (W3C), there are 5 characters that must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. In all the other cases, these characters must be replaced either using the corresponding entity or the numeric reference according to the following table:

Original character   XML entity replacement   XML numeric replacement
<                    &lt;                     &#60;
>                    &gt;                     &#62;
"                    &quot;                   &#34;
&                    &amp;                    &#38;
'                    &apos;                   &#39;

Notice that these entities can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends the use of &#39; instead.
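
To check that the entity and numeric forms in the table are interchangeable, here is a quick sketch with Python's `xml.etree.ElementTree`. The hexadecimal forms are an addition for this example and are not in the table above.

```python
import xml.etree.ElementTree as ET

# The same five characters written three different ways.
named = ET.fromstring("<e>&lt; &gt; &quot; &amp; &apos;</e>").text
decimal = ET.fromstring("<e>&#60; &#62; &#34; &#38; &#39;</e>").text
hexadecimal = ET.fromstring("<e>&#x3C; &#x3E; &#x22; &#x26; &#x27;</e>").text

# A conforming parser resolves all of them to the same characters.
print(named)
print(named == decimal == hexadecimal)
```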

Kimon answered 3/7, 2013 at 12:38 Comment(4)
XML predefines those five entities, but it absolutely does NOT specify that you can't use any of those five characters in their literal form. < and & have to be escaped everywhere (except CDATA). " and ' only have to be escaped in attribute values, and only if the corresponding quote character is the same. And > never actually has to be escaped.Indestructible
As written above, < > " & ' do not have to be escaped when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. i.e. when you use < > as an XML tag you don't escape it. Same thing for a comment (would you escape an & in a commented line of a XML file? You don't need to, and your XML is still valid if you don't). This is clearly specified in the official recommendations for XML by W3C.Kimon
@ShaunMcCance > must be escaped if it follows ]] within content, unless it's intended to be part of the ]]> delimiter that indicates the end of a CDATA section.Adonis
Not to be a necromancer, but @Kimon is incorrect in saying that these characters MUST be entitized in content. See section 2.4 at w3.org/TR/REC-xml/#NT-CharData. The TL;DR version of that is that in chardata element content, &amp; and &lt; have to always be entitized. The &gt; character MAY be entitized, although it MUST be when appearing in the literal string “]]>” because otherwise that will be read as ending a CDATA section. For single-quote and double-quote, you can escape if you want to. That's it, for chardata inside elements. Other components of XML have other rules.Amicable
K
54

Escaping differs between element content and attribute values.

For element content:

 < &lt;
 > &gt; (only for compatibility, read below)
 & &amp;

For attributes:

" &quot;
' &apos;

From Character Data and Markup:

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".

Kelbee answered 5/2, 2014 at 10:3 Comment(1)
This implies that for attributes only quotes need to be escaped, but that is in addition to the other three charactersTrickster
G
29

In addition to the commonly known five characters [<, >, &, ", and '], I would also escape the vertical tab character (0x0B). It is valid UTF-8, but not valid XML 1.0, and even many libraries (including the highly portable (ANSI C) library libxml2) miss it and silently output invalid XML.
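
Python's `xml.etree.ElementTree` appears to behave the same way: it serializes a vertical tab without complaint, and the result then fails to parse. The sketch below shows that, along with one possible way to drop characters outside the XML 1.0 `Char` production before serializing. `INVALID_XML10` and `strip_invalid_xml10` are hypothetical names, and removing the characters (rather than rejecting the input) is an assumption about what the application wants.

```python
import re
import xml.etree.ElementTree as ET

# Everything NOT matched by the XML 1.0 "Char" production.
INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml10(text):
    """Remove characters that XML 1.0 does not allow at all."""
    return INVALID_XML10.sub("", text)

element = ET.Element("e")
element.text = "tab\there, vertical tab\x0bhere"

# The serializer emits the vertical tab as-is ...
serialized = ET.tostring(element, encoding="unicode")
try:
    ET.fromstring(serialized)
    accepted = True
except ET.ParseError:
    # ... and the parser then rejects the document.
    accepted = False

# After cleaning, the document round-trips.
element.text = strip_invalid_xml10(element.text)
cleaned = ET.fromstring(ET.tostring(element, encoding="unicode")).text
```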

Glowing answered 25/4, 2012 at 13:38 Comment(0)
E
14

Abridged from: XML, Escaping

There are five predefined entities:

&lt; represents "<"
&gt; represents ">"
&amp; represents "&"
&apos; represents '
&quot; represents "

"All permitted Unicode characters may be represented with a numeric character reference." For example:

&#20013;
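
A short Python sketch of that reference in both directions: decimal 20013 is U+4E2D, and `xml.etree.ElementTree` falls back to numeric character references when serializing to its default ASCII encoding.

```python
import xml.etree.ElementTree as ET

# Parsing: the numeric reference resolves to the character U+4E2D.
parsed = ET.fromstring("<e>&#20013;</e>").text
print(parsed == "\u4e2d")

# Serializing: tostring() defaults to ASCII bytes, so a character that
# cannot be encoded is written as a numeric character reference.
element = ET.Element("e")
element.text = "\u4e2d"
ascii_bytes = ET.tostring(element)
print(ascii_bytes)
```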

Most of the control characters and other Unicode ranges are specifically excluded, meaning (I think) they can't occur either escaped or direct:

Valid characters in XML

Eichman answered 15/8, 2014 at 7:53 Comment(0)
C
7

The accepted answer is not correct. It is best to use a library for escaping XML.

As mentioned in this other question

"Basically, the control characters and characters out of the Unicode ranges are not allowed. This means also that calling for example the character entity is forbidden."

If you only escape the five characters, you can run into problems such as "An invalid XML character (Unicode: 0xc) was found".
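
The same kind of failure can be reproduced with Python's standard parser. In this sketch, `is_well_formed` is a hypothetical helper; it shows a form feed (0xC) being rejected in XML 1.0 both as a literal character and as a numeric character reference, while a tab is accepted.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

literal_form_feed = is_well_formed("<e>a\x0cb</e>")
escaped_form_feed = is_well_formed("<e>a&#12;b</e>")
escaped_tab = is_well_formed("<e>a&#9;b</e>")

print(literal_form_feed, escaped_form_feed, escaped_tab)
```

Escaping does not help here, so such characters have to be removed or encoded some other way before the text goes into the document.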

Catena answered 29/1, 2021 at 14:35 Comment(2)
Which library can be used?Vadavaden
Each language will be different. For Java, see this other Stack Overflow answer: stackoverflow.com/a/439311Catena
C
4

It depends on the context. In element content, escape < and &, and also ]]> (a three-character sequence rather than a single character).

For attribute values, it is <, &, ", and '.

For CDATA, it is ]]>.

Chambliss answered 4/6, 2015 at 14:36 Comment(0)
D
-9

Only < and & are required to be escaped if they are to be treated as character data and not markup:

2.4 Character Data and Markup

Democritus answered 2/4, 2014 at 14:17 Comment(0)
