Which are the HTML, and XML, special characters?

What are the special reserved character entities in HTML and in XML?

The information that I have says:

HTML:

  • & (replace with &amp;)
  • < (replace with &lt;)
  • > (replace with &gt;)
  • " (replace with &quot;)
  • ' (replace with &apos;)

XML:

  • < (replace with &lt;)
  • > (replace with &gt;)
  • & (replace with &amp;)
  • ' (replace with &apos;)
  • " (replace with &quot;)

But I cannot find documentation on either of these.

The W3C does mention, in Extensible Markup Language (XML) 1.0 (Fifth Edition), certain predefined entity references. But it says that these entities are predefined (in the same way that &copy; is predefined); not that they must be escaped:

4.6 Predefined Entities

[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references " &#60; " and " &#38; " may be used to escape < and & when they occur in character data.]

What characters must be escaped into entity references in HTML? What characters must be escaped into entity references in XML?


Update:

From Extensible Markup Language (XML) 1.0 (Fifth Edition):

2.4 Character Data and Markup

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

I read the above as saying:

Must be escaped:

  • < (as &lt;)
  • & (as &amp;)

May be escaped, but must be when it appears in the string ]]>:

  • > (as &gt;)

And that ' and " don't have to be escaped at all, unless you want to use a quote character inside an attribute value that is delimited by that same character.
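
As a minimal sketch of that reading (Python is used purely for illustration, and the helper name escape_xml_text is made up here, not taken from the spec or any library), escaping a string for XML element content might look like this:

```python
def escape_xml_text(text: str) -> str:
    """Escape a string for use as XML element content (hypothetical helper)."""
    # '&' has to go first; otherwise the '&' in "&lt;" would be escaped again.
    text = text.replace("&", "&amp;")
    # '<' must always be escaped in character data.
    text = text.replace("<", "&lt;")
    # '>' only has to be escaped where it would otherwise form "]]>".
    text = text.replace("]]>", "]]&gt;")
    return text


print(escape_xml_text("if (a < b && c > d) x = y[z[0]]>1;"))
```

Note that the lone > in "c > d" is left untouched; only the one that would complete ]]> is rewritten.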


From HTML 4.01 Specification, HTML Document Representation:

5.3.2 Character entity references

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter).

Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

HTML is much more wishy-washy on the rules, but it sounds like:

  • < should be replaced with &lt;
  • > should be replaced with &gt;
  • & should be replaced with &amp;
  • " should be replaced with &quot;

And if " can be replaced with an entity reference, I should also replace ' with &apos;.
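
For comparison, here is what one widely used implementation does. This is only an illustration using Python's standard html.escape, not something the HTML 4.01 text prescribes. By default it escapes all five characters, but it emits the numeric reference &#x27; for the apostrophe rather than a named entity:

```python
import html

sample = "<a href='x'>T&C \"quoted\"</a>"

# Default (quote=True): escapes &, <, > and both quote characters.
print(html.escape(sample))

# quote=False: escapes only &, < and >, which is enough for text between tags.
print(html.escape(sample, quote=False))
```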


Update Two

From HTML5 - A vocabulary and associated APIs for HTML and XHTML:

8.3 Serializing HTML fragments

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

Replace any occurrence of the "&" character by the string "&amp;".

Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".

If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".

Which I read as HTML:

  • & by &amp; always
  • U+00A0 (no-break space) by &nbsp; always
  • " by &quot; if it's inside an attribute
  • < by &lt; if it's not in an attribute (i.e. attributes can contain <)
  • > by &gt; if it's not in an attribute (i.e. attributes can contain >)
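
The quoted steps are mechanical enough to sketch directly. The following is a rough Python transcription under my reading of them; the function name and the attribute_mode parameter are my own labels, not names from the spec:

```python
def escape_html_string(s: str, attribute_mode: bool = False) -> str:
    """Apply the quoted HTML5 escaping steps in order (hypothetical helper)."""
    # Step 1: '&' first, so entities produced by later steps are not re-escaped.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE becomes &nbsp;.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: inside an attribute value only '"' is escaped.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: in text content '<' and '>' are escaped; quotes are left alone.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s


print(escape_html_string('a < b & "c"'))
print(escape_html_string('a < b & "c"', attribute_mode=True))
```

The two calls show the asymmetry: text mode leaves the quotes alone, attribute mode leaves the angle bracket alone.
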
Sergei answered 30/8, 2011 at 19:50 Comment(5)
You should really separate out your question from your answer. – Magree
I don't have an answer. I have research that may or may not be correct. Someone who knows the answer can post it, and people who know can vote it up if it is, in fact, correct. – Sergei
If the above isn't an answer, then you need to be a lot more clear on what you're looking for! – Magree
I found five different sources that say three different things. Someone who knows needs to decide which one of the sources is right, and which is wrong. – Sergei
Weird that HTML 4 and 5 say opposite things with regard to when you should escape > (in an attribute or not). – Patronymic

First, you're comparing an HTML 4.01 specification with an HTML5 one. HTML5 ties in more closely with XML than HTML 4.01 ever did (that's why we have XHTML), so this answer will stick to HTML5 and XML.

Your quoted references are all consistent on the following points:

  • < should always be represented with &lt; when not indicating a processing instruction
  • > should always be represented with &gt; when not indicating a processing instruction
  • & should always be represented with &amp;
  • except when within <![CDATA[ ]]> (which only applies to XML)

I agree 100% with this. You never want the parser to mistake literals for instructions, so it's a solid idea to always encode these characters (the no-break space is a separate matter; see below). Good parsers know that anything contained within <![CDATA[ ]]> is not an instruction, so the encoding is not necessary there.
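
To illustrate the CDATA point, here is a small Python sketch using the standard xml.etree.ElementTree parser. The wrap_cdata helper is hypothetical, and splitting ]]> across two sections is one common workaround rather than anything the quoted passages require:

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section (hypothetical helper)."""
    # "]]>" would end the section early, so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


code = "if (a < b && c[d[0]]> e) run();"
doc = "<script>" + wrap_cdata(code) + "</script>"

# The parser hands back the original text; no &lt; or &amp; was needed.
print(ET.fromstring(doc).text == code)
```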

In practice, I never encode ' or " unless

  • it appears within the value of an attribute (XML or HTML)
  • it appears within the text of XML tags. (<tag>&quot;Yoinks!&quot;, he said.</tag>)

Both specifications also agree with this.
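
A quick way to see the attribute rule in action is Python's standard xml.sax.saxutils.quoteattr, shown here only as an illustration. It picks whichever delimiter avoids escaping and falls back to &quot; only when the value contains both kinds of quote:

```python
from xml.sax.saxutils import quoteattr

# Only an apostrophe: double quotes are used as delimiters, nothing is escaped.
only_apostrophe = quoteattr("it's fine")

# Only double quotes: single quotes are used as delimiters, nothing is escaped.
only_double = quoteattr('say "hi"')

# Both kinds: double-quote delimiters, and the inner " becomes &quot;.
both_kinds = quoteattr('both \' and "')

print(only_apostrophe, only_double, both_kinds, sep="\n")
```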

So, the only point of contention is &nbsp; (the no-break space). The only mention of it in either specification is in the context of serialization. Otherwise, you should always use a literal space. Unless you are writing your own parser, I don't see the need to do any kind of serialization, so this is beside the point.

Neurosis answered 2/9, 2011 at 3:48 Comment(2)
There is no reason to escape > except in the very special and extremely rare case of ]]> in data in XML linearization. It may be escaped, if desired, for symmetry (with escaping <). This is what the references actually say. And there is no reason to escape ' or " except within an attribute value when the same character is used as the attribute value delimiter. – Papistry
If you only encode quotes when they appear inside an attribute value or inside element text content, what other context does that leave in which text appears and you don't escape them? – Inexertion
