PCDATA vs CDATA in XML DTD

Asked 9/12, 2013 at 15:20 Answered 18/12, 2013 at 7:42

In XML DTDs, when defining an element, we use #PCDATA to say that the element can contain any parseable text. When defining an attribute, we use CDATA to say that its value can be any character data.

A CDATA section in XML is something that is not parsed by the XML parser (it acts as a multi-character escape sequence). To be consistent, when we use CDATA to define an attribute, the parser should not parse its value either. But it does!

Then why couldn't PCDATA have been used in place of CDATA for defining attributes?

Update: This has been kept this way to be backward compatible with SGML. What is the reasoning behind such naming in SGML?

Ligation answered 9/12, 2013 at 15:20 Comment(4)

possible duplicate of #918950 – Trickster 9/12, 2013 at 15:42

this one is based on the conclusion of the question you mention... – Ligation 9/12, 2013 at 16:24

How are you using CDATA for an attribute? This should not be possible. https://mcmap.net/q/737757/-specifying-attribute-values-as-cdata/231316 – Vang 9/12, 2013 at 16:38

I meant the case where CDATA is used to define the type of an attribute, not the CDATA section. – Ligation 9/12, 2013 at 16:48

When used as the declared value of an attribute, CDATA refers to the actual value of the attribute (character data), not to the context in which it is parsed. When parsing element content, on the other hand, we need a distinction between character data with no markup (CDATA) and parsed character data where delimiters are expected (PCDATA).

At first glance this seems arbitrary, but it is not.

In SGML, an attribute value specification may either be quoted (attribute value literal) or unquoted (attribute value).

attribute value specification = attribute value literal | attribute value

When the attribute value is unquoted, only name characters are allowed, and that set may be further restricted for some declared values such as NUMBER.

The content of an attribute value literal, on the other hand, is a sequence of replaceable character data surrounded by LIT/LITA delimiters (double and single quotes, respectively, in the reference concrete syntax).

attribute value literal =
   ( LIT , replaceable character data *, LIT) | 
   ( LITA , replaceable character data *, LITA)

Replaceable character data is "like CDATA except that entity references and character references are recognized" (Goldfarb, the SGML Handbook).

It follows that the replacement of entity references in attribute value literals does not depend on the declared value of the attribute. Therefore, if you have <!ENTITY foo "bar"> and <elem attr="&foo;">, the entity reference &foo; will be parsed in the context of replaceable character data (LIT recognition mode), yielding <elem attr=bar>. It does not matter whether attr is declared as CDATA, NAME or anything else.
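The same behavior can be observed in XML, which inherited these rules. Below is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (a non-validating, expat-based parser) and using NMTOKEN as a stand-in for SGML's NAME, which XML does not have. The helper name attr_value is hypothetical.

```python
import xml.etree.ElementTree as ET


def attr_value(declared_type: str) -> str:
    """Parse <elem attr="&foo;"/> with attr declared as the given type."""
    doc = (
        '<!DOCTYPE elem [\n'
        '  <!ENTITY foo "bar">\n'
        '  <!ELEMENT elem EMPTY>\n'
        f'  <!ATTLIST elem attr {declared_type} #IMPLIED>\n'
        ']>\n'
        '<elem attr="&foo;"/>'
    )
    return ET.fromstring(doc).get("attr")


# The entity reference is expanded whatever the declared type is.
values = {t: attr_value(t) for t in ("CDATA", "NMTOKEN")}
print(values)  # {'CDATA': 'bar', 'NMTOKEN': 'bar'}
```

Only the internal DTD subset is read here, and no validation takes place; the point is just that the declared type does not switch entity recognition on or off.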

Update

There is no need to say that entities in an attribute have to be parsed, because all attribute types have the same parsing rules: if the attribute value starts with a quote (LIT), then entities are recognized (replaceable character data) and the value ends when a matching end-quote is found.

Here CDATA means that a valid attribute value may contain arbitrary character data after entities are expanded. Had the attribute been declared as NUMBER, it would have been required to contain only numeric characters (or entities that expand to numeric characters).

In the example above, the CDATA attribute with value "&foo;" is equivalent to "bar", in the same way that a NUMBER attribute with value "&#48;" is equivalent to "0" (even though the sequence "&#48;" contains characters other than numeric ones).
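As a rough XML analogue, the sketch below feeds a character reference for the digit 0 (written &#48;) to Python's standard xml.etree.ElementTree parser. XML has no NUMBER type, so NMTOKEN is used here purely as an assumption for illustration, and the parser does not validate; the digit check is done by hand on the expanded value.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<!DOCTYPE elem [\n'
    '  <!ELEMENT elem EMPTY>\n'
    '  <!ATTLIST elem n NMTOKEN #IMPLIED>\n'
    ']>\n'
    '<elem n="&#48;"/>'
)

# The raw literal contains '&', '#' and ';', but the reference is replaced
# before the value is reported, so the application sees a single digit.
n = ET.fromstring(DOC).get("n")
print(repr(n), n.isdigit())  # '0' True
```

Any type constraint is therefore judged against the expanded value, not against the characters typed between the quotes.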

Kean answered 18/12, 2013 at 7:42 Comment(6)

My point is that since CDATA already has a meaning in the context of parsing, why not use a new name for attribute definitions? This second use of CDATA (in attribute definitions) seems inconsistent with its use in the first situation (element definition: CDATA elements are not parsed). – Ligation 23/12, 2013 at 13:29

I understand your point. The question on why the standard uses the same keyword on both places should be asked to the SGML authors. We, mere mortals, can only elaborate on how this choice is consistent with other uses of CDATA. – Kean 23/12, 2013 at 15:12

That's the reason I asked; maybe there's something I don't know that would explain the consistency. – Ligation 25/12, 2013 at 10:20

IMHO the consistency of such naming is explained in my answer above. – Kean 26/12, 2013 at 15:0

I understand that CDATA in terms of attributes means something different. The question was about using the same word in both places. Perhaps it would have been best if, instead of CDATA sections, they were called NPCDATA sections (non-parsed character data sections). – Ligation 2/1, 2014 at 17:59

@Ligation - If you are going to suggest a term replacement, then NPCDATA is too much typing, as NP is implied by the lack of a P. :) Better to suggest that the attribute declaration use the keyword RCDATA to stand for 'Replaceable character data' as Goldfarb defines it. -- But still, it is way too late to suggest even this to the SGML people who could have done something about it. – Bowstring 7/8, 2015 at 20:42

A CDATA section, like you would use in an element, is different from the CDATA attribute type.

The parsing that you are most likely observing (like entity references being resolved) is from attribute-value normalization.
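A small sketch of the difference, assuming Python's standard xml.etree.ElementTree module (expat-based and non-validating). The document below is made up for illustration: the same kind of text is placed once in a CDATA-typed attribute and once in a CDATA section.

```python
import xml.etree.ElementTree as ET

# The attribute value contains a literal tab, an entity reference and two
# predefined entities; the CDATA section contains similar-looking text.
DOC = (
    '<!DOCTYPE r [\n'
    '  <!ENTITY foo "bar">\n'
    '  <!ATTLIST r a CDATA #IMPLIED>\n'
    ']>\n'
    '<r a="x\t&foo; &lt;y&gt;"><![CDATA[&foo; <y>]]></r>'
)

root = ET.fromstring(DOC)
attr = root.get("a")  # normalized: tab becomes a space, references expanded
text = root.text      # CDATA section: passed through untouched

print(repr(attr))  # 'x bar <y>'
print(repr(text))  # '&foo; <y>'
```

The attribute value goes through normalization even though its type is CDATA, while the CDATA section is delivered verbatim.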

Decompose answered 9/12, 2013 at 16:40 Comment(7)

This seems kind of ambiguous to me. The way this CDATA attribute type works is like the PCDATA type for element definition in a DTD. Why was the same name CDATA used, would not PCDATA have been better? – Ligation 9/12, 2013 at 16:50

@Ligation - I don't know why CDATA was used for attributes instead of #PCDATA. If there are differences, I'm not sure what they are. – Decompose 9/12, 2013 at 17:37

I think of PCDATA as something that modifies the document's actual structure whereas CDATA is arbitrary text. Using this definition I think attributes as CDATA makes sense. Attributes and sections have different rules for escaping things within their CDATA but they both ultimately represent a string that doesn't change the structure (except for existing in the first place). – Vang 9/12, 2013 at 17:48

What exactly do you mean by "changing the actual structure of a document"? – Ligation 10/12, 2013 at 15:16

@Ligation - Please add another answer instead of editing mine. Your edit is a completely different answer. – Decompose 2/1, 2014 at 18:22

@DanielHaley - re: attributes with #PCDATA - this would imply you were allowed to have full markup in an attribute value, like <BR/>, and that is not allowed. I agree it was strange of the SGML authors to choose CDATA (instead of RCDATA for Replaceable CDATA, which is how Goldfarb defined it) for attribute declarations. But they did, and we must now live with it. Fortunately, it only hurts your brain for about a day, then you figure it out and move on, living with a bit of context-sensitive ambiguity in your life. :) – Bowstring 7/8, 2015 at 20:48

@Ligation - re: Chris Haas' comment about changing the structure -- I am guessing he means adding more nodes to the DOM. An attribute is a node, but its value cannot add any other nodes. A CDATA section is a node, but its contents cannot add any other nodes. Whereas PCDATA is capable of adding an arbitrary number of nodes to the DOM. – Bowstring 7/8, 2015 at 20:54
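The node-counting argument in the comments above can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree module; the element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Parsed character data: markup inside the content creates a child element.
pcdata = ET.fromstring('<p>one<br/>two</p>')
children_in_pcdata = len(pcdata)

# CDATA section: the same characters stay plain text, no child is created.
cdata = ET.fromstring('<p><![CDATA[one<br/>two]]></p>')
children_in_cdata = len(cdata)

# Attribute value: a literal '<' is not allowed at all, so parsing fails.
try:
    ET.fromstring('<p title="one<br/>two"/>')
    attr_markup_accepted = True
except ET.ParseError:
    attr_markup_accepted = False

print(children_in_pcdata, children_in_cdata, attr_markup_accepted)  # 1 0 False
```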
