Scope of XML languages defined by DTD vs XSD

Asked 30/10, 2013 at 13:21 Answered 2/11, 2013 at 16:24

Do the following propositions hold: For every DTD there is an XSD that defines exactly the same language, and for every XSD there is a DTD that defines exactly the same language? Or put another way: Is the collection of languages definable by DTDs exactly the collection of languages definable by XSDs?

Expanding on the question a little: An XML document is basically a large string. A language is a collection of strings. For example, the (infinite) set of all MathML documents is a language, and so is the set of all RSS documents and so on. MathML (RSS, ...) is also a proper subset of the (infinite) set of all XML documents. You can use DTD or XSD to define such a subset of XML.

Now, every DTD defines exactly one language. But if you think of all possible DTDs, you get a set of languages. My question is, is this set exactly the same as the one you get from all possible XSDs? If so, then DTD and XSD are equivalent in the sense that the scope of XML languages defined by either is equal.

Why is this question important? If both DTD and XSD are equivalent then it is possible to write a program that takes a DTD as input and gives you an equivalent XSD, and another program that does the opposite. I know there are quite a few programs out there that claim to do exactly this, but I'm in doubt whether or not that's actually possible.

Glen answered 30/10, 2013 at 13:21 Comment(1)

Sounds like a riddle ;-) – Latinity 30/10, 2013 at 13:27

An interesting question; well asked!

The answer is "no", in both directions.

Here is a DTD which has no equivalent in XSD:

<!ELEMENT e (#PCDATA | e)* >
<!ENTITY egbdf "Every good boy deserves favor.">

The set of character sequences accepted by this DTD includes both <e/> and <e>&egbdf;</e>, but not <e>&beadgcf;</e>.

Since XSD validation operates on an information set in which all entities have already been expanded, no XSD schema can distinguish the third case from the second.
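
The effect of entity expansion can be seen with any non-validating parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module (which checks well-formedness but does not validate against the DTD); the variable and helper names are made up for illustration. It parses the example document once with the entity reference and once with the replacement text written out. The two results are identical, and that parsed result is all a schema layered on top of it gets to see.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset taken from the example above.
DOCTYPE = (
    '<!DOCTYPE e [\n'
    '<!ELEMENT e (#PCDATA | e)* >\n'
    '<!ENTITY egbdf "Every good boy deserves favor.">\n'
    ']>\n'
)

# The same content, once via an entity reference and once spelled out.
with_reference = ET.fromstring(DOCTYPE + '<e>&egbdf;</e>')
with_literal = ET.fromstring(DOCTYPE + '<e>Every good boy deserves favor.</e>')

# After parsing, nothing records that a reference was ever used.
same_after_parsing = ET.tostring(with_reference) == ET.tostring(with_literal)


def parses(doc):
    """Return True if the parser accepts doc, False if it raises ParseError."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A reference to an entity that was never declared is rejected by the parser
# itself, before any schema-level processing could take place.
undeclared_ok = parses(DOCTYPE + '<e>&beadgcf;</e>')
```

Here `same_after_parsing` comes out true and `undeclared_ok` comes out false: the declared entity disappears into its replacement text, and the undeclared one never gets past the parser.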

A second area where DTDs can express constraints not expressible in XSD involves NOTATION types. I won't give an example; the details are too complicated for me to remember them correctly without looking them up, and not interesting enough to make me want to do so.

A third area: DTDs treat namespace attributes (aka namespace declarations) and general attributes in the same way; a DTD can therefore constrain the appearance of namespace declarations in documents. An XSD schema cannot. The same applies to attributes in the xsi namespace.

If we ignore all of those issues, and formulate the question with respect only to character sequences containing no references to named entities other than the pre-defined entities lt, gt, etc., then the answer changes: for every DTD not involving NOTATION declarations, there is an XSD schema that accepts precisely the same set of documents after entity expansion and with 'same' defined in a way that ignores namespace attributes and attributes in the xsi namespace.

In the other direction, the areas of difference include these:

XSD is namespace aware: the following XSD schema accepts any instance of element e in the specified target namespace, regardless of what prefix is bound to that namespace in the document instance.
```
<xs:schema xmlns:xs="..." targetNamespace="http://example.com/nss/24397">
  <xs:element name="e" type="xs:string"/>
</xs:schema>
```
No DTD can successfully accept all and only the e elements in the given namespace.

XSD has a richer set of datatypes and can use datatypes to constrain elements as well as attributes. The following XSD schema has no equivalent DTD:
```
<xs:schema xmlns:xs="...">
  <xs:element name="e" type="xs:integer"/>
</xs:schema>
```
This schema accepts the document <e>42</e> but not the document <e>42d Street</e>. No DTD can make that distinction, because DTDs have no mechanism for constraining #PCDATA content. The closest DTD would be <!ELEMENT e (#PCDATA)>, which accepts both sample documents.

XSD's xsi:type attribute allows in-document modifications of content models. The XSD schema described by the following schema document has no equivalent DTD:
```
<xs:schema xmlns:xs="...">
  <xs:complexType name="e">
    <xs:sequence>
      <xs:element ref="e" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="e2">
    <xs:sequence>
      <xs:element ref="e" minOccurs="2" maxOccurs="2"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="e" type="e"/>
</xs:schema>
```
This schema accepts the document <e xmlns:xsi="..." xsi:type="e2"><e/><e/></e> and rejects the document <e xmlns:xsi="..." xsi:type="e2"><e/><e/><e/></e>. DTDs have no mechanism for making content models depend on an attribute value given in the document instance.

XSD wildcards allow the inclusion of arbitrary well-formed XML among the children of specified elements; the closest one can come to that with a DTD is to use an element declaration of the form <!ELEMENT e ANY>, which is not the same because it requires declarations for all the elements which in fact appear.

XSD 1.1 provides assertions and conditional type assignment, which have no analogues in DTDs.
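
Two of the differences above (namespace awareness and datatypes) can be observed without a schema validator. The sketch below is illustrative only and assumes Python's standard library: `xml.etree.ElementTree` stands in for a namespace-aware processor, and the regular expression is a rough approximation of the xs:integer lexical form (an optional sign followed by decimal digits, after trimming surrounding whitespace), not a real XSD validator. The names `docs`, `expanded_names` and `looks_like_xs_integer` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://example.com/nss/24397"

# Three spellings of the same namespaced element e, differing only in prefix.
docs = [
    f'<e xmlns="{NS}">x</e>',
    f'<p:e xmlns:p="{NS}">x</p:e>',
    f'<q:e xmlns:q="{NS}">x</q:e>',
]

# A namespace-aware processor sees a single expanded name for all three.
# A DTD matches on the literal strings "e", "p:e" and "q:e" instead.
expanded_names = {ET.fromstring(d).tag for d in docs}

# Assumption: approximate xs:integer as an optional sign plus ASCII digits.
INTEGER = re.compile(r"[+-]?[0-9]+")


def looks_like_xs_integer(doc):
    """Check the text content of the root element against the approximation."""
    text = ET.fromstring(doc).text or ""
    return INTEGER.fullmatch(text.strip(" \t\r\n")) is not None


accepts_42 = looks_like_xs_integer("<e>42</e>")
accepts_street = looks_like_xs_integer("<e>42d Street</e>")
```

All three documents collapse to the one name `{http://example.com/nss/24397}e`, while the datatype check separates `<e>42</e>` from `<e>42d Street</e>`, a distinction that `<!ELEMENT e (#PCDATA)>` cannot draw.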

There are probably other ways in which the expressive power of XSD exceeds that of DTDs, but I think the point has been illustrated adequately.

I think a fair summary would be: XSD can express everything DTDs can express, with the exception of entity declarations and special cases like namespace declarations and xsi:* attributes, because XSD was designed to be able to do so. So the loss of information when translating a DTD to an XSD schema document is relatively modest, well understood, and mostly involves things most vocabulary designers regard as DTD artefacts not of substantive interest.

XSD can express more than DTDs can, again because XSD was designed to do so. In the general case, translation from XSD to DTD necessarily involves loss of information (the set of documents accepted may need to be larger, or smaller, or to be an overlapping set). Different choices can be made about how to manage the loss of information, which gives the question "How does one best translate an XSD into DTD form?" a certain theoretical interest. (Very few people, however, seem to find it an interesting question in practice.)

All of this focuses, as did your question, on documents as character sequences, on languages as document sets, and on schema languages as generators of languages in that sense. Issues of maintainability, and information present in the schema that does not turn into differences in the extension of document sets (e.g. the treatment of class hierarchies in the document model), are left out of account.

Nunn answered 2/11, 2013 at 16:24 Comment(1)

Thank you very much for your elaborate answer. This is exactly the kind of answer I was looking for. – Glen 2/11, 2013 at 16:52

Without qualifiers, the answer is no.

You have to define what it is you call a "language". In my mind, the languages you refer to are meant to define document schemata. A schema defines constraints on the document structure and content. The constraints expressible by XSD are far more powerful than those expressible by DTD. So no, they wouldn't be the same.

A comparison of DTD vs. XSD might help you understand why not.

Cosimo answered 30/10, 2013 at 15:08 Comment(9)

I have expanded a little on the question. I know that XSD is more expressive, but that doesn't necessarily mean that you can use it to define XML formats that you cannot define using DTD. – Glen 31/10, 2013 at 11:33

@alexraasch, you really need to look up a DTD vs. XSD comparison. You have to define what it is you call "format" - it is all in what one language can or cannot do, vs. the other. For example, DTD has no clue about namespaces, nor referential integrity constraints, nor does it have the ability to fully reflect object orientation concepts or user defined types... Additional "expressiveness" is there for a reason; if those reasons don't apply to your comparative study, then the outcome may be different... – Cosimo 31/10, 2013 at 13:17

(cont'd) Even if you limit this to that which is the definition of a tag and attribute set (is this what you call "format"?), you would need to take away XML namespaces, namespace and element scoping, cardinality constraints such as [2:5], etc. to say they're the same. – Cosimo 31/10, 2013 at 13:22

Well, if you can't define namespaces in DTD then that's enough to tell that DTD and XSD are NOT equivalent. So in general, you can't write a program that converts either type to the other. Thanks Petru. – Glen 1/11, 2013 at 21:04

@alexraasch - you say "I know that XSD is more expressive, but that doesn't necessarily mean that you can use it to define XML formats that you cannot define using DTD" -- but that is precisely what the technical term more expressive does mean, necessarily and exclusively. – Nunn 2/11, 2013 at 16:26

No it does not. For instance, the C programming language is more expressive than Assembly language. But since both are Turing complete and C compiles to Assembly, there is no program that you can write in C that you cannot write in Assembly. It's different from, say, the expressiveness of propositional logic vs. predicate logic. – Glen 2/11, 2013 at 16:43

OK, it all boils down to what exactly "expressiveness" means. – Glen 2/11, 2013 at 16:53

@alexraasch, yes... You have to define what it is you call a "language". – Cosimo 2/11, 2013 at 17:01

@alexraasch, it is for precisely that reason that I (among others) would not agree that C is more expressive than Assembly language. It is more succinct; it may be more suggestive; it is not more expressive, as that term is normally defined in comparisons of expressive power: mechanism A is more expressive than mechanism B if everything expressible by B can also be expressed by A and not vice versa. You may use words any way you wish, but if you wish to understand and be understood, you will need to take standard technical terms in their standard technical senses. – Nunn 2/11, 2013 at 18:23
