Is the XML declaration node mandatory?

4

20

I had a discussion with a colleague of mine about the XML declaration node (I'm talking about this => <?xml version="1.0" encoding="UTF-8"?>).

I believe that for something to be called "valid XML", it requires an XML declaration node.

My colleague states that the XML declaration node is optional, since the default encoding is UTF-8 and the version is always 1.0. This makes sense, but what does the standard say?

In short, given the following file:

<books>
  <book id="1"><title>Title</title></book>
</books>

Can we say that:

  1. Is it valid XML?
  2. Is it a valid XML node?
  3. Is it a valid XML document?

Thank you very much.

Harmonia answered 13/1, 2011 at 10:40 Comment(0)
40

This:

<?xml version="1.0" encoding="UTF-8"?>

is not a processing instruction - it is the XML declaration. Its purpose is to configure the XML parser correctly before it starts reading the rest of the document.

It looks like a processing instruction, but unlike a real processing instruction it will not be part of the DOM the parser creates.
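
As a rough illustration of that difference, here is a minimal sketch using Python's standard `xml.dom.minidom`. The sample document and the `render` processing instruction are made up for the example; the point is only that the real processing instruction shows up as a top-level node while the XML declaration does not.

```python
from xml.dom import Node, minidom

# Hypothetical sample: an XML declaration, a real processing instruction,
# and an empty root element.
SAMPLE = b'<?xml version="1.0" encoding="UTF-8"?><?render mode="fast"?><books/>'


def top_level_nodes(data):
    """Return (nodeType, nodeName) for every direct child of the document."""
    doc = minidom.parseString(data)
    return [(node.nodeType, node.nodeName) for node in doc.childNodes]


nodes = top_level_nodes(SAMPLE)
for node_type, node_name in nodes:
    print(node_type, node_name)
```

Only two children are listed: the processing instruction and the root element. The declaration was consumed by the parser before the tree was built.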

It is not necessary for "valid" XML. "Valid" means "represents a well-defined document type, as described in a DTD or a schema". Without a schema or DTD the word "valid" has no meaning.

Many people misuse "valid" when they really mean "well-formed". A well-formed XML document is one that obeys the basic syntax rules of XML.
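
A quick way to see what "well-formed" means in practice is to hand a string to a parser and check whether it is accepted. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and both sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# No XML declaration, but every tag is properly nested and closed.
NO_DECLARATION = '<books><book id="1"><title>Title</title></book></books>'

# Same content, but the root element is closed with the wrong tag name.
MISMATCHED = '<books><book id="1"><title>Title</title></book></book>'

print(is_well_formed(NO_DECLARATION))
print(is_well_formed(MISMATCHED))
```

The missing declaration does not bother the parser, while a mismatched closing tag does.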

There is no XML declaration necessary for a document to be well-formed, either, since there are defaults for both version and encoding (1.0 and UTF-8/UTF-16, respectively). If a Unicode BOM (Byte Order Mark) is present in the file, it determines the encoding. If there is no BOM and no XML declaration, UTF-8 is assumed.

Here is a canonical thread on how encoding declaration and detection work in XML files: "How default is the default encoding (UTF-8) in the XML Declaration?"
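
The fallback order described above (BOM first, then the declaration, then UTF-8) can be sketched roughly as follows. This is a deliberately simplified illustration, not the full autodetection procedure from the XML specification: the function name `guess_xml_encoding` is made up, and UTF-32 as well as UTF-16 without a BOM are ignored.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte stream.
_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data):
    """Simplified guess: BOM, then declared encoding, then UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: fall back to the default.
    return "utf-8"


print(guess_xml_encoding(b"<books/>"))
print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><books/>'))
print(guess_xml_encoding("<books/>".encode("utf-16")))
```

A real parser does more work than this, which is why the linked thread is worth reading for the details.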


To your questions:

  1. Is it valid XML?
    This cannot be answered without a DTD or a schema. It is well-formed, though.
  2. Is it a valid XML node?
    A node is a concept that is related to an in-memory representation of a document (a DOM). This snippet can be parsed into a node, since it is well-formed.
  3. Is it a valid XML document?
    See #1.

You are confusing a few XML concepts here (not to worry, this confusion is common and stems partly from the fact that the concepts overlap and names are mis-used rather often).

  • It all starts with structured data consisting of names, values and attributes that is organized as a tree.
  • XML means, most basically, a syntax to represent this structured data in textual form (it's a "Markup Language"). It is what you get when you serialize the tree into a string of characters and it can be used to de-serialize a string of characters into a tree again.
  • Document usually refers to a string of characters that represent a serialized tree. It can be stored in a file, sent over the network or created in-memory.
  • The rules of serialization and de-serialization are very strictly defined. A document (a "string of characters") that can successfully be de-serialized into a tree is said to be well-formed.
  • The semantics of such a tree (allowed elements, element count and order, namespaces, any number of complex rules, really) can be defined in what is called a DTD or a schema. If a tree obeys a certain set of well-defined semantics, it is said to be valid.
  • The term Document Object Model (DOM) refers to the standardized in-memory representation of structured data. It's the name of a well-defined API to access this tree with standardized methods.
  • A node is the basic data structure of a Document Object Model.
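
To make the serialize/de-serialize idea from the list above concrete, here is a small sketch with Python's standard `xml.etree.ElementTree` (which offers its own tree API rather than the W3C DOM). The variable names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# The "document": a string of characters, with no XML declaration.
DOCUMENT = '<books><book id="1"><title>Title</title></book></books>'

# De-serialize the string into an in-memory tree of element nodes.
root = ET.fromstring(DOCUMENT)
first_book = root[0]
summary = (
    root.tag,
    first_book.tag,
    first_book.get("id"),
    first_book.findtext("title"),
)

# Serialize the tree back into a string of characters.
round_tripped = ET.tostring(root, encoding="unicode")

print(summary)
print(round_tripped)
```

The tree carries the names, attributes and values; the string is just one textual representation of it.
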
Goren answered 13/1, 2011 at 10:50 Comment(3)
+1. Thank you very much for this complete and very instructive answer. I updated my question replacing "processing instructions node" with "XML declaration" in case someone looks for the same question using the proper terms. - Harmonia
The default encoding isn't simply UTF-8. If the encoding isn't specified in the XML declaration, the encoding can be UTF-8 or UTF-16 if defined in the Byte Order Mark (BOM), or finally is UTF-8 if no BOM is present. - Reube
Good point, I did not think about that. I've added a bit of clarification and a link to a thread that goes into the details. - Goren
4

According to the Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation 26 November 2008, section http://www.w3.org/TR/2008/REC-xml-20081126/#sec-prolog-dtd,
without an XML declaration it is not valid (even though it is well-formed and complete).

Pizza answered 13/1, 2011 at 10:52 Comment(1)
The spec states that XML documents SHOULD begin with an XML declaration. It does not say that without an XML declaration an XML document is not valid. - Goren
1

The specification states:

Definition: XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.

Also, for a document to be valid, it must have a document type declaration associated with it. The snippet you show here seems to be a well-formed node, but in no way a valid document.

Gwenni answered 13/1, 2011 at 10:48 Comment(0)
0

Note that validity depends on the DTD or schema associated with the document. In your case

<books>
  <book id="1"><title>Title</title></book>
</books>

the minimum a DTD must contain is an ELEMENT declaration for each of "books", "book" and "title", plus an ATTLIST for "book" that declares "id", its type, and whether it is mandatory or optional. It would also declare that "book" may (or must) contain "title", and that "title" may (or must) contain PCDATA content (a string).

A DTD might also declare that certain other elements must be present, in which case your XML document would be invalid. There are many DTDs which would make your document valid and many which would make it invalid.
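
As a rough sketch of what such a DTD could look like, the example below embeds one possible DTD as an internal subset and feeds two documents to Python's standard `xml.etree.ElementTree`. The content models chosen here (`book+`, a required `id`) are only one of many possibilities and are assumptions for the example. Note that this standard-library parser is non-validating: it checks well-formedness only, so it accepts both documents, and a validating parser would be needed to tell them apart.

```python
import xml.etree.ElementTree as ET

# One possible DTD for the snippet (an assumption, many others would work).
DTD = """<!DOCTYPE books [
  <!ELEMENT books (book+)>
  <!ELEMENT book (title)>
  <!ATTLIST book id CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
]>
"""

# Follows the DTD above.
CONFORMING = DTD + '<books><book id="1"><title>Title</title></book></books>'

# Breaks the DTD above: the required "id" attribute is missing.
VIOLATING = DTD + "<books><book><title>Title</title></book></books>"


def parses(text):
    """Return True if the non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(parses(CONFORMING))
print(parses(VIOLATING))
```

Both documents are well-formed, so both parse; only the first would pass validation against this particular DTD.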

Chuckchuckfull answered 13/1, 2011 at 11:3 Comment(0)
