Does a valid XML file require an XML declaration?

3

137

I am parsing an XML file using the SAX parser of Xerces.
Is the XML declaration <?xml version="1.0" encoding="UTF-8"?> required?

Scriptorium answered 10/8, 2011 at 7:45 Comment(4)
There is a difference between valid and well-formed documents. Which of those do you mean?Phidias
I am receiving a prolog error/invalid UTF-8 encoding error. Then I found a BOM in the XML file, which the user had opened in Notepad (I can't avoid this). I am not sure whether I'm referring to valid or well-formed documents. I just need to avoid the errors, which is why I am creating a function that removes all bytes prior to "<". For that I need to know whether the XML declaration is required. What do you think, guys?Scriptorium
Is there a Java class that removes the BOM, or the first few bytes, from an XML file read from an InputStream? I am thinking of the skip method from FilterInputStream and PushbackInputStream, but I have no idea how to use them.Scriptorium
@eros: "i am not sure i'm referring to a valid or well-formed documents" See Well-formed vs Valid XML for a concise explanation of the difference.Bagasse
203

In XML 1.0, the XML Declaration is optional. See section 2.8 of the XML 1.0 Recommendation, where it says it "should" be used -- which means it is recommended, but not mandatory. In XML 1.1, however, the declaration is mandatory. See section 2.8 of the XML 1.1 Recommendation, where it says "MUST" be used. It even goes on to state that if the declaration is absent, that automatically implies the document is an XML 1.0 document.

Note that in an XML Declaration the encoding and standalone are both optional. Only the version is mandatory. Also, these are not attributes, so if they are present they must be in that order: version, followed by any encoding, followed by any standalone.

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" standalone="yes"?>
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
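
The rules above can be checked against a parser. The following is a minimal sketch using the JAXP SAX parser that ships with the JDK; the class name DeclarationCheck and the method isWellFormed are hypothetical names chosen for illustration, and the sketch parses from a character stream, so the declared encoding plays no role here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DeclarationCheck {

    // Returns true if the parser accepts the text as well-formed XML.
    // DefaultHandler rethrows fatal (well-formedness) errors as SAXException.
    public static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Not expected for an in-memory string and a default factory.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<root/>",                                           // no declaration
            "<?xml version=\"1.0\"?><root/>",                    // version only
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>", // version + encoding
            "<?xml encoding=\"UTF-8\"?><root/>"                  // version missing
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

The first three samples are accepted, while the last one is rejected because the version is missing from the declaration.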

If you don't specify the encoding in this way, XML parsers try to guess what encoding is being used. The XML 1.0 Recommendation describes one possible way character encoding can be autodetected. In practice, this is not much of a problem if the input is encoded as UTF-8, UTF-16 or US-ASCII. Autodetection doesn't work when it encounters 8-bit encodings that use characters outside the US-ASCII range (e.g. ISO 8859-1) -- avoid creating these if you can.
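
To make the autodetection idea concrete, here is a simplified sketch of the first-bytes check described in the (non-normative) Appendix F of the XML 1.0 Recommendation. The class name EncodingSniffer and the method guessEncoding are hypothetical; the sketch ignores the UCS-4 and EBCDIC cases, and a real parser would go on to read the encoding declaration itself.

```java
public class EncodingSniffer {

    // Guesses an encoding family from the first bytes of an XML document.
    // Simplified: UCS-4 and EBCDIC signatures are deliberately left out.
    public static String guessEncoding(byte[] data) {
        // Byte-order marks come first.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        // No BOM: look at how the leading "<?" is laid out in memory.
        if (startsWith(data, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(data, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        // Anything else is treated as UTF-8 unless a declaration says otherwise.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

This also shows why an undeclared ISO 8859-1 file is a problem: its first bytes look exactly like UTF-8 or US-ASCII, so the guess falls through to the UTF-8 default.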

The standalone declaration indicates whether or not the XML document can be correctly processed without the DTD. People rarely use it. These days, it is bad practice to design an XML format that is missing information without its DTD.

Update:

A "prolog error/invalid utf-8 encoding" error indicates that the actual data the parser found inside the file did not match the encoding that the XML declaration says it is. Or in some cases the data inside the file did not match the autodetected encoding.

Since your file contains a byte-order mark (BOM), it is probably in UTF-16 encoding. I suspect that your declaration says <?xml version="1.0" encoding="UTF-8"?>, which is incorrect if the file has been changed into UTF-16 by Notepad. The simple solution is to remove the encoding and simply say <?xml version="1.0"?>. You could also edit it to say encoding="UTF-16", but that would be wrong for the original file (which wasn't in UTF-16) or if the file somehow gets changed back to UTF-8 or some other encoding.

Don't bother trying to remove the BOM -- that's not the cause of the problem. Using Notepad or WordPad to edit XML is the real problem!
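
If a BOM nevertheless has to be skipped before parsing (for instance, with a parser that rejects a BOM at the start of a UTF-8 document), the PushbackInputStream mentioned in the question can do it. This is only a sketch under assumptions: the class name BomSkipper and the method skipUtf8Bom are hypothetical, and it handles the three-byte UTF-8 BOM only, not UTF-16.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomSkipper {

    // Wraps the stream and drops a leading UTF-8 BOM (EF BB BF) if present.
    // Any other leading bytes are pushed back, so the caller sees them unchanged.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int count = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (count < 3) {
            int read = pushback.read(head, count, 3 - count);
            if (read < 0) {
                break;
            }
            count += read;
        }
        boolean hasBom = count == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && count > 0) {
            pushback.unread(head, 0, count);
        }
        return pushback;
    }
}
```

The returned stream can then be handed to the parser in place of the original, for example wrapped in an org.xml.sax.InputSource.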

Watkin answered 10/8, 2011 at 8:20 Comment(3)
My question was answered, but my follow-up question was not. Do I need to create another question for that, or could you please add it here?Scriptorium
The BOM can be the cause of the problem. Some older XML parsers will not accept a BOM at the start of a UTF-8 document (it was designed for UTF-16, and only became acceptable with UTF-8 later). But it's unlikely to be a problem if you're using a recent version of Xerces.Blanca
Also note that in the "Save As" dialog in Notepad you can choose what encoding to save your XML as. If you want to remove the BOM, just save as "ASCII" (assuming you're not using any Unicode characters). For the first 128 characters, ASCII and UTF-8 are identical.Paiz
9

The XML declaration is optional, so your XML is well-formed without it. But it is recommended to use it so that parsers do not make wrong assumptions, specifically about the encoding used.

Wilkie answered 10/8, 2011 at 7:47 Comment(3)
Am I the only one that finds it bizarre that you tell XML parsers what encoding to use after they've already started decoding your document? I mean clearly, if it can parse that tag and understand what it says, then it has already figured out the correct encoding. I can't think of any legitimate use for the encoding attribute.Paiz
@Paiz With no BOM, the encoding is specified to be 8-bit. So either ASCII or UTF-8 or any of them old 8-bit national encodings. The XML declaration is all lower-half 8-bit, which is equal among all those encodings and conveys enough information to choose the upper half. Not the best of designs, but still better than guessing between, say, CP1241 and CP866, as was common for text files of them olden days.Spiro
But they should have gone clean and said XML is UTF-8 - end of story.Behold
5

It is only required if you aren't using the default values for version and encoding (which you are in that example).

Reagent answered 10/8, 2011 at 7:51 Comment(0)
