Meaning of - <?xml version="1.0" encoding="utf-8"?>
S

5

125

I am new to XML and I am trying to understand the basics. I read the line below in "Learning XML", but it is still not clear to me. Can someone point me to a book or website that explains these basics clearly?

From Learning XML:

The XML declaration describes some of the most general properties of the document, telling the XML processor that it needs an XML parser to interpret this document.

What does this mean?

I understand the xml version part - both doc and user of doc should "talk" in the same version of XML. But what about the encoding part? Why is that necessary?

Struggle answered 6/12, 2012 at 12:3 Comment(2)
w3.org/TR/xml - Secularism
Possible duplicate of What use is the 'encoding' in the XML header? - Tabbi
S
147

To understand the "encoding" attribute, you have to understand the difference between bytes and characters.

Think of bytes as numbers between 0 and 255, whereas characters are things like "a", "1" and "Ä". The set of all characters that are available is called a character set.

Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.
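A quick way to see this for yourself is the following Python sketch (not part of the original answer; the three sample characters and three encodings are arbitrary choices for illustration). It encodes the same characters in different encodings and records the resulting byte values.

```python
# The same character becomes a different byte sequence in each encoding.
samples = ["a", "Ä", "€"]
encodings = ["utf-8", "utf-16-le", "cp1252"]

byte_values = {}
for ch in samples:
    for enc in encodings:
        # str.encode() turns characters into bytes; list() exposes the numbers.
        byte_values[(ch, enc)] = list(ch.encode(enc))
        print(f"{ch!r} in {enc}: {byte_values[(ch, enc)]}")
```

Note how "a" is a single byte (97) in UTF-8 and CP1252 but two bytes in UTF-16, while "€" takes three bytes in UTF-8 and only one in CP1252.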

Most encodings are based on an old character set and encoding called ASCII, which uses a single byte per character (actually, only 7 bits) and contains 128 characters, including a lot of the common characters used in US English.

For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.

Extract of ASCII Table 60-65
╔══════╦══════════════╗
║ Byte ║  Character   ║
╠══════╬══════════════╣
║  60  ║      <       ║
║  61  ║      =       ║
║  62  ║      >       ║
║  63  ║      ?       ║
║  64  ║      @       ║
║  65  ║      A       ║
╚══════╩══════════════╝

In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).
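The table extract and the 0 to 127 limit can be checked with a short Python sketch (an illustration only; the variable names are made up for this example). It decodes the byte values 60 to 65 as ASCII and then tries a value above 127.

```python
# Byte values 60..65, decoded as ASCII, give the characters from the table.
raw = bytes(range(60, 66))
text = raw.decode("ascii")
print(text)

# A value above 127 is not part of ASCII, so decoding it fails.
try:
    bytes([226]).decode("ascii")
    outside_range_ok = True
except UnicodeDecodeError:
    outside_range_ok = False
print(outside_range_ok)
```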

However, once you start needing more characters than basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding, as 128 values are not enough to fit all the characters in. Some encodings use a single byte per character (at most 256 characters), while others use several bytes; UTF-8, for example, uses between one and four (its original design allowed up to six).

Over time a lot of encodings have been created. In the Windows world, there are CP1252 and the closely related ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively [see comments].

One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.

For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.
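Here is the same example as a small Python sketch (illustrative only; the variable names are not from the answer). It encodes â both ways, then deliberately decodes the UTF-8 bytes with the wrong encoding.

```python
ch = "â"

latin1_bytes = ch.encode("iso-8859-1")  # one byte
utf8_bytes = ch.encode("utf-8")         # two bytes
print(list(latin1_bytes), list(utf8_bytes))

# Reading the UTF-8 bytes as ISO 8859-1 yields two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Going the other way, the single ISO 8859-1 byte is not valid UTF-8 at all.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
print(lone_byte_is_valid_utf8)
```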

Think of XML as not a sequence of characters but a sequence of bytes.

Imagine the system receiving the XML sees the bytes 195, 162. How does it know what characters these are?

In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used in the XML.
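To make that concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser (the element name note is just a placeholder for this example). The two documents contain exactly the same bytes 195, 162 as content; only the declared encoding differs.

```python
import xml.etree.ElementTree as ET

payload = bytes([195, 162])  # the two bytes from the example above

# Same content bytes, two different encoding declarations.
doc_utf8 = b'<?xml version="1.0" encoding="utf-8"?><note>' + payload + b"</note>"
doc_latin1 = b'<?xml version="1.0" encoding="iso-8859-1"?><note>' + payload + b"</note>"

# The parser uses the declaration to decide what characters the bytes mean.
text_utf8 = ET.fromstring(doc_utf8).text
text_latin1 = ET.fromstring(doc_latin1).text
print(repr(text_utf8), repr(text_latin1))
```

With the UTF-8 declaration the parser reports one character (â); with the ISO 8859-1 declaration it reports two (Ã and ¢).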

Since most common encodings are compatible with ASCII as far as basic alphabetic characters and symbols go, the declaration itself can usually get away with using only ASCII characters to say what the encoding is. In other cases, the parser must try to figure out the encoding of the declaration. Since it knows the declaration begins with <?xml, this is a lot easier to do.
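The guessing step can be sketched roughly as follows. This is a simplified illustration, loosely modelled on the autodetection notes in the XML specification, not what any particular parser actually does; the function name sniff_encoding_family and its return strings are made up for this example, and UTF-32 is left out to keep it short.

```python
def sniff_encoding_family(data: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML document."""
    # A byte order mark, if present, settles the question directly.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16 big-endian (BOM)"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16 little-endian (BOM)"
    # Otherwise, look at how the known opening "<?xm" was encoded.
    if data.startswith(b"\x00<\x00?"):
        return "UTF-16 big-endian"
    if data.startswith(b"<\x00?\x00"):
        return "UTF-16 little-endian"
    if data.startswith(b"<?xm"):
        return "ASCII-compatible"
    if data.startswith(b"\x4c\x6f\xa7\x94"):
        return "EBCDIC"
    # No declaration and no BOM: XML's default is UTF-8.
    return "unknown (XML defaults to UTF-8)"


declaration = '<?xml version="1.0" encoding="utf-16"?>'
for codec in ("ascii", "utf-16-be", "utf-16-le"):
    print(codec, "->", sniff_encoding_family(declaration.encode(codec)))
```

Once the family is known, the parser can decode just enough of the declaration to read the encoding name and then switch to that encoding for the rest of the document.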

Finally, the version attribute specifies the XML version, of which there are two at the moment (see Wikipedia XML versions). There are slight differences between the versions, so an XML parser needs to know what it is dealing with. In most cases (for English speakers anyway), version 1.0 is sufficient.

Seminar answered 10/12, 2014 at 10:18 Comment(4)
"The header itself uses the ASCII encoding": I think you are referring to the XML declaration. It is encoded like the rest of the document; UTF-16 or whatnot. An XML processor can make a few trials until it can read the encoding specification. - Extraditable
I was under the impression that the preamble/prolog was to be encoded under UTF-8 and that told the parser how to convert the remaining bytes (the actual XML document) to the correct encoding. Wrong again! :-) - Kreindler
Here is a suggested reading: joelonsoftware.com/2003/10/08/… - Thought
Since Java 9 compact strings (JEP 254), "Java uses UTF-16 natively" is no longer always the case. - Volant
M
7

This is the optional XML preamble (the XML declaration).

  • version="1.0" means that the file conforms to version 1.0 of the XML standard
  • encoding="utf-8" means that the file is encoded using the UTF-8 Unicode encoding
Madelene answered 6/12, 2012 at 12:6 Comment(0)
E
4

The encoding declaration identifies which encoding is used to represent the characters in the document.

More on the XML Declaration here: http://msdn.microsoft.com/en-us/library/ms256048.aspx

Empathic answered 6/12, 2012 at 12:6 Comment(0)
H
3

Can someone point me to a book or website which explains these basics clearly?

You can check this XML Tutorial with examples.

But what about the encoding part? Why is that necessary?

The W3C provides an explanation of encoding:

"The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode..."

Henden answered 20/7, 2014 at 6:15 Comment(0)
M
0

The XML declaration in the document map consists of the following:

The version number, <?xml version="1.0"?>.

This is mandatory. Although the number might change for future versions of XML, 1.0 is the version in general use (XML 1.1 also exists).

The encoding declaration,

<?xml version="1.0" encoding="UTF-8"?>

This is optional. If used, the encoding declaration must appear immediately after the version information in the XML declaration, and must contain a value representing an existing character encoding.
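These rules can be tried out with a small Python sketch using the standard xml.etree.ElementTree parser (the helper name parses and the one-element test documents are made up for this example; other parsers may report such errors differently).

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Version only: fine, the encoding declaration is optional.
version_only = parses(b'<?xml version="1.0"?><a/>')
# Version followed by encoding: fine.
version_then_encoding = parses(b'<?xml version="1.0" encoding="UTF-8"?><a/>')
# Encoding before version: rejected, the order is fixed.
encoding_first = parses(b'<?xml encoding="UTF-8" version="1.0"?><a/>')
# Encoding without version: rejected, the version is mandatory.
encoding_only = parses(b'<?xml encoding="UTF-8"?><a/>')

print(version_only, version_then_encoding, encoding_first, encoding_only)
```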

Mestee answered 27/4, 2013 at 17:6 Comment(0)
