The reason is that Unicode text should start with the byte order mark (except for UTF-8, where it is permitted but not mandatory[1]).
from Wikipedia
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
...
The BOM is encoded in the same scheme as the rest of the document ...
Which means this special character (\uFEFF) must also be encoded in UTF-8.
UTF-8 can encode Unicode code points in one to four bytes.
- code points which can be represented with 7 bits are encoded in one byte; the highest bit is always zero: 0xxx xxxx
- all other code points are encoded in multiple bytes, depending on the number of bits. The leading set bits of the first byte give the number of bytes used for the encoding, e.g. 110x xxxx means the encoding uses two bytes. Continuation bytes always start with 10, i.e. 10xx xxxx (the x bits carry the bits of the code point)
The code points in the range U+0000 - U+007F can be encoded with one byte.
The code points in the range U+0080 - U+07FF can be encoded with two bytes.
The code points in the range U+0800 - U+FFFF can be encoded with three bytes.
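The ranges above can be checked against the JDK's own UTF-8 encoder. The following is only a minimal sketch: the class name `Utf8Lengths` and both helper methods are made up for illustration, and the sample code points are arbitrary picks from each range.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    // Byte count predicted by the ranges listed above (hypothetical helper).
    static int expectedLength(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    // Byte count actually produced by the JDK's UTF-8 encoder.
    // Note: surrogate code points (U+D800 - U+DFFF) are not valid on their own
    // and are deliberately not part of the samples below.
    static int actualLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0xFEFF};
        for (int cp : samples) {
            System.out.printf("U+%04X expected=%d actual=%d%n",
                    cp, expectedLength(cp), actualLength(cp));
        }
    }
}
```

For U+FEFF both values come out as 3, which is what the next step relies on.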
A detailed explanation is on Wikipedia
For the BOM we need three bytes.
hex:    FE FF
binary: 1111 1110 1111 1111

encode the bits in UTF-8:

pattern for three-byte encoding: 1110 xxxx  10xx xxxx  10xx xxxx
the bits of the code point:           1111    11 1011    11 1111
result:                          1110 1111  1011 1011  1011 1111
in hex:                          EF         BB         BF
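The same bit shuffling can be written out by hand. This is a sketch of the three-byte pattern only, not a full UTF-8 encoder; the class `ThreeByteUtf8` and its methods are hypothetical names chosen for this example.

```java
final class ThreeByteUtf8 {
    // Encodes a code point from U+0800 - U+FFFF using the pattern
    // 1110 xxxx  10xx xxxx  10xx xxxx.
    // Assumption: the caller does not pass a surrogate (U+D800 - U+DFFF).
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10 + low 6 bits
        };
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeThreeBytes(0xFEFF))); // EF BB BF
    }
}
```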
EF BB BF sounds familiar already. ;-)
The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.
As the byte order mark has no meaning for UTF-8, Java does not treat it specially there: the UTF-8 encoder does not write it and the decoder does not strip it.
Encoding the BOM character as UTF-8:
jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 } // EF BB BF
Hence, when the file is read, the byte sequence gets decoded to \uFEFF.
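To see this from the decoding side, here is a small sketch that decodes a byte array standing in for a file that starts with EF BB BF. The class `BomDecode` and the helper `stripBom` are hypothetical names; dropping the first character is just one possible way to deal with the leftover BOM.

```java
import java.nio.charset.StandardCharsets;

final class BomDecode {
    // Removes a leading U+FEFF if present (hypothetical helper).
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Stand-in for file content: UTF-8 BOM followed by "hi".
        byte[] fileContent = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String decoded = new String(fileContent, StandardCharsets.UTF_8);

        System.out.println(decoded.length());              // 3: BOM + 'h' + 'i'
        System.out.println(decoded.charAt(0) == '\uFEFF'); // true
        System.out.println(stripBom(decoded));             // hi
    }
}
```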
For other encodings, e.g. UTF-16, the BOM is added:
jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 } // FE FF + the encoded SPACE
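For comparison, a sketch of how the three UTF-16 charsets from `StandardCharsets` behave when encoding a single space. Only the plain `UTF_16` charset writes the BOM; the variants with an explicit byte order do not. The class `Utf16Bom` and its `hex` helper are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Bom {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16BE))); // 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16LE))); // 20 00
    }
}
```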
[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf
Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.