Why can the UTF-8 BOM bytes EF BB BF be replaced by \uFEFF?

The byte order mark (BOM) for UTF-8 is EF BB BF, as noted in section 23.8 of the Unicode 9 specification (search for "signature").

Many Java solutions for removing it are a simple one-liner:

 replace("\uFEFF", "")

I don't understand why this works.

Here is my test code. I check the bytes after calling String#replace and find that EF BB BF is INDEED removed. See this code run live at IdeOne.com.

It seems like magic. Why does this work?

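import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.junit.Test;

@Test
public void replaceRemovesUtf8Bom() throws Exception {
    byte[] b = new byte[]{-17, -69, -65, 97, 97, 97}; // EF BB BF 61 61 61
    char[] c = new char[10];
    // read() returns how many chars were decoded; the rest of c stays '\0'
    int n = new InputStreamReader(new ByteArrayInputStream(b), StandardCharsets.UTF_8).read(c);
    // Build the String from the decoded chars only and encode with an explicit charset
    byte[] bytes = new String(c, 0, n).replace("\uFEFF", "").getBytes(StandardCharsets.UTF_8);
    for (byte bt : bytes) { // 61 61 61, we can see EF BB BF is indeed removed
        System.out.println(bt);
    }
}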
Tantalous answered 18/1, 2019 at 3:32 Comment(2)
You are confusing encoding with the character code point. Also, in normal use, UTF-8 encoded content should not use a BOM. - Sisto
Related: How to add a UTF-8 BOM in Java? - Bismuthous

The reason is that Unicode text may start with the byte order mark (in UTF-8 it is neither required nor recommended [1]).

from Wikipedia

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream ...
...
The BOM is encoded in the same scheme as the rest of the document ...

Which means this special character (\uFEFF) must also be encoded in UTF-8.

UTF-8 can encode Unicode code points in one to four bytes.

  • code points which can be represented with 7 bits are encoded in one byte whose highest bit is always zero: 0xxx xxxx
  • all other code points are encoded in multiple bytes, depending on the number of bits. The number of leading 1 bits in the first byte gives the number of bytes used for the encoding, e.g. 110x xxxx means the encoding takes two bytes. Continuation bytes always have the form 10xx xxxx (the x bits carry the bits of the code point)

Code points in the range U+0000 - U+007F are encoded with one byte.
Code points in the range U+0080 - U+07FF are encoded with two bytes.
Code points in the range U+0800 - U+FFFF are encoded with three bytes.
Code points in the range U+10000 - U+10FFFF are encoded with four bytes.
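The ranges can be checked against the JDK's own UTF-8 encoder. The following is only a small sketch: the class name Utf8Lengths, the helper utf8Length, and the sample code points are chosen for illustration and are not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the JDK's UTF-8 encoder produces for one code point.
    // (Hypothetical helper, named for this sketch only.)
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One sample per range, plus the BOM character itself.
        int[] samples = {0x0041, 0x00E9, 0x20AC, 0xFEFF, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

U+FEFF lies between U+0800 and U+FFFF, so it should be reported with three bytes.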

A detailed explanation is on Wikipedia

For the BOM (U+FEFF) we need three bytes, because it falls in the range U+0800 - U+FFFF.

hex    FE       FF
binary 11111110 11111111

encode the bits in UTF-8

pattern for three byte encoding 1110 xxxx  10xx xxxx  10xx xxxx
the bits of the code point           1111    11 1011    11 1111
result                          1110 1111  1011 1011  1011 1111
in hex                          EF         BB         BF

EF BB BF already sounds familiar. ;-)

The byte sequence EF BB BF is nothing other than the BOM encoded in UTF-8.
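The same bit shuffling can be written out by hand. This is a minimal sketch of the three-byte pattern only. The class Utf8ThreeByte and the method encodeThreeByte are names made up for this example, and the sketch does not reject surrogate code points as a real encoder would.

```java
public class Utf8ThreeByte {
    // Encodes a code point in U+0800..U+FFFF with the pattern
    // 1110xxxx 10xxxxxx 10xxxxxx. Illustration only: surrogates are not rejected.
    static byte[] encodeThreeByte(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        for (byte b : encodeThreeByte(0xFEFF)) {
            System.out.printf("%02X ", b & 0xFF); // EF BB BF
        }
        System.out.println();
    }
}
```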

As byte order has no meaning for UTF-8, Java does not write a BOM when encoding to UTF-8, and it does not strip one when decoding.

Encoding the BOM character as UTF-8:

jshell> "\uFEFF".getBytes("UTF-8")
$1 ==> byte[3] { -17, -69, -65 }  // EF BB BF

Hence when the file is read the byte sequence gets decoded to \uFEFF.
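The decoding direction can be checked the same way. In this sketch the class BomDecode and the helper stripLeadingBom are hypothetical names. The helper is shown only as a possible alternative to replace("\uFEFF", ""), which removes every occurrence of the character rather than just the one at the start.

```java
import java.nio.charset.StandardCharsets;

public class BomDecode {
    // Removes a single leading U+FEFF and leaves later occurrences alone.
    // (Hypothetical helper for this sketch.)
    static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'a', 'a'};
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded.length());              // 4: the BOM is one char
        System.out.println(decoded.charAt(0) == '\uFEFF'); // true
        System.out.println(stripLeadingBom(decoded));      // aaa
    }
}
```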

When encoding to UTF-16, for example, the BOM is added:

jshell> " ".getBytes("UTF-16")
$2 ==> byte[4] { -2, -1, 0, 32 }  // FE FF + the encoded SPACE
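For comparison, the three UTF-16 charsets in the JDK can be put side by side. This is just a sketch. The class Utf16Bom and its hex helper are names invented here.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bom {
    // Formats bytes as space-separated upper-case hex (helper for this sketch).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // UTF-16 writes a big-endian BOM; the BE/LE variants write none.
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16BE))); // 00 20
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16LE))); // 20 00
    }
}
```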

[1] cited from: http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf

Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16, this sequence of bytes will be extremely rare at the beginning of text files in other character encodings.

Thorlay answered 18/1, 2019 at 9:31 Comment(1)
@BasilBourque I wasn't aware that one could misread the sentence that way. I have now made it clearer what I wanted to say. - Thorlay

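InputStreamReader decodes the UTF-8 encoded byte sequence (b) into UTF-16 code units (Java chars), and in the process the UTF-8 BOM bytes EF BB BF become the single character \uFEFF. Written out in big-endian order, that character is the byte pair FE FF, i.e. the UTF-16BE BOM, and big-endian is also what the UTF-16 charset defaults to, as the Charset documentation describes: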

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
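The decoding rules quoted above can be observed directly. A small sketch, assuming only the standard charsets in StandardCharsets; the class name Utf16Decode and the decodedLength helper are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Decode {
    // Number of chars produced when decoding the bytes with the given charset.
    // (Hypothetical helper for this sketch.)
    static int decodedLength(byte[] bytes, Charset cs) {
        return new String(bytes, cs).length();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x61}; // BOM + 'a', big-endian

        // UTF-16 consumes the BOM to pick the byte order: only "a" remains.
        System.out.println(decodedLength(bytes, StandardCharsets.UTF_16));   // 1

        // UTF-16BE keeps it as a ZERO-WIDTH NON-BREAKING SPACE: "\uFEFFa".
        System.out.println(decodedLength(bytes, StandardCharsets.UTF_16BE)); // 2
    }
}
```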

See JLS 3.1 to understand why the internal encoding of String is UTF-16:

https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

String#getBytes() returns a byte sequence in the platform's default encoding, which appears to be UTF-8 for your system.
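Because the result of getBytes() depends on the platform, passing the charset explicitly makes the behaviour reproducible. The sketch below uses a hypothetical class ExplicitCharset with a helper encodedLength to show that the same one-character string yields different bytes per charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Length of the byte sequence for s in the given charset (sketch helper).
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        // What the no-argument getBytes() would use on this machine.
        System.out.println(Charset.defaultCharset());
        System.out.println(encodedLength(bom, StandardCharsets.UTF_8));    // 3 (EF BB BF)
        System.out.println(encodedLength(bom, StandardCharsets.UTF_16BE)); // 2 (FE FF)
        // ISO-8859-1 cannot represent U+FEFF, so the encoder substitutes '?'.
        System.out.println(bom.getBytes(StandardCharsets.ISO_8859_1)[0]);  // 63
    }
}
```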

Summary

The sequence EF BB BF (UTF-8 BOM) is decoded to the character \uFEFF (FE FF when written as UTF-16BE) when the byte sequence is read into a String using InputStreamReader, because java.lang.String stores text as UTF-16 code units. After removing that character and calling String#getBytes(), the string is encoded into UTF-8 (the default charset for your platform) and you see your original byte sequence without a BOM.

Zimmermann answered 18/1, 2019 at 3:42 Comment(2)
And where does the language demonstrate that it is UTF-16BE, instead of UTF-16-Host? - Emanuel
@Emanuel adjusted the answer to explain why UTF-16BE is chosen - Zimmermann
