Isn’t UTF-8’s byte order different on big endian machines than on little endian machines? So why then doesn’t UTF-8 require a BOM?

UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.

If UTF-8 stored all code points in a single byte, it would make sense why endianness doesn’t play any role and thus why a BOM isn’t required. But since code points 128 and above are stored using 2, 3 or 4 bytes (up to 6 in UTF-8’s original definition), doesn’t that mean their byte order on big endian machines differs from that on little endian machines? So how can we claim UTF-8 always has the same byte order?

Thank you

EDIT:

UTF-8 is byte oriented

I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 the last byte), then with UTF-8 those two bytes are always written in the same order (thus if this character is written to a file on little endian machine LEM, B1 will be first and B2 last; similarly, if C is written to a file on big endian machine BEM, B1 will still be first and B2 still last).

But what happens when C is written to file F on LEM, and we then copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last and B2 the first byte), how will an app (running on BEM) reading F know whether F was created on BEM, and thus the order of the two bytes wasn’t swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?

I hope the question makes some sense.

EDIT 2:

In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.

a) Oh, so even though character C is 2 bytes long, an app (residing on BEM) reading F will read just one byte into memory at a time (thus it will first read B1 into memory and only then B2)?

b)

In UTF-8, you decide what to do with a byte based on its high-order bits

Assume file F contains two consecutive characters C and C1 (where C consists of bytes B1 and B2, while C1 consists of bytes B3, B4 and B5). How will an app reading F know which bytes belong together simply by checking each byte's high-order bits (for example, how will it figure out that B1 and B2 taken together should represent a character, and not B1, B2 and B3)?

If you believe that you're seeing something different, please edit your question and include

I’m not saying that. I simply didn’t understand what was going on.

c) Why aren't UTF-16 and UTF-32 also byte-oriented?

Uke answered 30/9, 2010 at 18:33 Comment(7)
"Byte oriented" means that you read a byte at a time, and decide what to do based on that byte. In UTF-8, you decide what to do with a byte based on its high-order bits. In UTF-16 and UTF-32, by comparison you deal with multiple bytes at a time, and have to organize them into words.Vying
In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time. If you believe that you're seeing something different, please edit your question and include (1) the source and destination machines and operating systems, (2) the exact steps that you're taking to copy the file (copy-paste from your terminal, do not paraphrase), and (3) proof that the file has been changed (for example, by showing byte-level output with od). Oh, and please use some highlight other than code.Vying
Uh, for some reason I've only now noticed your first comment. Anyway, I'll edit my question.Uke
From UTF-8 FAQ (unicode.org/faq/utf_bom.html): Q: What is the definition of UTF-8? A: UTF-8 is the byte-oriented encoding form of Unicode. (links to further details follow).Waldron
@Waldron And with that comment, you mean..?Lepidus
@KorayTugay That there are docs, I guess (comment is from 2012 :))Waldron
UTF-8 is not parameterized by an endianness but the intuition that multi-byte sequences must have some sort of ordering convention is a good one. There is a sort of big endianness baked directly into UTF-8 multi-byte sequences. If you paste all the significant bits of a UTF-8 byte sequence together from left to right and pad with leading zeros, you get a big endian representation of the code point (which is UTF-32 if you pad to 32 bits). Hypothetically, little endian could have been baked in but it would be extremely awkward.Autosuggestion
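
To make the byte-oriented decoding described in these comments concrete, here is a minimal C sketch (an illustration, not a production decoder: it assumes valid input and omits error handling). The lead byte's high-order bits give the sequence length, and pasting the significant bits together from left to right yields the code point, just as the last comment describes:

```c
#include <stdio.h>
#include <stdint.h>

/* Classify a byte by its high-order bits:
     0xxxxxxx -> a 1-byte (ASCII) character
     110xxxxx -> lead byte of a 2-byte sequence
     1110xxxx -> lead byte of a 3-byte sequence
     11110xxx -> lead byte of a 4-byte sequence
     10xxxxxx -> continuation byte, never a lead byte */
static int sequence_length(uint8_t b) {
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0; /* lone continuation byte: invalid as a lead byte */
}

int main(void) {
    /* "Aé€" in UTF-8: U+0041 (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes).
       This byte order is fixed by UTF-8 itself, not by the machine. */
    const uint8_t buf[] = { 0x41, 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
    size_t i = 0;
    while (i < sizeof buf) {
        int len = sequence_length(buf[i]);
        if (len == 0) break; /* invalid input; a real decoder must handle this */
        /* Take the significant bits of the lead byte... */
        uint32_t cp = (len == 1) ? buf[i] : (uint32_t)(buf[i] & (0xFF >> (len + 1)));
        /* ...then append 6 bits from each continuation byte, left to right. */
        for (int k = 1; k < len; k++)
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        printf("U+%04X (%d byte(s))\n", cp, len);
        i += (size_t)len;
    }
    return 0;
}
```

On any machine, big or little endian, this prints U+0041, U+00E9 and U+20AC: the grouping of bytes is determined entirely by the bit patterns of the bytes themselves, never by the machine's byte order.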

The byte order is different on big endian vs little endian machines for words/integers larger than a byte.

e.g. on a big-endian machine a short integer of 2 bytes stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most significant bits will be in the second byte and the 8 least significant bits in the first byte.

So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will be different depending on the endianness.
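
As a minimal sketch of that point (assuming a C environment), inspecting the in-memory representation of a 16-bit integer shows the difference:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint16_t x = 0x1234;          /* most significant byte 0x12, least significant 0x34 */
    unsigned char bytes[2];
    memcpy(bytes, &x, sizeof x);  /* copy the raw in-memory representation */

    /* A big-endian machine prints:    12 34
       A little-endian machine prints: 34 12 */
    printf("%02X %02X\n", bytes[0], bytes[1]);
    return 0;
}
```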

UTF-8 is byte oriented, so there's no issue regarding endianness. The first byte is always the first byte, the second byte is always the second byte, and so on, regardless of endianness.
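
By contrast, a UTF-8 string is defined directly as a sequence of bytes, so dumping it one byte at a time gives the same output everywhere. A short sketch using U+1F600 (the grinning-face example from the comments):

```c
#include <stdio.h>

int main(void) {
    /* U+1F600 (grinning face) encoded in UTF-8. These four bytes appear
       in this exact order on every machine, because UTF-8 is defined as
       a sequence of bytes, not of multi-byte words. */
    const unsigned char smiley[] = { 0xF0, 0x9F, 0x98, 0x80 };
    for (size_t i = 0; i < sizeof smiley; i++)
        printf("%02X ", smiley[i]);   /* always prints: F0 9F 98 80 */
    printf("\n");
    return 0;
}
```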

Regelation answered 30/9, 2010 at 18:48 Comment(12)
Could you see my edit? There's still something I don't quite understand.Uke
Neither BEM nor LEM swaps any bytes when you deal with bytes. They'll be swapped if you read more than 1 byte as a larger type, e.g. 2 bytes as a short or 4 bytes as an int; then you have to care about which byte goes where within the integer.Regelation
Bytes are not "automatically" swapped at all. Depending on endianness they have different meaning (if part of a larger integer), but there is no swapping.Sincere
Can we say that even a little-endian machine has to treat UTF-8 files as "big endian"? Because it must read the first byte, and based on that byte, if the character that is read is 2 bytes, it will read the next byte. But that is actually big endian... So I think we can say UTF-8 forces big endian in a way, no?Lepidus
@Koray Tugay UTF-8 needn't know endianness because it always deals with a string one byte at a time. Besides, storing a string as bytes is also done one byte at a time.Spoilage
@KorayTugay: no, you can't say that. Clearly you don't understand what endian actually is. What you said about Big Endian also applies to Little Endian. If you have a multi-byte integer, you still have to read all of the bytes to complete the integer. It is the interpretation of the order of the bytes that determines the value of the integer.Escobar
@KorayTugay: In any case, if a Unicode codepoint is encoded in UTF-8 using multiple bytes, you have to read the first byte to know how many total bytes are used, which could be 1, 2, 3 or 4 (5+ byte variants of UTF-8 are not used in modern systems). Endian does not apply in UTF-8 at all.Escobar
It would be good if someone added example images or ASCII-tables for thisAshburn
@RemyLebeau Could you please elaborate on this? How can the order of the bytes be determined if we use UTF-8 without the Endianness info? As you mentioned, we need to read the first byte. But how can we know which byte is the first byte? I know that, in UTF-8, the beginning sequence of the bits in the first byte is different than that of the 2nd, 3rd, and 4th byte. So, does the machine need to look through the bytes to find the 1st byte?Debose
@starriet UTF-8 is not subject to endian. The values of its individual code units don't span multiple bytes. The order of code units in a UTF is always the same regardless of machine, its the bytes of each individual code unit that is subject to endian...Escobar
@starriet Consider codepoint U+1F600 (grinning face). In UTF-8, it is four 8bit code units 0xF0 0x9F 0x98 0x80, that never changes, and as a byte sequence is F0 9F 98 80, so there's no question that F0 is always the 1st byte. The same codepoint in UTF-16 is two 16-bit code units 0xD83D 0xDE00, that never changes, but as a byte sequence that is either D8 3D DE 00 or 3D D8 00 DE depending on endian. That's why UTF-16LE (little endian) and UTF-16BE (big endian) variants exist. Same with UTF-32.Escobar

To answer c): UTF-16 and UTF-32 represent characters as 16-bit or 32-bit words, so they are not byte-oriented.

For UTF-8, the smallest unit is a byte, thus it is byte-oriented. The algorithm reads or writes one byte at a time. A byte is represented the same way on all machines.

For UTF-16, the smallest unit is a 16-bit word, and for UTF-32, the smallest unit is a 32-bit word. The algorithm reads or writes one word at a time (2 bytes, or 4 bytes). The order of the bytes in each word is different on big-endian and little-endian machines.
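
A small C sketch of why the word size matters: when a 16-bit UTF-16 code unit is serialized to bytes, the writer has to pick an order, which is exactly why the UTF-16LE and UTF-16BE variants exist (the byte values below match the U+1F600 example in the comments above):

```c
#include <stdio.h>
#include <stdint.h>

/* Serialize one UTF-16 code unit to two bytes: an order must be chosen. */
static void put_utf16_le(uint16_t u, unsigned char out[2]) {
    out[0] = (unsigned char)(u & 0xFF);  /* low byte first  */
    out[1] = (unsigned char)(u >> 8);
}

static void put_utf16_be(uint16_t u, unsigned char out[2]) {
    out[0] = (unsigned char)(u >> 8);    /* high byte first */
    out[1] = (unsigned char)(u & 0xFF);
}

int main(void) {
    /* U+1F600 as two UTF-16 code units (a surrogate pair). */
    const uint16_t units[] = { 0xD83D, 0xDE00 };
    for (int i = 0; i < 2; i++) {
        unsigned char le[2], be[2];
        put_utf16_le(units[i], le);
        put_utf16_be(units[i], be);
        printf("unit %04X -> LE: %02X %02X   BE: %02X %02X\n",
               units[i], le[0], le[1], be[0], be[1]);
    }
    return 0;  /* prints 3D D8 / D8 3D and 00 DE / DE 00 */
}
```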

Manado answered 18/4, 2016 at 15:38 Comment(3)
Yes, a byte is represented the same way on all machines. But there are code points that require more than one byte, even if we use UTF-8. So, if there are multiple bytes for one code point, how can these bytes be represented the same way on all machines?Debose
@starriet Yes, there are multiple bytes involved, but the order of the code units within a UTF encoding does not depend on endian, only the values of the individual code units do. In UTF-8, each individual code unit is 1 byte, and thus their values are not subject to endian. But in UTF-16/UTF-32, each individual code unit is 2/4 bytes, respectively, and thus their values are subject to endian.Escobar
This was finally the explanation that clicked for me, thank youDrupelet
