Among UTF-16, UTF-16BE, and UTF-16LE, is the endianness of UTF-16 the computer's endianness?

UTF-16 is a character encoding built from two-byte code units. Storing those two bytes in the two possible orders gives UTF-16BE and UTF-16LE.

But I find that the Ubuntu gedit text editor offers an encoding named plain UTF-16, alongside UTF-16BE and UTF-16LE. With a C test program I found that my computer is little-endian, and I confirmed that its UTF-16 output is the same encoding as UTF-16LE.
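A quick way to reproduce that check in Python (a minimal sketch; it assumes CPython, whose plain "utf-16" codec writes a BOM in the platform's native byte order):

    import sys

    print(sys.byteorder)            # 'little' on this machine
    print("A".encode("utf-16"))     # b'\xff\xfeA\x00' -- BOM plus native (LE) order
    print("A".encode("utf-16-le"))  # b'A\x00'         -- no BOM, order forced to LE
    print("A".encode("utf-16-be"))  # b'\x00A'         -- no BOM, order forced to BE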

Also: a multi-byte value (such as an integer) can be stored in either of two byte orders, depending on whether the computer is little-endian or big-endian. Little-endian computers produce little-endian values in hardware (except for values produced by Java, which always uses big-endian form).

Given that text can be saved as either UTF-16LE or UTF-16BE on my little-endian computer: are characters produced one byte at a time (like an ASCII string; see reference [3]), and is the endianness of UTF-16 just a human-defined convention, rather than a result of big-endian machines writing big-endian UTF-16 while little-endian machines write little-endian UTF-16?

  1. http://www.ibm.com/developerworks/aix/library/au-endianc/
  2. http://teaching.idallen.com/cst8281/10w/notes/110_byte_order_endian.html
  3. ASCII strings and endianness
  4. Is it true that endianness only affects the memory layout of numbers, but not strings? (a post on the relation between string endianness and machine endianness)
Intramural asked 11/4, 2016 at 13:24 Comment(3)
"UTF-16" without qualification is Big Endian by default – but this doesn't mean that all applications behave according to the specification.Liquidation
@一二三 Thank you! I am alert to the difference between a character and a value. In a C# test, an integer saved on a little-endian machine is little-endian, and it cannot be read correctly when copied to a big-endian machine because the byte order is reversed. But for a multi-byte character in C#, does the same byte-order reversal happen after copying from one machine to the other?Intramural
@一二三: That's not quite true. UTF-16 without a BOM is big-endian by default, but it will normally have a BOM which defines endianness.Ger

"is endian of UTF-16 the computer's endianness?"

The impact of your computer's endianness can be looked at from the point of view of a writer or a reader of a file.

If you are reading a file in a -standard- format, then the kind of machine reading it shouldn't matter. The format should be well-defined enough that no matter what the endianness of the reading machine is, the data can still be read correctly.

That doesn't mean the format can't be flexible. With "UTF-16" (when a "BE" or "LE" disambiguation is not used in the format name) the definition allows files to be marked as either big endian or little endian. This is done with something called the "Byte Order Mark" (BOM) in the first two bytes of the file:

https://en.wikipedia.org/wiki/Byte_order_mark

The existence of the BOM gives options to the writer of a file. They might choose to write out the most natural endianness for a buffer in memory, and include a BOM that matched. This wouldn't necessarily be the most efficient format for some other reader. But any program claiming UTF-16 support is supposed to be able to handle it either way.

So yes--the computer's endianness might factor into the endianness choice of a BOM-marked UTF-16 file. Still...a little-endian program is fully able to save a file, label it "UTF-16" and have it be big-endian. As long as the BOM is consistent with the data, it doesn't matter what kind of machine writes or reads it.
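For instance, here is a minimal Python sketch (the file name is made up) of a little-endian machine deliberately writing a big-endian "UTF-16" file; since the BOM matches the data, any conforming reader can decode it:

    import codecs

    # Write big-endian UTF-16 with an explicit BE BOM -- works on any machine:
    with open("hello.txt", "wb") as f:
        f.write(codecs.BOM_UTF16_BE)          # b'\xfe\xff'
        f.write("hello".encode("utf-16-be"))

    # A UTF-16 reader honours the BOM regardless of its own endianness:
    with open("hello.txt", "rb") as f:
        print(f.read().decode("utf-16"))      # -> hello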

...what if there's no BOM?

This is where things get a little hazy.

On the one hand, RFC 2781 and the Unicode FAQ are clear. They say that a file in "UTF-16" format which starts with neither 0xFF 0xFE nor 0xFE 0xFF is to be interpreted as big-endian:

the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

Yet to know whether you have a UTF-16LE file, a UTF-16BE file, or a UTF-16 file with no BOM...you need metadata outside the file telling you which of the three it is. Because there's not always a place to put that metadata, some programs wound up using heuristics.

Consider something like this from Raymond Chen (2007):

You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,

cmd /u /c dir >results.txt

This generates a UTF-16LE file without a BOM.

That's a valid UTF-16LE file, but where would the "UTF-16LE" meta-label be stored? What are the odds someone passes that off by just calling it a UTF-16 file?
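If you do know out-of-band that such a file is little-endian, you have to say so explicitly when reading it; a Python sketch (using the file name from the command above):

    # results.txt has no BOM, so the exact byte order must be named:
    with open("results.txt", encoding="utf-16-le") as f:
        print(f.read())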

Empirically there are warnings about the term. The Wikipedia page for UTF-16 says:

If the BOM is missing, RFC 2781 says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)

And unicode.readthedocs.org says:

"UTF-16" and "UTF-32" encoding names are imprecise: depending of the context, format or protocol, it means UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, "UTF-16" usually means UTF-16-LE.

And further, the Byte-Order-Mark Wikipedia article says:

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian."

Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored.

When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for 16-bit characters in the ASCII range or just the space character (U+0020) is a method of determining the UTF-16 byte order.
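That last heuristic is easy to sketch in Python (a guess, not a guarantee; it assumes the text is dominated by ASCII-range characters):

    def guess_utf16_order(data):
        # In ASCII-range code units the zero byte comes second in LE
        # (b'A\x00') and first in BE (b'\x00A'); count both patterns.
        le = sum(1 for i in range(0, len(data) - 1, 2) if data[i + 1] == 0)
        be = sum(1 for i in range(0, len(data) - 1, 2) if data[i] == 0)
        return "utf-16-le" if le >= be else "utf-16-be"

    print(guess_utf16_order("mostly ascii text".encode("utf-16-le")))  # utf-16-le
    print(guess_utf16_order("mostly ascii text".encode("utf-16-be")))  # utf-16-be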

So despite the unambiguity of the standard, the context may matter in practice.

As @rici points out, the standard has been around for a while now. Still, it may pay to double-check files claimed to be "UTF-16". Or even to consider whether you might want to avoid many of these issues and embrace UTF-8...

"Should UTF-16 be considered harmful?"

Excurrent answered 11/4, 2016 at 13:46 Comment(9)
UTF-16 is precisely defined in the Unicode standard (at unicode.org) which imho is the default source of information about Unicode.Ger
@Ger if common practice is contrary to the spec, it would be foolish to ignore that fact. I think this answer tiptoes around the issue sufficiently.Monroemonroy
@MarkRansom I did edit to incorporate the-standard-says bit (which I didn't say initially). But it seems that when multiple sources have found it important to mention "interpretations are variable" that a just-the-spec answer which doesn't mention the variances is incomplete.Excurrent
@MarkRansom: My understanding is that MS utilities do prefix a BOM (Notepad certainly does), and MS strongly recommends this practice. That would conform to the standard for UTF-16; otherwise, the file would be UTF-16LE. (It might be mislabeled, of course. But generally Windows describes its files rather informally as "Unicode", which doesn't really make any claim about encoding scheme.)Ger
@rici, Windows describes its files as Unicode since at the time they began their support, UTF-16 (actually its predecessor UCS-2) was the only Unicode encoding. You're right about their consistency in using a BOM, even for UTF-8 where it isn't required and again runs counter to the standard.Monroemonroy
@mark: the current standard permits a BOM in utf-8, which means that if the entity starts with U+FEFF, that character is ignored. Windows conforms; newer Unix apps tend to conform but some don't ignore the BOM.Ger
@Ger Is there a relation between the endianness of UTF-16 (not BE or LE) and the endianness of the machine? In a C program, a two-byte integer has an endianness issue that depends on the machine. ASCII in C has no such issue: it is handled one byte at a time, with the first character always stored at the lowest memory address. However, one can find the argument that UTF-16 depends on the machine's endianness because UTF-16 is a multi-byte encoding (reference [1]), though without any test program, unlike the integer case (reference [2]).Intramural
@hao.zhou: If a stream has the format "UTF-16", then it either starts with a BOM or is big-endian. That's what "UTF-16" means. People might incorrectly call their files UTF-16 when they meant UTF-16LE. That would be an error. If a file is labelled as being "UTF-16", then it must conform to the standard. How the string is stored inside the computer is not relevant; internally (like integers) it probably has native endianness. The UTF formats are schemes for interchanging (Transmitting, which is where the T comes from) data between computers.Ger
@HostileFork: I'm all for using UTF-8, but I don't know that an application writer can avoid these issues and "embrace UTF-8" other than by rejecting any data they receive in other formats. That's the privilege of the application writer, of course, but it might be considered unfriendly by someone whose data is in a different format. My inclination would be to require correct labelling of data (although I might put a note in the documentation about the consequences of incorrect labelling).Ger

The Unicode encoding schemes are defined in section 3.10 of the Unicode standard. The standard defines seven encoding schemes:

  • 8 bit: UTF-8
  • 16 bit: UTF-16BE, UTF-16LE and UTF-16
  • 32 bit: UTF-32BE, UTF-32LE and UTF-32

In the case of the 16- and 32-bit encodings, the three variants differ in endianness, which may be explicit or indicated by starting the string with a Byte Order Mark (BOM) character, U+FEFF:

  • The LE variant is definitely little-endian; the low-order byte is encoded first. No BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
  • The BE variant is definitely big-endian; the high-order byte is encoded first. As with the LE variant, no BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
  • The variant without an endian mark may be big- or little-endian. Normally it will start with a BOM which defines the endianness. If there is no BOM, then big-endian encoding is assumed. (The sketch after this list shows the difference.)
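A small Python illustration of those rules (assuming CPython's codecs, where the plain "utf-16" codec consumes a leading BOM while the LE/BE codecs treat the same bytes as an ordinary character):

    bom = b"\xff\xfe"                     # U+FEFF serialized little-endian
    print(repr(bom.decode("utf-16")))     # ''       -- consumed as a byte order mark
    print(repr(bom.decode("utf-16-le")))  # '\ufeff' -- kept: a zero-width no-break space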

If you are going to use 16- or 32-bit encoding schemes for data serialization, it is generally recommended to use the unmarked variants with an explicit BOM. However, UTF-8 is a much more common data interchange format.

Although no endian marker is needed for UTF-8, it is permitted (but not recommended) to start a UTF-8 encoded string with a BOM; this can be used to differentiate between Unicode encoding schemes. Many Windows programs do this, and a U+FEFF at the beginning of a UTF-8 transmission should probably be treated as a BOM (and thus not as Unicode data).
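Python, for example, offers a "utf-8-sig" codec that applies exactly this treatment, dropping a leading U+FEFF if one is present:

    data = b"\xef\xbb\xbfhello"            # UTF-8 with a leading BOM
    print(repr(data.decode("utf-8")))      # '\ufeffhello' -- BOM kept as data
    print(repr(data.decode("utf-8-sig")))  # 'hello'       -- BOM recognised and dropped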

Ger answered 11/4, 2016 at 15:53 Comment(4)
Per Wikipedia: "If the BOM is missing, RFC 2781 says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)Excurrent
@HostileFork: If I were a wikipedian, I'd flag that quote (wherever it is; it's not in the BOM article) with "citation needed". What are those "many applications?" RFC2781 was written in 2000 when Unicode was at version 3.0; time has moved on and applications are much more standards aware than they used to be.Ger
@HostileFork I checked documents saved as UTF-16, UTF-16LE, and UTF-16BE in Ubuntu gedit, opening the files as binary and reading them byte by byte in Python. Indeed, the UTF-16 file has the bytes 0xFF 0xFE (a little-endian BOM, matching my machine) as its first two bytes, while the UTF-16BE and UTF-16LE files do not. That matches this answer, as far as I have seen. However, the question in my "Also" paragraph needs more thought: my little-endian machine produces little-endian UTF-16 with a little-endian BOM when I encode strings as UTF-16 in Python, so does the endianness depend on the machine?Intramural
@Intramural I've expanded my answer to try and address your question and incorporate more of the points raised here.Excurrent

No. Don't you see that little-endian computers receive packets from the Internet all the time, and those arrive in big-endian (network byte) order?

The encoding depends on how you write the bytes, not on what your architecture is.
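A short Python sketch of that point: the writer chooses the byte order explicitly, whatever the host CPU happens to be:

    import struct

    n = 0x0041                   # one UTF-16 code unit, the letter 'A'
    print(struct.pack("<H", n))  # b'A\x00' -- little-endian, on any machine
    print(struct.pack(">H", n))  # b'\x00A' -- big-endian, on any machine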

Chervil answered 11/4, 2016 at 13:32 Comment(2)
Could you help with this question: if I directly create a UTF-16 string in C, does the endianness of the string depend on the endianness of the machine? An ASCII string is saved byte by byte because each character is a single byte, but what about a UTF-16 string (plain UTF-16, not UTF-16BE or UTF-16LE; see rici's answer)? I am confused: the endianness of an integer in a C program depends on the machine's endianness, yet UTF-16BE, UTF-16LE, and UTF-16 can all be created directly in Python, so the endianness restriction is somehow lifted (both BE and LE can be created) while the question of plain UTF-16's endianness still remains.Intramural
The endianness of the string does not depend on the endianness of the machine. You can always reverse byte order on any machine, so it is always possible to create a big-endian file on a little-endian machine.Chervil
