What's the point of UTF-16?

I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then UTF-16 seems like a colossal waste of space compared to UTF-8. What are the advantages of UTF-16 over UTF-8 and UTF-32 and why do Windows and Java use it as their native encoding?

Rasmussen answered 13/3, 2011 at 20:28 Comment(2)
Perhaps you could rephrase your question to not be so subjective and argumentative? – Lewis
If only that were true for UTF-32... Play five minutes with combining characters (en.wikipedia.org/wiki/Combining_character) and then tell me how "random access" everything really is :-) – Euhemerize
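
To illustrate that comment with a quick Java sketch (Java being one of the systems under discussion here; the string is an arbitrary example): even an encoding with one code unit per code point cannot index user-perceived characters directly, because a grapheme may span several code points.

    import java.text.BreakIterator;

    public class GraphemeDemo {
        public static void main(String[] args) {
            String s = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT, renders as one "é"
            // Two code points...
            System.out.println(s.codePointCount(0, s.length())); // 2
            // ...but one user-perceived character (grapheme cluster).
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) graphemes++;
            System.out.println(graphemes); // 1
        }
    }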

When Windows NT was designed, UTF-16 didn't exist (NT 3.1 was released in 1993, while UTF-16 was born in 1996 with the Unicode 2.0 standard); there was instead UCS-2, which, at that time, was enough to hold every character available in Unicode, so the 1 code point = 1 code unit equivalence was actually true; no variable-length logic was needed for strings.

They moved to UTF-16 later, to support the whole Unicode character set; however, they couldn't move to UTF-8 or to UTF-32, because that would have broken binary compatibility in the API interface (among other things).

As for Java, I'm not really sure; since it was released in ~1995, I suspect that UTF-16 was already in the air (even if it wasn't standardized yet), but I think that compatibility with NT-based operating systems may have played some role in the choice (continuous UTF-8 <-> UTF-16 conversions for every call to a Windows API could introduce some slowdown).


Edit

Wikipedia explains that even for Java it went the same way: it originally supported UCS-2, but moved to UTF-16 in J2SE 5.0.

So, in general, when you see UTF-16 used in some API/framework, it is because it started as UCS-2 (to avoid complications in the string-management algorithms) and later moved to UTF-16 to support the code points outside the BMP, while still maintaining the same code unit size.

Jacy answered 13/3, 2011 at 20:36 Comment(0)

None of the replies indicating an advantage of UTF-16 over UTF-8 make any sense, except for the backwards-compatibility reply.

Well, there are two caveats to my comment.

Piero states: "UTF-16 covers the entire BMP with single units - So unless you have a need for the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character."

Caveat 1)

If you can be certain that your application will NEVER need any character outside the BMP, and that any library code you write for it will NEVER be used with an application that needs such a character, then you could use UTF-16 and write code that makes the implicit assumption that every character is exactly two bytes long.

That seems exceedingly dangerous (actually, stupid).

If your code assumes that all UTF-16 characters are two bytes in length, and your program interacts with an application or library where there is a single character outside of the BMP, then your code will break. Code that examines or manipulates UTF-16 must be written to handle the case of a UTF-16 character requiring more than 2 bytes; therefore, I am "dismissing" this caveat.

UTF-16 is not simpler to code for than UTF-8 (code for both must handle variable-length characters).
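
To make that concrete, here is a minimal Java sketch (the string literal is just an example): code that treats every 16-bit unit as one character miscounts anything outside the BMP, while code-point-aware code does not.

    public class CountDemo {
        public static void main(String[] args) {
            // U+1D122 (MUSICAL SYMBOL F CLEF) needs a surrogate pair in UTF-16.
            String s = "clef: \uD834\uDD22";
            // Naive fixed-length view: counts UTF-16 code units.
            System.out.println(s.length());                      // 8
            // Surrogate-aware view: counts Unicode code points.
            System.out.println(s.codePointCount(0, s.length())); // 7
        }
    }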

Caveat 2)

UTF-16 MIGHT be more computationally efficient, under some circumstances, if suitably written.

Like this: Suppose that certain long strings are seldom modified, but often examined (or better, never modified once built - i.e., a string builder creating unmodifiable strings). A flag could be set for each string, indicating whether the string contains only "fixed length" characters (i.e., contains no characters that are not exactly two bytes in length). Strings for which the flag is true could be examined with optimized code that assumes fixed length (2 byte) characters.
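
A rough Java sketch of that idea (the class and its fields are hypothetical, just to show the shape of the optimization): compute the flag once when the immutable string is built, then take the fast path whenever it holds.

    public final class FlaggedString {
        private final String value;    // never modified once built
        private final boolean bmpOnly; // true iff no surrogate pairs present

        public FlaggedString(String value) {
            this.value = value;
            // One scan at construction: any surrogate means variable length.
            this.bmpOnly = value.chars()
                                .noneMatch(c -> Character.isSurrogate((char) c));
        }

        // Returns the n-th code point.
        public int codePointAt(int n) {
            if (bmpOnly) {
                return value.charAt(n); // O(1): one char per code point
            }
            // Slow path: surrogate-aware walk from the start.
            return value.codePointAt(value.offsetByCodePoints(0, n));
        }
    }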

How about space-efficiency?

UTF-16 is, obviously, more efficient for A) characters for which UTF-16 requires fewer bytes to encode than does UTF-8.

UTF-8 is, obviously, more efficient for B) characters for which UTF-8 requires fewer bytes to encode than does UTF-16.

Except for very "specialized" text, it's likely that count(B) far exceeds count(A).
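
One way to check which case your own text falls into (a quick sketch; the sample strings are arbitrary): encode the same string both ways and compare byte counts. StandardCharsets.UTF_16LE is used here because StandardCharsets.UTF_16 prepends a BOM.

    import java.nio.charset.StandardCharsets;

    public class SizeDemo {
        public static void main(String[] args) {
            String[] samples = { "hello world", "日本語のテキスト", "<p>markup</p>" };
            for (String s : samples) {
                int utf8  = s.getBytes(StandardCharsets.UTF_8).length;
                int utf16 = s.getBytes(StandardCharsets.UTF_16LE).length;
                System.out.printf("%s -> UTF-8: %d bytes, UTF-16: %d bytes%n",
                                  s, utf8, utf16);
            }
        }
    }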

Jarred answered 5/1, 2014 at 9:11 Comment(5)
"Except for very "specialized" text, it's likely that count(B) far exceeds count(A)." Most of Eastern Asia might disagree, as the majority of their languages fall into 3-byte UTF-8.X
See utf8everywhere.org. Even in their worst case, UTF-16 only saved 20%. If storage space is important to you, you should be using an actual compression algorithm, not using storage savings to excuse a bad choice of encoding. In the vast majority of cases you will be using markup languages like XML/HTML, JSON or Markdown to format your content, all of which operate on ASCII. – Forbid
"That seems exceedingly dangerous (actually, stupid)", "[if] your program interacts with an application or library where there is a single character outside of the BMP, then your code will break". Well, that's the general nature of solutions that are defined to solve a certain subset of a problem, rather than the whole of it (for any number of possibly valid reasons). There's nothing inherently "exceedingly dangerous", let alone "stupid" about it. It's called a "trade-off".Gloria
@Gloria It's a trade-off with a rather high likelihood of failing, for a very small benefit. If you need fixed-size units, use UTF-32; if you need the best use of memory, use UTF-8. Needing a marginally smaller memory footprint while being willing to make a large assumption about the text you'll have to deal with seems like the stupidest way to gain very little and potentially lose quite a lot. It's like chopping off a leg so you have more weight allowance for shoes on a flight, when an extra bag costs $40. – Annadiane
@PeterR Arguing with people who use vivid exaggerations and emotionally loaded adjectives to describe boring technical choices, without acknowledging the possibility of valid contexts, niches, and uses, while calling everyone who makes them stupid, would be utterly futile. I just wanted to have this on the record for a more balanced discussion. – Gloria

UTF-16 covers the entire BMP with single units - So unless you have a need for the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character. UTF-32 takes more space, UTF-8 requires variable-length support.

Piero answered 13/3, 2011 at 20:32 Comment(4)
I'll add the necessary wiki reference to UTF-32, which explains all the disadvantages: en.wikipedia.org/wiki/UTF-32/UCS-4 – Euhemerize
@Piero - You might as well say UTF-8 is effectively one byte per character... unless you need rare characters outside ASCII. In reality, UTF-16 is just as variable-length as UTF-8. – Barrel
I work with Japanese characters (or French); we're actually thinking of using UTF-16. I would have liked this discussion to include how variable those are and whether UTF-16 can be more optimized for different degrees of non-ASCII-ness. – Playreader
UTF-8 covers the entire ASCII range with single units - So unless you have a need for the rarer characters, UTF-8 is effectively 1 byte per character, not variable-length. – Lagoon

UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs.

The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level. So the string consisting of the single character U+1D122 (MUSICAL SYMBOL F CLEF) gets encoded in Java as "\ud834\udd22", and "\ud834\udd22".length() == 2 (not 1). So it's kind of a hack, but from the system's point of view nothing is variable length: it deals only in fixed-size code units.
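
Those numbers are easy to verify (a small sketch): Character.toChars splits a supplementary code point into its surrogate pair.

    public class SurrogateDemo {
        public static void main(String[] args) {
            String clef = new String(Character.toChars(0x1D122));
            System.out.println(clef.length()); // 2 code units for 1 code point
            System.out.printf("\\u%04X \\u%04X%n",
                    (int) clef.charAt(0), (int) clef.charAt(1)); // \uD834 \uDD22
        }
    }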

The advantage of UTF-16 over UTF-8 is that one would give up too much if the same hack were used with UTF-8: at the byte (code unit) level, almost all non-ASCII text would span multiple units, whereas in UTF-16 everything in the BMP stays a single unit.

Pietrek answered 13/3, 2011 at 20:48 Comment(1)
Methinks (yes, methinks :-) ) that the world would be better off if programmers had to know about variable-length characters instead of discovering them "casually". As it is now, a programmer could go years without learning that a code point can take 2 code units; if everything were UTF-8, they could keep their head in the sand for only a few months :-) – Euhemerize

UTF-16 is generally used as a direct mapping to multi-byte character sets, i.e. only the original 0-0xFFFF assigned characters.

This gives you the best of both worlds: you have a fixed character size, but you can still print all the characters anyone is likely to use (orthodox Klingon religious scripts excepted).

Db answered 13/3, 2011 at 20:32 Comment(2)
Unless they're from Hong Kong, as even basic Cantonese sentences can require characters outside of the BMP. Besides which, there's no fun like the fun that can come from having a program reject some valid characters for no reason the end user can see. – Imputation
As of today, emoji should hit everyone, regardless of language - one simply has to expect/support surrogates. – Maas
