Why is Unicode restricted to 0x10FFFF?
Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent Unicode above this code point - for e.g. 0x10FFFF + 0x000001 = 0x110000 - through any encoding schemes like UTF-16, UTF-8?

Filip answered 6/9, 2018 at 11:43

It's because of UTF-16. Characters outside the Basic Multilingual Plane (BMP) are represented in UTF-16 by a surrogate pair, with the first code unit (CU) in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each CU carries 10 bits of the code point, giving 20 bits of data in total (0x100000 code points), which are split into 16 planes of 2^16 characters each. The BMP accounts for the remaining 0x10000 characters (code points 0–0xFFFF).
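As an illustration (not part of the original answer), the bit layout described above can be sketched in Python; `to_surrogate_pair` is a hypothetical helper name:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000              # 20-bit value: 0x00000..0xFFFFF
    high = 0xD800 + (v >> 10)     # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

Running it on U+10FFFF yields (0xDBFF, 0xDFFF), the very last surrogate pair.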

Therefore the total number of code points is 17×2^16 = 0x100000 + 0x10000 = 0x110000, which allows for code points from 0 to 0x110000 − 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, which means the last code point a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
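The counting argument can be checked numerically (a quick Python sketch, not part of the original answer):

```python
bmp = 0x10000            # code points U+0000..U+FFFF
supplementary = 2 ** 20  # 20 bits carried by a surrogate pair
total = bmp + supplementary

print(hex(total))             # 0x110000 code points in all
print(total == 17 * 2 ** 16)  # True: the 17-plane formulation
print(hex(total - 1))         # 0x10ffff, the last valid code point
```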

The Unicode Character Encoding Stability Policies guarantee that no code point above that will ever be assigned:

The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.

Historically, UTF-8 allowed code points up to U+7FFFFFFF using sequences of up to 6 bytes, and the original UCS-4 encoding (the predecessor of UTF-32) covered the same 31-bit range. However, because of the limit in UTF-16, UTF-8 was later restricted to at most 4 bytes, resulting in the same range as UTF-16:

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

https://en.wikipedia.org/wiki/UTF-8#History
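As a side note (an illustration, not part of the quoted history), modern implementations enforce this limit; Python's `chr` and `str.encode` are one example:

```python
# U+10FFFF needs the maximum 4-byte UTF-8 sequence...
print(len(chr(0x10FFFF).encode("utf-8")))  # 4

# ...and one code point past the limit is rejected outright.
try:
    chr(0x110000)
except ValueError as err:
    print(err)  # chr() arg not in range(0x110000)
```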

The same restriction was applied to UTF-32:

In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32

https://en.wikipedia.org/wiki/UTF-32

You can also read this more detailed answer.

Miserable answered 6/9, 2018 at 12:15
To whoever downvoted this: is it too hard to leave a comment if it's wrong? – Miserable
I don't know who downvoted it, but your answer is wrong, though not by a huge margin. The BMP covers the ranges 0x0000–0xD7FF and 0xE000–0xFFFF. Characters in the BMP are represented by one UTF-16 code unit. When 2 code units are used, we have 20 bits to encode a character, so this set has values 0 to 0xFFFFF. Since these values should start after the BMP, whose last value is 0xFFFF, we add 0x10000 to each value in this set to index properly into the Unicode character set. Hence 0x10000 + 0xFFFFF = 0x10FFFF. – Higley
I voted this down because it doesn't answer the question. WHY would Unicode ever create something as backward-thinking as UTF-16? They were supposed to overcome the problem of fixed-size code-point ranges. What happens when you humans make first contact and are expected to add support for the 1500000+ new scripts from the 10000+ new civilizations? Are you just going to say "no, sorry, that is not UTF-16 compatible"? This could prompt discussions about the appropriateness of the future of human existence on Earth. – Underling
@Underling It already answers the OP's question: it's because of UTF-16. "WHY would unicode ever create something as backward thinking as UTF-16" is your question, not the OP's, and it's a completely different one. Downvoting this because it doesn't answer your question is silly. If you want, then ask the other question. The oversight of the Unicode committee has nothing to do with this. – Miserable
@Miserable "Why sky blue? Because sky has air! [enlightenment achieved]" – while some may be satisfied with an answer of "sky has air", I am not one of them. However, an answer of "Because sky has air, but I do not think anyone knows why air blue." would be an acceptable answer, if a bit hard to believe. – Underling
@Underling That's a silly comparison. That the sky is blue is a scientific fact, but the choice of the Unicode committee isn't scientific at all. If they didn't state the rationale for why they thought 16 bits were enough, then no one can answer that; people can only guess. In any case, the OP's question has already been fully answered. Your question is valid, but it's completely out of scope. – Miserable
@Miserable Yes, it is silly, to demonstrate my point of view. How about this: you have a spouse and they divorce you with the explanation "you annoyed me" and then they ghost you; would you be satisfied with such an answer? I despise that Unicode is so contradictorily limited, and I really want a better answer than "Unicode has that problem because UTF-16 has that problem". TLDR minor nitpick: that UTF-16 was created with a limitation is an empirical fact just a step short of scientific law, but that does not mandate that the laws of physics explain their purpose to you. – Underling
@Miserable I just want acknowledgement that UTF-16 is stupid, or to have the existence of UTF-16 justified beyond merely "a compromise [....] because some manufacturers were already heavily invested in 2-byte-per-character technology" (source: Wikipedia), and an explanation of why that, alone, is reason to mandate that all future standards must also suffer the same limitation because of a temporary compromise. ASCII is limited to 8 bits, so why not mandate that Unicode must forever be limited to 8 bits as a compatibility compromise? That is just as stupid as what Unicode has done, from my perspective. – Underling
@Underling Stupid or not is completely irrelevant to the question. This is not a forum for discussion, just a Q&A site. Go ahead and ask the Unicode committee if you want that breaking change, or go somewhere else to discuss. – Miserable
@Underling You are making up nonsense. Unicode did not set out with the goal of making a limitless text encoding system, because that would have little purpose. Their goal was to enable the world's languages to all be represented on a computer in a compatible way, which they've achieved. While UTF-16 might be stupid at virtually every level, that's because Microsoft is stupid at virtually every level, but even I would agree that supporting Microsoft has more value right now, here on Earth, than waiting for aliens to invade so we can encode all of their glyphs too. – Expedition
@user3338098: A Unicode principle is to be easily backward compatible (which is the reason Unicode gained popularity). Another initial design principle: all characters can be stored in 16 bits (hence the Korean exception, the preference for combining characters, etc.). Many systems used UCS-2, so the new system had to be compatible with them. See Windows, see JavaScript, etc. (you may notice it in how they count "characters"). UTF-16 is really the reason. – Brezhnev

© 2022 - 2024 — McMap. All rights reserved.