NVARCHAR storing characters not supported by UCS-2 encoding on SQL Server
According to SQL Server's documentation (and the legacy documentation), an nvarchar field without an _SC collation should use the UCS-2 ENCODING:

Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.

It also states that these data types store only the subset of characters supported by the UCS-2 ENCODING. From Wikipedia's description of UCS-2:

UCS-2, uses a single code value [...] between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP that represents a character. UCS-2 cannot represent code points outside the BMP.

So, by the specifications above, it seems that I shouldn't be able to store an emoji like 😍, which has a code point of 0x1F60D (128525 in decimal, well above the 65535 limit of UCS-2). But on SQL Server 2008 R2 and SQL Server 2019 (both with the default SQL_Latin1_General_CP1_CI_AS COLLATION), in an nvarchar field, it is stored and returned perfectly (although comparisons with LIKE or = don't work with it):

[screenshot of the query result omitted]

SSMS doesn't render the emoji correctly, but here is the value copied and pasted from the query result: 😍
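
For reference, here is a minimal script along the lines of the test described above (it is not part of the original question; the temp-table and column names are made up, and the column collation is the field collation from the notes below):

    -- Column uses a non-SC collation, as in the original test
    CREATE TABLE #EmojiTest (Val NVARCHAR(10) COLLATE Latin1_General_CS_AS);
    INSERT INTO #EmojiTest (Val) VALUES (N'😍');

    -- The value round-trips intact; the bytes are the UTF-16 LE surrogate pair
    SELECT Val, CONVERT(VARBINARY(8), Val) AS Bytes   -- 😍, 0x3DD80DDE
    FROM #EmojiTest;

    -- But comparisons under this non-SC collation don't treat it as one character;
    -- results may be surprising (see the answer below)
    SELECT COUNT(*) FROM #EmojiTest WHERE Val = N'😍';
    SELECT COUNT(*) FROM #EmojiTest WHERE Val LIKE N'%😍%';

    DROP TABLE #EmojiTest;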

So my questions are:

  1. Is the nvarchar field really using UCS-2 on SQL Server 2008 R2 (I also tested on SQL Server 2019, with the same non-_SC collations, and got the same results)?

  2. Is Microsoft's nchar/nvarchar documentation misleading when it says "then these data types store only the subset of character data supported by the UCS-2 character encoding"?

  3. Does the UCS-2 ENCODING support code points beyond 65535 or not?

  4. How was SQL Server able to correctly store and retrieve this field's data when it's outside what the UCS-2 ENCODING supports?

NOTE: The server's collation is SQL_Latin1_General_CP1_CI_AS and the field's collation is Latin1_General_CS_AS.
NOTE 2: The original question described tests on SQL Server 2008. I tested on SQL Server 2019, with the same respective COLLATIONs, and got the same results.
NOTE 3: Every other character I tested outside the range supported by UCS-2 behaves the same way. Some of them: 𝕂, 😂, 𨭎, 𝕬, 𝓰

Catalog answered 3/9, 2020 at 2:2

There are several clarifications to make here regarding the MS documentation snippets posted in the question, the sample code, the questions themselves, and statements made in the comments on the question. Most of the confusion can be cleared up, I believe, by the information provided in the following post of mine:

How Many Bytes Per Character in SQL Server: a Completely Complete Guide

First things first (which is the only way it can be, right?): I'm not insulting the people who wrote the MS documentation as SQL Server alone is a huge product and there is a lot to cover, etc, but for the moment (until I get a chance to update it), please read the "official" documentation with a sense of caution. There are several misstatements regarding Collations / Unicode.

  1. UCS-2 is an encoding that handles a subset of the Unicode character set. It works in 2-byte units. With 2 bytes, you can encode values 0 - 65535. This range of code points is known as the BMP (Basic Multilingual Plane). The BMP is all of the characters that are not Supplementary Characters (because those are supplementary to the BMP), but it does contain a set of code points that are exclusively used to encode Supplementary Characters in UTF-16 (i.e. the 2048 surrogate code points). This is a complete subset of UTF-16.

  2. UTF-16 is an encoding that handles all of the Unicode character set. It also works in 2-byte units. In fact, there is no difference between UCS-2 and UTF-16 regarding the BMP code points and characters. The difference is that UTF-16 makes use of those 2048 surrogate code points in the BMP to create surrogate pairs, which are the encodings for all Supplementary Characters. While Supplementary Characters are 4 bytes (in UTF-8, UTF-16, and UTF-32), they are really two 2-byte code units when encoded in UTF-16 (likewise, they are four 1-byte units in UTF-8, and one 4-byte unit in UTF-32).

  3. Since UTF-16 merely extends what can be done with UCS-2 (by actually defining the usage of the surrogate code points), there is absolutely no difference in the byte sequences that can be stored in either case. All 2048 surrogate code points used to create Supplementary Characters in UTF-16 are valid code points in UCS-2, they just don't have any defined usage (i.e. interpretation) in UCS-2.

  4. NVARCHAR, NCHAR, and the deprecated-so-do-NOT-use-it-NTEXT datatypes all store Unicode characters encoded in UCS-2 / UTF-16. From a storage perspective there is absolutely NO difference. So, it doesn't matter if something (even outside of SQL Server) says that it can store UCS-2. If it can do that, then it can inherently store UTF-16. In fact, while I have not had a chance to update the post linked above, I have been able to store and retrieve, as expected, emojis (most of which are Supplementary Characters) in SQL Server 2000 running on Windows XP. There were no Supplementary Characters defined until 2003, I think, and certainly not in 1999 when SQL Server 2000 was being developed. In fact (again), UCS-2 was only used in Windows / SQL Server because Microsoft pushed ahead with development prior to UTF-16 being finalized and published (and as soon as it was, UCS-2 became obsolete).

  5. The only difference between UCS-2 and UTF-16 is that UTF-16 knows how to interpret surrogate pairs (comprised of a pair of surrogate code points, so at least they're appropriately named). This is where the _SC collations (and, starting in SQL Server 2017, also the version _140_ collations, which include support for Supplementary Characters so none of them have _SC in their name) come in: they allow the built-in SQL Server functions to correctly interpret Supplementary Characters. That's it! Those collations have nothing to do with storing and retrieving Supplementary Characters, nor do they even have anything to do with sorting or comparing them (even though the "Collation and Unicode Support" documentation says specifically that this is what those collations do; another item on my "to do" list to fix). For collations that have neither _SC nor _140_ in their name (though the new-as-of-SQL Server 2019 Latin1_General_100_BIN2_UTF8 might be a grey area; at least, I remember there being some inconsistency either there or with the Japanese_*_140_BIN2 collations), the built-in functions only handle BMP code points (i.e. UCS-2).

  6. Not "handling" Supplementary Characters means not interpreting a valid sequence of two surrogate code points as actually being a singular supplementary code point. So, for non-"SC" collations, BMP surrogate code point 1 (B1) and BMP surrogate code point 2 (B2) are just those two code points, neither one of which is defined, hence they appear as two "nothing"s (i.e. B1 followed by B2). This is why it is possible to split a Supplementary Character in two using SUBSTRING / LEFT / RIGHT because they won't know to keep those two BMP code points together. But an "SC" collation will read those code points B1 and B2 from disk or memory and see a single Supplementary code point S. Now it can be handled correctly via SUBSTRING / CHARINDEX / etc.

  7. The NCHAR() function (not the datatype; yes, poorly named function ;) is also sensitive to whether or not the default collation of the current database supports Supplementary Characters. If yes, then passing in a value between 65536 and 1114111 (the Supplementary Character range) will return a non-NULL value. If not, then passing in any value above 65535 will return NULL (this, too, is shown in the sketch after this list). (Of course, it would be far better if NCHAR() just always worked, given that storing / retrieving always works, so please vote for this suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation ).

  8. Fortunately, you don't need an "SC" collation to output a Supplementary Character. You can either paste in the literal character, or convert the UTF-16 Little Endian encoded surrogate pair, or use the NCHAR() function to output the surrogate pair. The following works in SQL Server 2000 (using SSMS 2005) running on Windows XP:

    SELECT N'💩', -- 💩
    CONVERT(VARBINARY(4), N'💩'), -- 0x3DD8A9DC
    CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- 💩 (regardless of DB Collation)
    NCHAR(0xD83D) + NCHAR(0xDCA9) -- 💩 (regardless of DB Collation)
    

    For more details on creating Supplementary Characters when using non-"SC" collations, please see my answer to the following DBA.SE question: How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?

  9. None of this affects what you see. If you store a code point, then it's there. How it behaves (sorting, comparison, etc.) is controlled by collations. But how it appears is controlled by fonts and the OS. No font can contain all characters, so different fonts contain different sets of characters, with a lot of overlap on the more widely used characters. However, if a font has a particular byte sequence mapped, then it can display that character. This is why the only work required to get Supplementary Characters displaying correctly in SQL Server 2000 (using SSMS 2005) running on Windows XP was to add a font containing the characters and make one or two minor registry edits (no changes to SQL Server).

  10. Supplementary Characters in SQL_* collations and collations without a version number in their name have no sort weights. Hence, they all equate to each other as well as to any other BMP code points that have no sort weights (including "space" (U+0020) and "null" (U+0000)); this is also shown in the sketch after this list. They started to fix this in the version _90_ collations.

  11. SSMS has nothing to do with any of this, outside of possibly needing the font used for the query editor and/or grid results and/or errors + messages changed to one that has the desired characters. (SSMS doesn't render anything outside of maybe spatial data; characters are rendered by the display driver + font definitions + maybe something else).
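
To make items 6, 7, and 10 concrete, here is a minimal sketch. It is not from the original answer; the collation names are just examples of a non-SC collation, an "_SC" collation, and a SQL_* collation, and how an unpaired surrogate renders depends on the client:

    -- Item 6: a non-"SC" collation sees two undefined surrogate code points...
    SELECT LEN(N'💩' COLLATE Latin1_General_100_CI_AS);                -- 2
    SELECT SUBSTRING(N'💩' COLLATE Latin1_General_100_CI_AS, 1, 1);    -- an unpaired high surrogate (half the pair)

    -- ...while an "_SC" collation sees one Supplementary Character:
    SELECT LEN(N'💩' COLLATE Latin1_General_100_CI_AS_SC);             -- 1
    SELECT SUBSTRING(N'💩' COLLATE Latin1_General_100_CI_AS_SC, 1, 1); -- 💩 (kept intact)

    -- Item 7: NCHAR() looks at the CURRENT DATABASE's default collation:
    SELECT NCHAR(0x01F4A9);  -- 💩 if that collation is "_SC" / "_140_", otherwise NULL

    -- Item 10: with no sort weights for Supplementary Characters, they compare as equal:
    SELECT CASE WHEN N'💩' = N'😍' COLLATE SQL_Latin1_General_CP1_CI_AS
                THEN 'equal' ELSE 'not equal' END;  -- 'equal'

In every case the stored bytes are identical; only the interpretation by the built-in functions and comparisons changes.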

Therefore, the following statement in the documentation (from the question):

If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.

is both nonsensical and incorrect. They were probably intending to say the datatypes would only store a subset of the UTF-16 encoding (since UCS-2 is the subset). Also, even if it said "UTF-16 character encoding" it would still be wrong because the bytes that you pass in will be stored (assuming enough free space in the column or variable).

Bareback answered 3/9, 2020 at 3:21 Comment(10)
Amazing answer! I really appreciate all the effort and time spent on it. Thank you so much for sharing your knowledge! But... – Catalog
Still, there is one last thing I didn't get about the low-level stuff: UTF-16 uses surrogate pairs to extend the possible code points beyond 2 bytes (the BMP) to 4 bytes. OK, that's how it can store the big code point 0x1F60D. How can UCS-2, with only 2 bytes, store a value so much bigger than 2 bytes? Does the engine split the big code point 0x1F60D and then use two sets of 2 bytes? – Catalog
I mean, when I pass the 0x1F60D value, which is 4 bytes long, to an encoding that only maps up to 2 bytes, how does it correctly split it?! How can it even accept a number (i.e. code point) that is so much bigger? – Catalog
I learned (and it's stated everywhere) that the encoding is responsible for mapping the code point to a byte sequence, and vice versa. So how can an encoding like UCS-2, which is limited to 2 bytes, handle (even without interpreting it) a 4-byte value? – Catalog
That expected behavior happens if I use an even more limited encoding, like the 1-byte Windows-1252. It simply cannot store the big value 0x1F60D, regardless of how much space I have available. It won't map 0x1F60D to any valid byte sequence... – Catalog
And in the other answer at the link you provided (another great answer, by the way), you stated that "[...] AND, the 65536 - 1114111 (0x10000 - 0x10FFFF) range of UTF-16 is constructed from two Code Points in the UCS-2 range (ranges 0xD800 - 0xDBFF and 0xDC00 - 0xDFFF, specifically)". OK, that makes sense, but only if I'm mapping from UCS-2 to UTF-16. As you said, UTF-16 is aware that those bytes need to be combined. But how can it work the other way around? From a big code point of 0x1F60D to two smaller code points? Who does this conversion? – Catalog
@Catalog You are quite welcome. Both UTF-16 and UCS-2 work in 2-byte units. Supplementary characters (e.g. 0x1F60D) are not encoded into UCS-2 as there was never a translation for them. BUT, UTF-16 merely translates that Supplementary code point into two BMP / UCS-2 code units. Don't think of Sup. Chars as 4-bytes, think of them as a total of 4 bytes. UTF-8 uses four 1-byte code units to encode Sup. Chars, but it's still an 8-bit encoding, not a 32-bit encoding. UCS-2 doesn't know what Code Points above U+FFFF are, but it has no problem storing U+D83D and U+DCA9, (continued...) – Bareback
the two BMP code points used to encode the Supplementary code point of U+1F4A9. This is because 0x3DD8A9DC (the UCS-2 / UTF-16 Little Endian encoding of those two BMP code points) is valid in UCS-2: it's a high-surrogate followed by a low-surrogate. The difference between UTF-16 and UCS-2 is that UCS-2 doesn't see that combination of encoded code points as being anything special, anything other than two surrogate code points that have no meaning. UTF-16 sees that combination and reports back that it's not two code points, but actually one code point that happens to be Supplementary. (cont) – Bareback
Keep in mind that code points are not code units. Code points are just numbers that designate a particular character. They imply nothing regarding how they are stored (i.e. encoded) in memory or on disk. The number of bytes of the code point is irrelevant (technically, the highest code point, U+10FFFF, would only be 3 bytes, if that number represented bytes). Code units are the 1-, 2-, or 4-byte blocks used to physically represent code points by the UTF-8, UTF-16, and UTF-32 encodings, respectively. Code points and code units are the same for BMP characters in UTF-16 and UCS-2, (cont) – Bareback
which can be very confusing, but they aren't the same thing. Finally, the reason why you can store Supplementary Characters in "UCS-2" is because a) you are supplying the encoding (i.e. 0x3DD8A9DC or NCHAR(0xD83D) + NCHAR(0xDCA9)), and b) bytes are just bytes and the font/display driver/OS handle displaying the bytes presented -- they don't care how those bytes were stored. Remember, getting this to work in Windows XP didn't require a change to SSMS, only a change to the registry and a font containing the characters. – Bareback
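
To illustrate the conversion the last few comments describe, here is the standard UTF-16 surrogate-pair arithmetic for U+1F60D, written as a small T-SQL sketch (the variable names are arbitrary; per item 8 in the answer above, building the pair with NCHAR() works regardless of the database's collation):

    DECLARE @cp INT = 128525;                -- U+1F60D (0x01F60D), the 😍 code point
    DECLARE @v  INT = @cp - 65536;           -- subtract 0x10000 -> 62989
    DECLARE @hi INT = 55296 + (@v / 1024);   -- 0xD800 + top 10 bits    -> 0xD83D (high surrogate)
    DECLARE @lo INT = 56320 + (@v % 1024);   -- 0xDC00 + bottom 10 bits -> 0xDE0D (low surrogate)

    SELECT NCHAR(@hi) + NCHAR(@lo) AS Emoji,                          -- 😍
           CONVERT(VARBINARY(4), NCHAR(@hi) + NCHAR(@lo)) AS Utf16LE; -- 0x3DD80DDE

UTF-16 defines this mapping; UCS-2 simply stores the two resulting code units without recognizing them as one character.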
